'Big Data' analysis has become a central activity in business and science. Companies such as Facebook, Yahoo, and Google now own petabyte-scale data warehouses that are accessed on a regular basis, and terabyte-scale data warehouses are now common in many smaller organizations. This big data analysis is mostly supported by the MapReduce programming and execution model and its implementations, most notably Hadoop, which is now one of the major big data platforms. Users of MapReduce often have analysis tasks that are too complex to express as one MapReduce job. Instead, they often use high-level query languages such as Pig Latin, Hive, or Jaql to express these tasks. The compilers of these query languages translate queries into workflows of MapReduce jobs. Each job in such a workflow produces an output that is stored in the distributed file system used by the MapReduce system (e.g., HDFS in the case of Hadoop). These intermediate results are used as input by subsequent jobs in the workflow. The current practice is to delete these intermediate outputs once the workflow has finished executing. In our work, we developed ReStore, a system that improves the performance of workflows of MapReduce jobs generated from high-level query languages by storing the intermediate results of executed workflows and reusing them for future workflows submitted to the system. ReStore can be built on top of a dataflow language processor such as Pig, which translates queries into workflows of MapReduce jobs. Each of these MapReduce jobs has a physical query execution plan that contains one or more physical operators executed by the job. ReStore rewrites the MapReduce jobs in a submitted workflow at the level of the physical query execution plan in order to reuse job outputs previously stored in the system. ReStore also stores the outputs of executed jobs for future reuse, and creates more reuse opportunities by storing the outputs of parts of jobs (which we call sub-jobs). We have implemented ReStore as an extension to the Pig dataflow system on top of Hadoop, and we experimentally demonstrated significant speedups on queries from the PigMix benchmark.
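The reuse idea described above can be illustrated with a minimal sketch. It is not ReStore's implementation: it assumes a physical plan is a simple linear sequence of operators (real plans are graphs), and the names `Repository`, `rewrite`, and the path `/restore/out_1` are hypothetical, chosen only for illustration. The sketch stores the output of an executed plan prefix (standing in for a sub-job) and rewrites a later plan to load that output instead of recomputing it.

```python
from typing import Dict, Optional, Tuple

# Assumption: an operator is (name, argument) and a plan is a linear tuple of operators.
Operator = Tuple[str, str]
Plan = Tuple[Operator, ...]


class Repository:
    """Toy store mapping an executed plan (or plan prefix) to its output path."""

    def __init__(self) -> None:
        self._outputs: Dict[Plan, str] = {}

    def store(self, plan: Plan, path: str) -> None:
        self._outputs[plan] = path

    def lookup(self, plan: Plan) -> Optional[str]:
        return self._outputs.get(plan)


def rewrite(plan: Plan, repo: Repository) -> Plan:
    """Replace the longest stored prefix of `plan` with a LOAD of its stored output."""
    for end in range(len(plan), 0, -1):
        path = repo.lookup(plan[:end])
        if path is not None:
            return (("LOAD", path),) + plan[end:]
    return plan  # nothing reusable: run the plan unchanged


repo = Repository()
# An earlier workflow executed LOAD + FILTER; its output was kept (hypothetical path).
repo.store(
    (("LOAD", "page_views"), ("FILTER", "user is not null")),
    "/restore/out_1",
)

# A later query shares that prefix and adds a GROUP operator.
query: Plan = (
    ("LOAD", "page_views"),
    ("FILTER", "user is not null"),
    ("GROUP", "user"),
)
rewritten = rewrite(query, repo)
```

Here `rewritten` starts with a load of the stored output and keeps only the `GROUP` operator to execute. Matching the longest prefix first is one simple policy; how ReStore itself matches plans and decides which outputs to keep is not specified in this abstract.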

