ReStore: Reusing results of MapReduce jobs

Ashraf Aboulnaga; Iman Elghandour

doi:10.5339/qfarf.2012.AESNP3

Abstract

'Big Data' analysis has become a central activity in business and science. Companies such as Facebook, Yahoo, and Google now own petabyte-scale data warehouses that are accessed on a regular basis, and terabyte-scale data warehouses are now common in many smaller organizations. This analysis is mostly supported by the MapReduce programming and execution model and its implementations, most notably Hadoop, which is now one of the major big data platforms. Users of MapReduce often have analysis tasks that are too complex to express as a single MapReduce job. Instead, they use high-level query languages such as Pig Latin, Hive, or Jaql, whose compilers translate queries into workflows of MapReduce jobs. Each job in such a workflow produces an output that is stored in the distributed file system used by the MapReduce system (e.g., HDFS in the case of Hadoop), and these intermediate results are used as input by subsequent jobs in the workflow. The current practice is to delete these intermediate outputs once the workflow finishes executing. In our work, we developed ReStore, a system that improves the performance of workflows of MapReduce jobs generated from high-level query languages by storing the intermediate results of executed workflows and reusing them for future workflows submitted to the system. ReStore can be built on top of a dataflow language processor such as Pig, which translates queries into workflows of MapReduce jobs. Each of these MapReduce jobs has a physical query execution plan that contains one or more physical operators executed by the job. ReStore rewrites the MapReduce jobs in a submitted workflow at the level of the physical query execution plan in order to reuse job outputs previously stored in the system. ReStore also stores the outputs of executed jobs for future reuse, and creates more reuse opportunities by storing the outputs of parts of jobs (which we call sub-jobs). We have implemented ReStore as an extension to the Pig dataflow system on top of Hadoop, and we experimentally demonstrated significant speedups on queries from the PigMix benchmark.
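
The reuse idea described above can be illustrated with a small sketch. It is not ReStore's implementation: it assumes, for simplicity, that a physical plan is a linear sequence of operators, whereas real Pig plans are operator graphs. The names `Op`, `Repository`, `store`, and `rewrite`, as well as the path `/restore/out_1`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Op:
    """One physical operator in a simplified, linear plan (hypothetical type)."""
    name: str
    arg: str


Plan = Tuple[Op, ...]


class Repository:
    """Maps an already-executed plan prefix to the path of its stored output."""

    def __init__(self) -> None:
        self._stored: Dict[Plan, str] = {}

    def store(self, plan_prefix: Iterable[Op], path: str) -> None:
        # Record that the output of this job (or sub-job) is kept at `path`.
        self._stored[tuple(plan_prefix)] = path

    def rewrite(self, plan: Iterable[Op]) -> Plan:
        # Try the longest prefix first, so the largest stored result is reused.
        plan = tuple(plan)
        for end in range(len(plan), 0, -1):
            path = self._stored.get(plan[:end])
            if path is not None:
                # Replace the matched prefix with a load of the stored output.
                return (Op("LOAD", path),) + plan[end:]
        return plan  # nothing to reuse; run the plan unchanged


repo = Repository()
job = [
    Op("LOAD", "page_views"),
    Op("FILTER", "user IS NOT NULL"),
    Op("GROUP", "user"),
]
# Assume an earlier workflow stored the output of the first two operators
# (a sub-job) at this hypothetical path.
repo.store(job[:2], "/restore/out_1")
rewritten = repo.rewrite(job)
```

In this sketch, `rewritten` loads the stored sub-job output and then applies only the remaining `GROUP` operator, so the load and filter steps are not re-executed. A plan with no matching stored prefix is returned unchanged.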
