Motivated by the growing and successful adoption of MapReduce as an analytics engine on the cloud, this work characterizes the Map phase in Hadoop MapReduce to guide its configuration and improve overall performance. MapReduce is one of the most effective realizations of large-scale, data-intensive cloud computing platforms. Hadoop is an open source implementation of MapReduce and currently enjoys wide popularity. Hadoop has a high-dimensional space of configuration parameters (roughly 200) that practitioners, such as computational scientists, systems researchers, and business analysts, must set to achieve efficient and cost-effective execution. In this work we observe that MapReduce application performance is highly influenced by Map concurrency, defined in terms of two configurable parameters: the number of available map slots and the number of map tasks running over those slots. We show that, as Map concurrency is varied, some inherent MapReduce characteristics allow systematic and well-informed prediction of the MapReduce performance response (runtime increase or decrease). We propose Map Concurrency Characterization, MC2, a predictor for MapReduce performance response. MC2 allows for optimized configuration of the Map phase and, consequently, enhanced Hadoop performance. Current related schemes require mathematical modeling, simulation, dynamic instrumentation, static analysis of unmodified MapReduce application code, and/or actual performance measurements. In contrast, MC2 simply bases its decisions on MapReduce characteristics that are affected by Map concurrency. We implemented MC2 and conducted comprehensive experiments on a private cloud and on Amazon EC2 using Hadoop 0.20.2. Our results show that MC2 can correctly predict MapReduce performance response and provide up to 2.3X speedup in runtime for the tested benchmarks. This performance improvement also allows MC2 to help reduce cost in a cloud setting.
We believe that MC2 offers a timely contribution to the data analytics domain on the cloud, especially as Hadoop usage continues to grow beyond companies like Google, Microsoft, Facebook and Yahoo!.
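To make the two Map concurrency parameters concrete, the following is a minimal sketch, not MC2 itself, of how the number of map tasks and the number of map slots together determine how many "waves" of map tasks a job runs. It assumes one map task per input split and an equal number of slots on every node; the function names are hypothetical. In Hadoop 0.20.x the per-node slot count is set with mapred.tasktracker.map.tasks.maximum.

```python
import math


def map_tasks_for_input(input_bytes: int, split_bytes: int) -> int:
    """Estimate the number of map tasks, assuming one task per input split."""
    if input_bytes < 0 or split_bytes <= 0:
        raise ValueError("input_bytes must be >= 0 and split_bytes must be > 0")
    return math.ceil(input_bytes / split_bytes)


def map_waves(num_map_tasks: int, slots_per_node: int, num_nodes: int) -> int:
    """Number of rounds needed to run all map tasks over the available slots.

    Assumes every node exposes the same number of map slots (an assumption
    made for this illustration only).
    """
    total_slots = slots_per_node * num_nodes
    if num_map_tasks < 0 or total_slots <= 0:
        raise ValueError("need num_map_tasks >= 0 and at least one map slot")
    return math.ceil(num_map_tasks / total_slots)


if __name__ == "__main__":
    # Hypothetical job: 10 GiB of input, 64 MiB splits, 10 nodes, 2 slots each.
    tasks = map_tasks_for_input(10 * 1024**3, 64 * 1024**2)
    print(tasks, map_waves(tasks, slots_per_node=2, num_nodes=10))
```

Changing either the split size (and hence the task count) or the slots per node changes the number of waves, which is the kind of variation in Map concurrency whose effect on runtime MC2 aims to predict.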

