To analyze and extract valuable information from big image data, we have developed a framework for distributed image processing in Hadoop MapReduce. A vast amount of scientific data is now represented as images from sources such as medical tomography, and applying algorithms to these images has long been limited by the processing capacity of a single machine. MapReduce, created by Google, presents a potential solution: it efficiently parallelizes computation by distributing tasks and data across multiple machines. Hadoop, an open-source implementation of MapReduce, is gaining widespread popularity due to features such as scalability and fault tolerance. However, Hadoop is primarily used with text-based input data; its ability to process image data and its performance behavior on image processing workloads have not been fully explored. We propose a framework that efficiently enables image processing on Hadoop and characterizes its behavior using a widely used image processing algorithm, edge detection. Existing approaches to distributed image processing suffer from two main problems: (1) input images must be converted to a custom file format, and (2) image processing algorithms must adhere to a specific API, which may prevent some algorithms from being applied on Hadoop. Our framework avoids these problems by (1) bundling all small images into one large file that Hadoop can parse seamlessly and (2) imposing no API restrictions, allowing any image processing algorithm to be ported directly to Hadoop. A Reduce-less job is then launched, in which the Mappers contain both the image processing code and a mechanism to write the processed images back individually to HDFS. We have tested the framework using edge detection on a dataset of 3,760 biomedical images. In addition, we characterized edge detection along several dimensions, such as degree of parallelism and network traffic patterns.
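The bundling step described above can be sketched as follows. This is a minimal, hypothetical illustration in Python of packing many small images into one length-prefixed container file (in practice, Hadoop deployments often use a SequenceFile keyed by filename for this purpose); it is not the authors' actual implementation:

```python
import struct

# Pack many small "images" (name -> bytes) into one large container so a
# distributed filesystem sees a single big file instead of many small ones.
# Record layout: [name_len:u32][name][data_len:u32][data], repeated.
def bundle(images):
    out = bytearray()
    for name, data in images.items():
        encoded = name.encode("utf-8")
        out += struct.pack(">I", len(encoded)) + encoded
        out += struct.pack(">I", len(data)) + data
    return bytes(out)

# Recover the individual images from the container; a Mapper could iterate
# records this way, process each image, and write the result back under
# its original name.
def unbundle(blob):
    images, pos = {}, 0
    while pos < len(blob):
        (name_len,) = struct.unpack_from(">I", blob, pos); pos += 4
        name = blob[pos:pos + name_len].decode("utf-8"); pos += name_len
        (data_len,) = struct.unpack_from(">I", blob, pos); pos += 4
        images[name] = blob[pos:pos + data_len]; pos += data_len
    return images
```

Because each record carries its own name and length, a bundle round-trips losslessly and can be scanned sequentially without a separate index.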
We observed that varying the number of map tasks has a significant impact on Hadoop's performance. The best performance was obtained when the number of map tasks equaled the number of available map slots, provided the application's resource demands were satisfied. Compared to the default Hadoop configuration, a speedup of 2.1X was achieved.
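One common way to realize this tuning is to choose the input split size so that the dataset yields one map task per available slot. The helper below is a hypothetical sketch of that arithmetic (the function names and the one-task-per-slot target are illustrative assumptions, not parameters taken from the paper):

```python
import math

# Hypothetical helper: pick a split size so that a dataset of total_bytes
# is divided into roughly num_slots input splits, i.e. one map task per
# available map slot, mirroring the tuning observation above.
def split_size_for_slots(total_bytes, num_slots):
    return math.ceil(total_bytes / num_slots)

# Number of map tasks Hadoop would launch for a given split size
# (one task per input split).
def expected_map_tasks(total_bytes, split_size):
    return math.ceil(total_bytes / split_size)
```

For example, a 10 MB dataset on a cluster with 8 map slots would use a split size of 1,250,000 bytes, producing exactly 8 map tasks.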