Efficient parallel implementation of the SHRiMP sequence alignment tool using MapReduce

Rawan AlSaad; Qutaibah Malluhi; Mohamed Abouelhoda

doi:10.5339/qfarf.2012.CSPS7

Abstract

With the advent of ultra-high-throughput DNA sequencing technologies used in Next-Generation Sequencing (NGS) machines, bioinformatics is entering a daunting new era of petabyte-scale data. The enormous amounts of data produced by NGS machines create storage, scalability, and performance challenges. At the same time, cloud computing architectures are rapidly emerging as robust and economical solutions for high-performance computing of all kinds. To date, these architectures have had limited impact on the sequence alignment problem, in which sequence reads must be compared to a reference genome. In this research, we present a methodology for efficiently porting one of the recently developed NGS alignment tools, SHRiMP, to the cloud environment using the MapReduce programming model. Critical to the function and performance of our methodology is the implementation of several techniques and mechanisms that facilitate this port. These techniques and mechanisms allow the "cloudified" SHRiMP to run as a black box within the MapReduce model, without the need to build new parallel algorithms or recode the tool from scratch. The approach builds on Hadoop, the open-source implementation of MapReduce, and its underlying distributed file system (HDFS). The methodology is deployed on the cloud infrastructure installed at Qatar University. Experimental results demonstrate that running large-scale SHRiMP sequence alignment jobs in parallel using the MapReduce framework dramatically improves performance when the user utilizes the resources provided by the cloud. In conclusion, using cloud computing for NGS data analysis is a viable and efficient alternative to analyzing data on in-house compute clusters.
The efficiency and flexibility of cloud computing environments and the MapReduce programming model give the SHRiMP sequence alignment tool a considerable performance boost. Using this methodology, ordinary biologists can perform computationally demanding sequence alignment tasks without delving into server and database management, without the complexities and hassles of running jobs on grids and clusters, and without modifying the existing code to adapt it for parallel processing.
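The black-box idea described above can be illustrated with a minimal local sketch in Python. It is not the authors' implementation and does not use Hadoop: it only mimics the pattern of splitting the reads into chunks, running an unmodified command-line tool on each chunk (the map step), and concatenating the outputs (the merge step). The function names `split_fasta`, `run_black_box`, and `align_in_chunks` are hypothetical, the reads are assumed to be in FASTA format, and the aligner command is left as a parameter because the exact SHRiMP invocation is not given in the abstract.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, Iterable, Iterator, List


def split_fasta(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Group FASTA records (a '>' header plus its sequence lines) into chunks."""
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    chunk: List[str] = []
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            # Start a new chunk once the current one holds enough reads.
            if count == reads_per_chunk:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


def run_black_box(command: List[str], chunk: List[str]) -> List[str]:
    """Map step: write one chunk to a temporary file, run the unmodified
    tool on it, and return the tool's standard output as lines.

    Assumption: the tool accepts the reads file path as its last argument.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as handle:
        handle.write("\n".join(chunk) + "\n")
        path = handle.name
    try:
        result = subprocess.run(
            command + [path], capture_output=True, text=True, check=True
        )
    finally:
        os.unlink(path)
    return result.stdout.splitlines()


def align_in_chunks(
    lines: Iterable[str],
    reads_per_chunk: int,
    mapper: Callable[[List[str]], List[str]],
) -> List[str]:
    """Apply the mapper to every chunk and concatenate the results.

    Chunks are processed sequentially here; in a MapReduce deployment the
    framework would schedule them in parallel across nodes.
    """
    merged: List[str] = []
    for chunk in split_fasta(lines, reads_per_chunk):
        merged.extend(mapper(chunk))
    return merged


if __name__ == "__main__":
    # Stand-in for the aligner: a tiny command that counts reads in a file.
    stand_in = [
        sys.executable,
        "-c",
        "import sys; print(sum(1 for l in open(sys.argv[1]) if l.startswith('>')))",
    ]
    reads = [">r1", "ACGT", ">r2", "GGCC", ">r3", "TTAA"]
    print(align_in_chunks(reads, 2, lambda c: run_black_box(stand_in, c)))
```

Because the tool is only ever invoked through its command line, swapping the stand-in command for a real aligner would require no change to the tool itself, which is the property the abstract emphasizes.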
