With the advent of the ultra-high-throughput DNA sequencing technologies used in Next-Generation Sequencing (NGS) machines, we are entering a daunting new era of petabyte-scale bioinformatics data. The enormous amounts of data produced by NGS machines create storage, scalability, and performance challenges. At the same time, cloud computing architectures are rapidly emerging as robust and economical solutions for high-performance computing of all kinds. To date, these architectures have had limited impact on the sequence alignment problem, in which sequence reads must be compared to a reference genome. In this research, we present a methodology for efficiently porting one of the recently developed NGS alignment tools, SHRiMP, to the cloud environment based on the MapReduce programming model. Critical to the function and performance of our methodology are several techniques and mechanisms that facilitate this port. They allow the "cloudified" SHRiMP to run as a black box within the MapReduce model, without the need to build new parallel algorithms or recode the tool from scratch. The approach is built on the MapReduce parallel programming model, its open-source implementation Hadoop, and its underlying distributed file system (HDFS). The methodology is deployed on the cloud infrastructure installed at Qatar University. Experimental results demonstrate that multiplexing large-scale SHRiMP sequence alignment jobs in parallel using the MapReduce framework dramatically improves performance when the user utilizes the resources provided by the cloud. In conclusion, using cloud computing for NGS data analysis is a viable and efficient alternative to analyzing data on in-house compute clusters.
The efficiency and flexibility of cloud computing environments and the MapReduce programming model give the SHRiMP sequence alignment tool a considerable performance boost. Using this methodology, ordinary biologists can perform computationally demanding sequence alignment tasks without delving into server and database management, without the complexities and hassles of running jobs on grids and clusters, and without modifying the existing code to adapt it for parallel processing.
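
The black-box idea described above can be illustrated with a minimal, single-machine sketch. It is not the authors' Hadoop implementation: a thread pool stands in for the MapReduce framework, and a toy exact-match function (make_toy_aligner, a hypothetical name) stands in for SHRiMP, whose command-line interface is deliberately not reproduced here. All function names are assumptions made for illustration. The point is only that the aligner is called unchanged on each chunk of reads (map) and the per-chunk outputs are concatenated (reduce).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Read = Tuple[str, str]   # (read identifier, sequence)
Hit = Tuple[str, int]    # (read identifier, 0-based position, or -1 if unaligned)
Aligner = Callable[[List[Read]], List[Hit]]


def parse_fasta(text: str) -> List[Read]:
    """Parse FASTA-formatted text into (identifier, sequence) pairs."""
    reads: List[Read] = []
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                reads.append((name, "".join(parts)))
            name, parts = line[1:], []
        else:
            parts.append(line)
    if name is not None:
        reads.append((name, "".join(parts)))
    return reads


def split_reads(reads: List[Read], n_chunks: int) -> List[List[Read]]:
    """Partition reads into at most n_chunks contiguous, non-empty chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size = max(1, -(-len(reads) // n_chunks))  # ceiling division
    return [reads[i:i + size] for i in range(0, len(reads), size)]


def make_toy_aligner(reference: str) -> Aligner:
    """Hypothetical stand-in for the real aligner: exact substring search."""
    def align(chunk: List[Read]) -> List[Hit]:
        return [(name, reference.find(seq)) for name, seq in chunk]
    return align


def run_black_box(chunks: List[List[Read]], aligner: Aligner,
                  workers: int = 4) -> List[Hit]:
    """Map: run the unmodified aligner on every chunk in parallel.
    Reduce: concatenate the per-chunk results in chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(aligner, chunks))
    return [hit for part in partial for hit in part]


if __name__ == "__main__":
    reads = parse_fasta(">r1\nACGT\n>r2\nGGGG\n>r3\nCGTA\n")
    hits = run_black_box(split_reads(reads, 2), make_toy_aligner("TTACGTAA"))
    for name, pos in hits:
        print(name, pos)
```

Because the aligner is passed in as an opaque callable, replacing the toy function with a wrapper that launches an external alignment program on a chunk file would not change the splitting or merging logic. That separation is what lets an existing tool be parallelized without being recoded.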

