Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive
- Univ. of Science and Technology of China, Hefei (China)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Metagenomics, the study of all microbial species cohabiting in an environment, usually produces large amounts of sequence data, ranging from several GBs to a few TBs. Analyzing metagenomics data involves both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also improved the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to a speedup of up to 193× for the compute-intensive step and 9.6× over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results indicate that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC); National Key Research and Development Program of China
- Grant/Contract Number:
- AC02-05CH11231
- OSTI ID:
- 1532334
- Journal Information:
- International Journal of Parallel Programming, Vol. 46, Issue 4; ISSN 0885-7458
- Publisher:
- Springer
- Country of Publication:
- United States
- Language:
- English