Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive
Abstract
Metagenomics, the study of all microbial species cohabiting in an environment, usually produces large amounts of sequence data, ranging from several GB to a few TB. Analyzing metagenomics data involves both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also improved the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193× speedup for the compute-intensive step and a 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results indicate that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
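The abstract mentions compressed k-mer storage but does not say how the k-mers are encoded. As an illustration only, the sketch below shows one common approach, packing each base into 2 bits so that a k-mer becomes a single integer. The function names (`encode_kmer`, `decode_kmer`, `packed_kmers`) are hypothetical and are not taken from BioPig or from the paper.

```python
# Illustrative sketch only: 2-bit packing of DNA k-mers.
# This is an assumed encoding, not the scheme used by BioPig or the authors.
from typing import Iterator

_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer over A/C/G/T into an integer, 2 bits per base."""
    value = 0
    for base in kmer.upper():
        # Raises KeyError for ambiguous bases such as N.
        value = (value << 2) | _CODE[base]
    return value


def decode_kmer(value: int, k: int) -> str:
    """Unpack an integer produced by encode_kmer back into a k-mer of length k."""
    chars = []
    for _ in range(k):
        chars.append(_BASES[value & 3])
        value >>= 2
    return "".join(reversed(chars))


def packed_kmers(read: str, k: int) -> Iterator[int]:
    """Yield every packed k-mer of a read, skipping windows that contain non-ACGT bases."""
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2*k bits
    value = 0
    filled = 0  # number of consecutive valid bases seen so far
    for base in read.upper():
        code = _CODE.get(base)
        if code is None:
            # Ambiguous base: restart the window after it.
            value = 0
            filled = 0
            continue
        value = ((value << 2) | code) & mask
        filled += 1
        if filled >= k:
            yield value
```

With this encoding a 20-mer needs 40 bits rather than 20 bytes of text, and any k up to 32 fits in one 64-bit word, which is the kind of saving that lets a larger dataset fit on the same hardware.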
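- Authors:
- Lin, Han; Su, Zhichao; Meng, Xiandong; Jin, Xu; Wang, Zhong; Han, Wenting; An, Hong; Chi, Mengxian; Wu, Zheng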
- Univ. of Science and Technology of China, Hefei (China)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Publication Date:
- Research Org.:
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC); National Key Research and Development Program of China
- OSTI Identifier:
- 1532334
- Grant/Contract Number:
- AC02-05CH11231
- Resource Type:
- Accepted Manuscript
- Journal Name:
- International Journal of Parallel Programming
- Additional Journal Information:
- Journal Volume: 46; Journal Issue: 4; Journal ID: ISSN 0885-7458
- Publisher:
- Springer
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; Metagenomics; Hadoop; MPI; Optimization; Pig Latin; BioPig; Big data; Data-intensive; Compute-intensive
Citation Formats
Lin, Han, Su, Zhichao, Meng, Xiandong, Jin, Xu, Wang, Zhong, Han, Wenting, An, Hong, Chi, Mengxian, and Wu, Zheng. Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive. United States: N. p., 2017. Web. doi:10.1007/s10766-017-0524-z.
Lin, Han, Su, Zhichao, Meng, Xiandong, Jin, Xu, Wang, Zhong, Han, Wenting, An, Hong, Chi, Mengxian, & Wu, Zheng. Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive. United States. https://doi.org/10.1007/s10766-017-0524-z
Lin, Han, Su, Zhichao, Meng, Xiandong, Jin, Xu, Wang, Zhong, Han, Wenting, An, Hong, Chi, Mengxian, and Wu, Zheng. 2017. "Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive". United States. https://doi.org/10.1007/s10766-017-0524-z. https://www.osti.gov/servlets/purl/1532334.
@article{osti_1532334,
title = {Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive},
author = {Lin, Han and Su, Zhichao and Meng, Xiandong and Jin, Xu and Wang, Zhong and Han, Wenting and An, Hong and Chi, Mengxian and Wu, Zheng},
abstractNote = {Metagenomics, the study of all microbial species cohabiting in an environment, usually produces large amounts of sequence data, ranging from several GB to a few TB. Analyzing metagenomics data involves both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also improved the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193× speedup for the compute-intensive step and a 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results indicate that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.},
doi = {10.1007/s10766-017-0524-z},
journal = {International Journal of Parallel Programming},
number = 4,
volume = 46,
place = {United States},
year = {2017},
month = {10}
}