Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Lin, Han; Su, Zhichao; Meng, Xiandong; Jin, Xu; Wang, Zhong; Han, Wenting; An, Hong; Chi, Mengxian; Wu, Zheng

doi:10.1007/s10766-017-0524-z

Abstract

Metagenomics, the study of all microbial species cohabiting in an environment, usually produces large amounts of sequence data, ranging from several GBs to a few TBs. Analyzing metagenomics data includes both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also improved the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193× speedup for the compute-intensive step and a 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results indicate that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
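
The abstract mentions compressed k-mer storage but this record does not describe the encoding used. As a rough illustration of the general idea only, the sketch below packs each DNA base into 2 bits so that a k-mer fits in a single integer instead of a string. The function names (`pack_kmer`, `unpack_kmer`, `packed_kmers`) and the choice to skip windows containing ambiguous bases are assumptions made for this example, not details taken from the paper or from BioPig.

```python
# Hypothetical sketch of 2-bit k-mer packing; not the paper's actual implementation.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string from its packed form; k must be supplied."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def packed_kmers(read: str, k: int):
    """Yield packed k-mers from a read, skipping windows with bases other than A/C/G/T."""
    for i in range(len(read) - k + 1):
        window = read[i:i + k]
        if all(base in ENCODE for base in window):
            yield pack_kmer(window)
```

With this layout a 31-mer occupies 62 bits rather than 31 characters, which is the kind of saving that makes larger datasets fit on the same hardware; the actual gains reported in the abstract depend on the authors' own design.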

Authors:

Lin, Han ^[1]; Su, Zhichao ^[1]; Meng, Xiandong ^[2]; Jin, Xu ^[1]; Wang, Zhong ^[2]; Han, Wenting ^[1]; An, Hong ^[1]; Chi, Mengxian ^[1]; Wu, Zheng ^[1]

[1] Univ. of Science and Technology of China, Hefei (China)
[2] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Publication Date: 2017-10-07

Research Org.: Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Sponsoring Org.: USDOE Office of Science (SC); National Key Research and Development Program of China

OSTI Identifier: 1532334

Grant/Contract Number: AC02-05CH11231

Resource Type: Accepted Manuscript

Journal Name: International Journal of Parallel Programming

Additional Journal Information: Journal Volume: 46; Journal Issue: 4; Journal ID: ISSN 0885-7458

Publisher: Springer

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; Metagenomics; Hadoop; MPI; Optimization; Pig Latin; BioPig; Big data; Data-intensive; Compute-intensive

Citation Formats


Lin, Han, Su, Zhichao, Meng, Xiandong, Jin, Xu, Wang, Zhong, Han, Wenting, An, Hong, Chi, Mengxian, and Wu, Zheng. Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive. United States: N. p., 2017. Web. doi:10.1007/s10766-017-0524-z.

Lin, Han, Su, Zhichao, Meng, Xiandong, Jin, Xu, Wang, Zhong, Han, Wenting, An, Hong, Chi, Mengxian, & Wu, Zheng. Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive. United States. https://doi.org/10.1007/s10766-017-0524-z

Lin, Han, Su, Zhichao, Meng, Xiandong, Jin, Xu, Wang, Zhong, Han, Wenting, An, Hong, Chi, Mengxian, and Wu, Zheng. 2017. "Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive". United States. https://doi.org/10.1007/s10766-017-0524-z. https://www.osti.gov/servlets/purl/1532334.
                    
@article{osti_1532334,
  title        = {Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive},
  author       = {Lin, Han and Su, Zhichao and Meng, Xiandong and Jin, Xu and Wang, Zhong and Han, Wenting and An, Hong and Chi, Mengxian and Wu, Zheng},
  abstractNote = {Metagenomics, the study of all microbial species cohabiting in an environment, usually produces large amounts of sequence data, ranging from several GBs to a few TBs. Analyzing metagenomics data includes both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also improved the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193× speedup for the compute-intensive step and a 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results indicate that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.},
  doi          = {10.1007/s10766-017-0524-z},
  journal      = {International Journal of Parallel Programming},
  number       = 4,
  volume       = 46,
  place        = {United States},
  year         = {2017},
  month        = oct
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1007/s10766-017-0524-z

Citation Metrics:

Cited by: 1 work

Citation information provided by Web of Science

Figures / Tables:

Fig. 1: A read in fastq format

All figures and tables (14 total)

Works referenced in this record:

MapReduce: simplified data processing on large clusters
journal, January 2008

Dean, Jeffrey; Ghemawat, Sanjay; Mehta, Brijesh
Communications of the ACM, Vol. 51, Issue 1
DOI: 10.1145/1327452.1327492

A Map-Reduce Framework for Clustering Metagenomes
conference, May 2013

Rasheed, Zeehasham; Rangwala, Huzefa
2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)
DOI: 10.1109/IPDPSW.2013.100

DIME: A Novel Framework for De Novo Metagenomic Sequence Assembly
journal, February 2015

Guo, Xuan; Yu, Ning; Ding, Xiaojun
Journal of Computational Biology, Vol. 22, Issue 2
DOI: 10.1089/cmb.2014.0251

Apache hadoop performance-tuning methodologies and best practices
conference, January 2012

Joshi, Shrinivas B.
Proceedings of the third joint WOSP/SIPEW international conference on Performance Engineering - ICPE '12
DOI: 10.1145/2188286.2188323

HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack
conference, May 2015

Fox, Geoffrey C.; Qiu, Judy; Kamburugamuve, Supun
2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
DOI: 10.1109/CCGrid.2015.122

DataMPI: Extending MPI to Hadoop-Like Big Data Computing
conference, May 2014

Lu, Xiaoyi; Liang, Fan; Wang, Bing
2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium
DOI: 10.1109/IPDPS.2014.90

Sequencing technologies — the next generation
journal, December 2009

Metzker, Michael L.
Nature Reviews Genetics, Vol. 11, Issue 1
DOI: 10.1038/nrg2626

Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf
journal, January 2015

Reyes-Ortiz, Jorge L.; Oneto, Luca; Anguita, Davide
Procedia Computer Science, Vol. 53
DOI: 10.1016/j.procs.2015.07.286

Connected Components in MapReduce and Beyond
conference, January 2014

Kiveris, Raimondas; Lattanzi, Silvio; Mirrokni, Vahab
Proceedings of the ACM Symposium on Cloud Computing - SOCC '14
DOI: 10.1145/2670979.2670997

Performance evaluation and tuning of BioPig for genomic analysis
conference, January 2015

Shi, Lizhen; Wang, Zhong; Yu, Weikuan
Proceedings of the 2015 International Workshop on Data-Intensive Scalable Computing Systems - DISCS '15
DOI: 10.1145/2831244.2831252

Efficiency of a Good But Not Linear Set Union Algorithm
journal, April 1975

Tarjan, Robert Endre
Journal of the ACM, Vol. 22, Issue 2
DOI: 10.1145/321879.321884

MRONLINE: MapReduce online performance tuning
conference, January 2014

Li, Min; Zeng, Liangzhao; Meng, Shicong
Proceedings of the 23rd international symposium on High-performance parallel and distributed computing - HPDC '14
DOI: 10.1145/2600212.2600229

Pig latin: a not-so-foreign language for data processing
conference, January 2008

Olston, Christopher; Reed, Benjamin; Srivastava, Utkarsh
Proceedings of the 2008 ACM SIGMOD international conference on Management of data - SIGMOD '08
DOI: 10.1145/1376616.1376726

OpenMP: an industry standard API for shared-memory programming
journal, January 1998

Dagum, L.; Menon, R.
IEEE Computational Science and Engineering, Vol. 5, Issue 1
DOI: 10.1109/99.660313

Metagenomic Discovery of Biomass-Degrading Genes and Genomes from Cow Rumen
journal, January 2011

Hess, M.; Sczyrba, A.; Egan, R.
Science, Vol. 331, Issue 6016
DOI: 10.1126/science.1200387

A high-performance, portable implementation of the MPI message passing interface standard
journal, September 1996

Gropp, William; Lusk, Ewing; Doss, Nathan
Parallel Computing, Vol. 22, Issue 6
DOI: 10.1016/0167-8191(96)00024-5

Bridging the gap between HPC and big data frameworks
journal, April 2017

Anderson, Michael; Smith, Shaden; Sundaram, Narayanan
Proceedings of the VLDB Endowment, Vol. 10, Issue 8
DOI: 10.14778/3090163.3090168

BioPig: a Hadoop-based analytic toolkit for large-scale sequence data
journal, September 2013

Nordberg, H.; Bhatia, K.; Wang, K.
Bioinformatics, Vol. 29, Issue 23
DOI: 10.1093/bioinformatics/btt528

Next-generation sequencing: big data meets high performance computing
journal, April 2017

Schmidt, Bertil; Hildebrandt, Andreas
Drug Discovery Today, Vol. 22, Issue 4
DOI: 10.1016/j.drudis.2017.01.014

Apache Hadoop YARN: yet another resource negotiator
conference, January 2013

Vavilapalli, Vinod Kumar; Seth, Siddharth; Saha, Bikas
Proceedings of the 4th annual Symposium on Cloud Computing - SOCC '13
DOI: 10.1145/2523616.2523633

Similar Records in DOE PAGES and OSTI.GOV collections:

Center for Technology for Advanced Scientific Component Software (TASCS)

Technical Report Govindaraju, Madhusudhan

Advanced Scientific Computing Research, Computer Science FY 2010 Report. Center for Technology for Advanced Scientific Component Software: Distributed CCA. State University of New York, Binghamton, NY, 13902. Summary: The overall objective of Binghamton's involvement is to work on enhancements of the CCA environment, motivated by the applications and research initiatives discussed in the proposal. This year we are working on re-focusing our design and development efforts to develop proof-of-concept implementations that have the potential to significantly impact scientific components. We worked on developing parallel implementations for non-hydrostatic code and worked on a model coupling interface for biogeochemical computations coded in MATLAB.
https://doi.org/10.2172/1092881

A case study of tuning MapReduce for efficient Bioinformatics in the cloud

Journal Article Shi, Lizhen ; Wang, Zhong ; Yu, Weikuan ; ... - Parallel Computing

The combination of the Hadoop MapReduce programming model and cloud computing allows biological scientists to analyze next-generation sequencing (NGS) data in a timely and cost-effective manner. Cloud computing platforms remove the burden of IT facility procurement and management from end users and provide ease of access to Hadoop clusters. However, biological scientists are still expected to choose appropriate Hadoop parameters for running their jobs. More importantly, the available Hadoop tuning guidelines are either obsolete or too general to capture the particular characteristics of bioinformatics applications. In this paper, we aim to minimize the cloud computing cost spent on bioinformatics data...
Cited by 9
https://doi.org/10.1016/j.parco.2016.10.002

Large-scale seismic waveform quality metric calculation using Hadoop

Journal Article Magana-Zook, Steven ; Gaylord, Jessie M. ; Knapp, Douglas R. ; ... - Computers and Geosciences

Here in this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O...
Cited by 15
https://doi.org/10.1016/j.cageo.2016.05.012

A Lightweight, High-performance I/O Management Package for Data-intensive Computing

Technical Report Wang, Jun

Our group has been working with ANL collaborators on the topic of bridging the gap between parallel file system and local file system during the course of this project period. We visited Argonne National Lab -- Dr. Robert Ross's group for one week in the past summer 2007. We looked over our current project progress and planned the activities for the incoming years 2008-09. The PI met Dr. Robert Ross several times such as HEC FSIO workshop 08, SC08 and SC10. We explored the opportunities to develop a production system by leveraging our current prototype to (SOGP+PVFS) a new PVFS version.
https://doi.org/10.2172/1060561

An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics

Journal Article Taylor, Ronald C - BMC Bioinformatics, 11(Suppl 12):S1

Bioinformatics researchers are increasingly confronted with analysis of ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years. Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte scale data warehouses on Linux clusters, providing fault-tolerant parallelized analysis on such data using a programming style named MapReduce. An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase...
https://doi.org/10.1186/1471-2105-11-S12-S1
