DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Abstract

Metagenomics, the study of all microbial species cohabiting an environment, usually produces large amounts of sequence data, ranging from several GB to a few TB. Analyzing metagenomics data includes both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also improved the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193× speedup for the compute-intensive step and a 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results indicate that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
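The abstract mentions compressed k-mer storage without describing the encoding. As an illustration only, the sketch below shows one common way to compress k-mers: packing each base into 2 bits so that a k-mer fits in a single integer. The function names (`pack_kmer`, `unpack_kmer`, `packed_kmers`) are hypothetical, and this is an assumption about what such storage could look like, not the authors' BioPig implementation.

```python
# Hypothetical sketch of 2-bit k-mer packing; not taken from BioPig.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer over the alphabet ACGT into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | _CODE[base]
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string from its packed integer form."""
    bases = []
    for _ in range(k):
        bases.append(_BASE[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


def packed_kmers(read: str, k: int):
    """Yield every packed k-mer of a read, skipping windows with non-ACGT bases."""
    mask = (1 << (2 * k)) - 1  # keeps only the most recent k bases
    value = 0
    valid = 0  # number of consecutive valid bases seen so far
    for base in read:
        code = _CODE.get(base)
        if code is None:  # ambiguous base such as N: restart the window
            value = 0
            valid = 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            yield value


if __name__ == "__main__":
    for packed in packed_kmers("ACGTNACG", 3):
        print(packed, unpack_kmer(packed, 3))
```

Under this scheme a k-mer of length up to 32 fits in 64 bits, compared with one byte or more per base when stored as text.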

Authors:
Lin, Han [1]; Su, Zhichao [1]; Meng, Xiandong [2]; Jin, Xu [1]; Wang, Zhong [2]; Han, Wenting [1]; An, Hong [1]; Chi, Mengxian [1]; Wu, Zheng [1]
  1. Univ. of Science and Technology of China, Hefei (China)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC); National Key Research and Development Program of China
OSTI Identifier:
1532334
Grant/Contract Number:
AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
International Journal of Parallel Programming
Additional Journal Information:
Journal Volume: 46; Journal Issue: 4; Journal ID: ISSN 0885-7458
Publisher:
Springer
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Metagenomics; Hadoop; MPI; Optimization; Pig Latin; BioPig; Big data; Data-intensive; Compute-intensive

Citation Formats

Lin, Han, Su, Zhichao, Meng, Xiandong, Jin, Xu, Wang, Zhong, Han, Wenting, An, Hong, Chi, Mengxian, and Wu, Zheng. Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive. United States: N. p., 2017. Web. doi:10.1007/s10766-017-0524-z.
Lin, Han, Su, Zhichao, Meng, Xiandong, Jin, Xu, Wang, Zhong, Han, Wenting, An, Hong, Chi, Mengxian, & Wu, Zheng. Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive. United States. doi:10.1007/s10766-017-0524-z.
Lin, Han, Su, Zhichao, Meng, Xiandong, Jin, Xu, Wang, Zhong, Han, Wenting, An, Hong, Chi, Mengxian, and Wu, Zheng. "Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive". United States. doi:10.1007/s10766-017-0524-z. https://www.osti.gov/servlets/purl/1532334.
@article{osti_1532334,
title = {Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive},
author = {Lin, Han and Su, Zhichao and Meng, Xiandong and Jin, Xu and Wang, Zhong and Han, Wenting and An, Hong and Chi, Mengxian and Wu, Zheng},
abstractNote = {Metagenomics, the study of all microbial species cohabiting an environment, usually produces large amounts of sequence data, ranging from several GB to a few TB. Analyzing metagenomics data includes both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also improved the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193× speedup for the compute-intensive step and a 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results indicate that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.},
doi = {10.1007/s10766-017-0524-z},
journal = {International Journal of Parallel Programming},
number = {4},
volume = {46},
place = {United States},
year = {2017},
month = {10}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Figures / Tables:

Fig. 1: A read in fastq format
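The figure itself is not reproduced in this record. For orientation, a read in FASTQ format is conventionally stored as four lines: an identifier line starting with `@`, the base sequence, a separator line starting with `+`, and a quality string with one character per base. The sketch below parses that simple four-line layout; `FastqRecord` and `parse_fastq` are hypothetical names, and multi-line sequences are assumed not to occur.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str   # identifier without the leading "@"
    sequence: str  # base calls, e.g. "ACGT"
    quality: str   # one quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Parse four-line FASTQ records (assumes no multi-line sequences)."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:  # tolerate blank lines between records
            continue
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], sequence, quality)


if __name__ == "__main__":
    example = ["@read1\n", "ACGTACGT\n", "+\n", "IIIIIIII\n"]
    for record in parse_fastq(example):
        print(record.read_id, record.sequence, record.quality)
```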

Works referenced in this record:

MapReduce: simplified data processing on large clusters
journal, January 2008

  • Dean, Jeffrey; Ghemawat, Sanjay; Mehta, Brijesh
  • Communications of the ACM, Vol. 51, Issue 1
  • DOI: 10.1145/1327452.1327492

A Map-Reduce Framework for Clustering Metagenomes
conference, May 2013

  • Rasheed, Zeehasham; Rangwala, Huzefa
  • 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)
  • DOI: 10.1109/IPDPSW.2013.100

DIME: A Novel Framework for De Novo Metagenomic Sequence Assembly
journal, February 2015

  • Guo, Xuan; Yu, Ning; Ding, Xiaojun
  • Journal of Computational Biology, Vol. 22, Issue 2
  • DOI: 10.1089/cmb.2014.0251

Apache hadoop performance-tuning methodologies and best practices
conference, January 2012

  • Joshi, Shrinivas B.
  • Proceedings of the third joint WOSP/SIPEW international conference on Performance Engineering - ICPE '12
  • DOI: 10.1145/2188286.2188323

HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack
conference, May 2015

  • Fox, Geoffrey C.; Qiu, Judy; Kamburugamuve, Supun
  • 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
  • DOI: 10.1109/CCGrid.2015.122

DataMPI: Extending MPI to Hadoop-Like Big Data Computing
conference, May 2014

  • Lu, Xiaoyi; Liang, Fan; Wang, Bing
  • 2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium
  • DOI: 10.1109/IPDPS.2014.90

Sequencing technologies — the next generation
journal, December 2009

  • Metzker, Michael L.
  • Nature Reviews Genetics, Vol. 11, Issue 1
  • DOI: 10.1038/nrg2626

Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf
journal, January 2015


Connected Components in MapReduce and Beyond
conference, January 2014

  • Kiveris, Raimondas; Lattanzi, Silvio; Mirrokni, Vahab
  • Proceedings of the ACM Symposium on Cloud Computing - SOCC '14
  • DOI: 10.1145/2670979.2670997

Performance evaluation and tuning of BioPig for genomic analysis
conference, January 2015

  • Shi, Lizhen; Wang, Zhong; Yu, Weikuan
  • Proceedings of the 2015 International Workshop on Data-Intensive Scalable Computing Systems - DISCS '15
  • DOI: 10.1145/2831244.2831252

Efficiency of a Good But Not Linear Set Union Algorithm
journal, April 1975


MRONLINE: MapReduce online performance tuning
conference, January 2014

  • Li, Min; Zeng, Liangzhao; Meng, Shicong
  • Proceedings of the 23rd international symposium on High-performance parallel and distributed computing - HPDC '14
  • DOI: 10.1145/2600212.2600229

Pig latin: a not-so-foreign language for data processing
conference, January 2008

  • Olston, Christopher; Reed, Benjamin; Srivastava, Utkarsh
  • Proceedings of the 2008 ACM SIGMOD international conference on Management of data - SIGMOD '08
  • DOI: 10.1145/1376616.1376726

OpenMP: an industry standard API for shared-memory programming
journal, January 1998

  • Dagum, L.; Menon, R.
  • IEEE Computational Science and Engineering, Vol. 5, Issue 1
  • DOI: 10.1109/99.660313

Metagenomic Discovery of Biomass-Degrading Genes and Genomes from Cow Rumen
journal, January 2011


A high-performance, portable implementation of the MPI message passing interface standard
journal, September 1996


Bridging the gap between HPC and big data frameworks
journal, April 2017

  • Anderson, Michael; Smith, Shaden; Sundaram, Narayanan
  • Proceedings of the VLDB Endowment, Vol. 10, Issue 8
  • DOI: 10.14778/3090163.3090168

BioPig: a Hadoop-based analytic toolkit for large-scale sequence data
journal, September 2013


Next-generation sequencing: big data meets high performance computing
journal, April 2017


Apache Hadoop YARN: yet another resource negotiator
conference, January 2013

  • Vavilapalli, Vinod Kumar; Seth, Siddharth; Saha, Bikas
  • Proceedings of the 4th annual Symposium on Cloud Computing - SOCC '13
  • DOI: 10.1145/2523616.2523633