OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Journal Article · International Journal of Parallel Programming

Metagenomics, the study of all microbial species cohabiting an environment, usually produces large amounts of sequence data, ranging from several GBs to a few TBs. Analyzing metagenomics data involves both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also improved the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193× speedup for the compute-intensive step and a 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results indicate that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
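
The abstract mentions compressed k-mer storage without describing the scheme. As an illustration only, the sketch below shows one common way to compress k-mers: packing each base into 2 bits of an integer. This is an assumption about what such compression could look like, not the encoding BioPig actually uses; the function names `pack_kmer`, `unpack_kmer`, and `packed_kmers` are hypothetical.

```python
# Hypothetical illustration of 2-bit k-mer packing; not BioPig's actual code.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer over A/C/G/T into an integer, 2 bits per base."""
    value = 0
    for base in kmer.upper():
        # Raises KeyError for ambiguous symbols such as N.
        value = (value << 2) | _CODE[base]
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string from its packed form; k must be supplied."""
    bases = []
    for _ in range(k):
        bases.append(_BASES[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


def packed_kmers(read: str, k: int):
    """Yield every packed k-mer of a read using a rolling 2-bit window.

    Windows that would contain a non-ACGT symbol are skipped.
    """
    mask = (1 << (2 * k)) - 1  # keeps only the newest k bases
    value = 0
    filled = 0  # number of valid bases currently in the window
    for base in read.upper():
        code = _CODE.get(base)
        if code is None:
            value, filled = 0, 0  # restart after an ambiguous base
            continue
        value = ((value << 2) | code) & mask
        filled += 1
        if filled >= k:
            yield value
```

With this layout a k-mer with k up to 32 fits in a single 64-bit word, compared with one byte or more per base when stored as text.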

Research Organization:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC); National Key Research and Development Program of China
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1532334
Journal Information:
International Journal of Parallel Programming, Vol. 46, Issue 4; ISSN 0885-7458
Publisher:
Springer
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 1 work
Citation information provided by
Web of Science
