A case study of tuning MapReduce for efficient Bioinformatics in the cloud

Shi, Lizhen; Wang, Zhong; Yu, Weikuan; Meng, Xiandong

doi:10.1016/j.parco.2016.10.002

Abstract

The combination of the Hadoop MapReduce programming model and cloud computing allows biological scientists to analyze next-generation sequencing (NGS) data in a timely and cost-effective manner. Cloud computing platforms remove the burden of IT facility procurement and management from end users and provide easy access to Hadoop clusters. However, biological scientists are still expected to choose appropriate Hadoop parameters for running their jobs. Moreover, the available Hadoop tuning guidelines are either obsolete or too general to capture the particular characteristics of bioinformatics applications. When MapReduce-based bioinformatics tools run in the cloud with the default settings, the result is often resource underutilization and wasted expense. In this paper, we aim to minimize the cloud computing cost of bioinformatics data analysis by identifying the significant Hadoop parameters and optimizing them. We choose k-mer counting, a representative application used in a large number of NGS data analysis tools, as our case study. Experimental results show that the fine-tuned parameters achieve a 4× overall speedup over the original performance with the default settings. Finally, this paper presents an exemplary case for tuning MapReduce-based bioinformatics applications in the cloud and documents the key parameters that could lead to significant performance benefits.
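For readers unfamiliar with the case-study workload, the following is a minimal, single-process sketch of k-mer counting expressed as a map step and a reduce step. It is an illustration only, not the authors' implementation and not Hadoop code: the function names `map_kmers`, `reduce_counts`, and `count_kmers` and the sample reads are hypothetical, and a real job would distribute these steps across a cluster.

```python
from collections import Counter
from itertools import chain


def map_kmers(read, k):
    """Map step: emit (k-mer, 1) for every length-k window of one read."""
    read = read.upper()
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct k-mer."""
    counts = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts


def count_kmers(reads, k):
    """Run the map step over all reads, then reduce the emitted pairs."""
    return reduce_counts(chain.from_iterable(map_kmers(r, k) for r in reads))


# Hypothetical sample reads, for illustration only.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
```

Every base of every read contributes to up to k emitted pairs, so the intermediate data can be considerably larger than the input. This is one plausible reason such a workload is sensitive to framework settings.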

Authors:

Shi, Lizhen ^[1]; Wang, Zhong ^[2]; Yu, Weikuan ^[1]; Meng, Xiandong ^[2]

[1] Florida State Univ., Tallahassee, FL (United States)
[2] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Publication Date: 2016-10-06

Research Org.: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Org.: USDOE Office of Science (SC); National Science Foundation (NSF)

OSTI Identifier: 1393100

Alternate Identifier(s): OSTI ID: 1398720

Grant/Contract Number: AC02-05CH11231; 1561041; 1564647

Resource Type: Accepted Manuscript

Journal Name: Parallel Computing

Additional Journal Information: Journal Volume: 61; Journal ID: ISSN 0167-8191

Publisher: Elsevier

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; Hadoop; YARN; Parameter optimization; K-mer counting; NGS

Citation Formats


Shi, Lizhen, Wang, Zhong, Yu, Weikuan, and Meng, Xiandong. A case study of tuning MapReduce for efficient Bioinformatics in the cloud. United States: N. p., 2016. Web. doi:10.1016/j.parco.2016.10.002.


Shi, Lizhen, Wang, Zhong, Yu, Weikuan, & Meng, Xiandong. A case study of tuning MapReduce for efficient Bioinformatics in the cloud. United States. https://doi.org/10.1016/j.parco.2016.10.002


Shi, Lizhen, Wang, Zhong, Yu, Weikuan, and Meng, Xiandong. 2016. "A case study of tuning MapReduce for efficient Bioinformatics in the cloud". United States. https://doi.org/10.1016/j.parco.2016.10.002. https://www.osti.gov/servlets/purl/1393100.


                    
@article{osti_1393100,

  title        = {A case study of tuning MapReduce for efficient Bioinformatics in the cloud},

  author       = {Shi, Lizhen and Wang, Zhong and Yu, Weikuan and Meng, Xiandong},

  doi          = {10.1016/j.parco.2016.10.002},

  journal      = {Parallel Computing},

  volume       = 61,

  place        = {United States},

  year         = {2016},

  month        = {10}

}

Citation Metrics:

Cited by: 9 works (citation information provided by Web of Science)

Works referenced in this record:

The impact of next-generation sequencing on genomics
journal, March 2011

Zhang, Jun; Chiodini, Rod; Badr, Ahmed
Journal of Genetics and Genomics, Vol. 38, Issue 3
DOI: 10.1016/j.jgg.2011.02.003

Metagenomic Discovery of Biomass-Degrading Genes and Genomes from Cow Rumen
journal, January 2011

Hess, M.; Sczyrba, A.; Egan, R.
Science, Vol. 331, Issue 6016
DOI: 10.1126/science.1200387

MapReduce: simplified data processing on large clusters
journal, January 2008

Dean, Jeffrey; Ghemawat, Sanjay; Mehta, Brijesh
Communications of the ACM, Vol. 51, Issue 1
DOI: 10.1145/1327452.1327492

CloudBurst: highly sensitive read mapping with MapReduce
journal, April 2009

Schatz, M. C.
Bioinformatics, Vol. 25, Issue 11
DOI: 10.1093/bioinformatics/btp236

Searching for SNPs with cloud computing
journal, January 2009

Langmead, Ben; Schatz, Michael C.; Lin, Jimmy
Genome Biology, Vol. 10, Issue 11
DOI: 10.1186/gb-2009-10-11-r134

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data
journal, July 2010

McKenna, A.; Hanna, M.; Banks, E.
Genome Research, Vol. 20, Issue 9
DOI: 10.1101/gr.107524.110

Cloud-scale RNA-sequencing differential expression analysis with Myrna
journal, January 2010

Langmead, Ben; Hansen, Kasper D.; Leek, Jeffrey T.
Genome Biology, Vol. 11, Issue 8
DOI: 10.1186/gb-2010-11-8-r83

Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences
journal, January 2010

Goecks, Jeremy; Nekrutenko, Anton; Taylor, James
Genome Biology, Vol. 11, Issue 8
DOI: 10.1186/gb-2010-11-8-r86

SEAL: a distributed short read mapping and duplicate removal tool
journal, June 2011

Pireddu, L.; Leo, S.; Zanetti, G.
Bioinformatics, Vol. 27, Issue 15
DOI: 10.1093/bioinformatics/btr325

CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping
journal, June 2011

Nguyen, Tung; Shi, Weisong; Ruden, Douglas
BMC Research Notes, Vol. 4, Issue 1
DOI: 10.1186/1756-0500-4-171

FX: an RNA-Seq analysis tool on the cloud
journal, January 2012

Hong, Dongwan; Rhie, Arang; Park, Sung-Soo
Bioinformatics, Vol. 28, Issue 5
DOI: 10.1093/bioinformatics/bts023

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop
journal, October 2013

Schumacher, André; Pireddu, Luca; Niemenmaa, Matti
Bioinformatics, Vol. 30, Issue 1
DOI: 10.1093/bioinformatics/btt601

Works referencing / citing this record:

Computational Strategies for Scalable Genomics Analysis
journal, December 2019

Shi, Lizhen; Wang, Zhong
Genes, Vol. 10, Issue 12
DOI: 10.3390/genes10121017

Similar Records in DOE PAGES and OSTI.GOV collections:

An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics

Journal Article: Taylor, Ronald C - BMC Bioinformatics, 11(Suppl 12):S1

Bioinformatics researchers are increasingly confronted with analysis of ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years. Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte scale data warehouses on Linux clusters, providing fault-tolerant parallelized analysis on such data using a programming style named MapReduce. An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase ...
https://doi.org/10.1186/1471-2105-11-S12-S1

Center for Technology for Advanced Scientific Component Software (TASCS)

Technical Report: Govindaraju, Madhusudhan

Advanced Scientific Computing Research, Computer Science, FY 2010 Report. Center for Technology for Advanced Scientific Component Software: Distributed CCA. State University of New York, Binghamton, NY, 13902. Summary: The overall objective of Binghamton's involvement is to work on enhancements of the CCA environment, motivated by the applications and research initiatives discussed in the proposal. This year we are working on re-focusing our design and development efforts to develop proof-of-concept implementations that have the potential to significantly impact scientific components. We worked on developing parallel implementations for non-hydrostatic code and worked on a model coupling interface for biogeochemical computations coded in MATLAB. ...
https://doi.org/10.2172/1092881

Scalable Regression Tree Learning on Hadoop using OpenPlanet

Conference: Yin, Wei; Simmhan, Yogesh; Prasanna, Viktor

As scientific and engineering domains attempt to effectively analyze the deluge of data arriving from sensors and instruments, machine learning is becoming a key data mining tool to build prediction models. Regression tree is a popular learning model that combines decision trees and linear regression to forecast numerical target variables based on a set of input features. MapReduce is well suited for addressing such data-intensive learning applications, and a proprietary regression tree algorithm, PLANET, using MapReduce has been proposed earlier. In this paper, we describe an open source implementation of this algorithm, OpenPlanet, on the Hadoop framework using ...
https://doi.org/10.1145/2287016.2287027

Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Journal Article: Lin, Han; Su, Zhichao; Meng, Xiandong; ... - International Journal of Parallel Programming

Metagenomics, the study of all microbial species cohabitants in an environment, usually produces large amounts of sequence data varying from several GBs to a few TBs. Analyzing metagenomics data includes both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions the shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also made some improvements to the existing BioPig toolkit by using simplified data types and compressed k-mer storage. ...
Cited by: 1
https://doi.org/10.1007/s10766-017-0524-z

MROrchestrator: A Fine-Grained Resource Orchestration Framework for MapReduce Clusters

Conference: Sharma, Bikash; Prabhakar, Ramya; Kandemir, Mahmut; ...

Efficient resource management in data centers and clouds running large distributed data processing frameworks like MapReduce is crucial for enhancing the performance of hosted applications and boosting resource utilization. However, existing resource scheduling schemes in Hadoop MapReduce allocate resources at the granularity of fixed-size, static portions of nodes, called slots. In this work, we show that MapReduce jobs have widely varying demands for multiple resources, making the static and fixed-size slot-level resource allocation a poor choice both from the performance and resource utilization standpoints. Furthermore, lack of co-ordination in the management of multiple resources across nodes prevents dynamic slot ...
