A case study of tuning MapReduce for efficient Bioinformatics in the cloud

Shi, Lizhen; Wang, Zhong; Yu, Weikuan; Meng, Xiandong

doi:10.1016/j.parco.2016.10.002

Journal Article · October 6, 2016 · Parallel Computing

DOI: https://doi.org/10.1016/j.parco.2016.10.002 · OSTI ID: 1393100

Shi, Lizhen ^[1]; Wang, Zhong ^[2]; Yu, Weikuan ^[1]; Meng, Xiandong ^[2]

^[1] Florida State Univ., Tallahassee, FL (United States)
^[2] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

The combination of the Hadoop MapReduce programming model and cloud computing allows biological scientists to analyze next-generation sequencing (NGS) data in a timely and cost-effective manner. Cloud computing platforms relieve end users of the burden of procuring and managing IT facilities and provide easy access to Hadoop clusters. However, biological scientists are still expected to choose appropriate Hadoop parameters for their jobs, and the available Hadoop tuning guidelines are either obsolete or too general to capture the particular characteristics of bioinformatics applications. When MapReduce-based bioinformatics tools run in the cloud with the default settings, the result is often resource underutilization and wasted expense. In this paper, we aim to minimize the cloud computing cost of bioinformatics data analysis by identifying the Hadoop parameters that significantly affect performance and optimizing them. We choose k-mer counting, a representative application used in a large number of NGS data analysis tools, as our case study. Experimental results show that the fine-tuned parameters achieve an overall 4× speedup over the original performance under the default settings. Finally, this paper presents an exemplary case for tuning MapReduce-based bioinformatics applications in the cloud and documents the key parameters that could lead to significant performance benefits.
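For readers unfamiliar with the case-study workload, the following is a minimal sketch of k-mer counting expressed as a map phase and a reduce phase. It is an in-memory illustration only, not the authors' Hadoop implementation; the function names `map_kmers`, `reduce_counts`, and `count_kmers` are hypothetical, and the shuffle step that Hadoop performs between the two phases is simulated by chaining the emitted pairs into one stream.

```python
from collections import Counter
from itertools import chain


def map_kmers(read, k):
    """Map phase: emit a (k-mer, 1) pair for every length-k window of one read."""
    read = read.upper()
    # Reads shorter than k produce an empty range and therefore no pairs.
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def reduce_counts(pairs):
    """Reduce phase: sum the emitted counts for each distinct k-mer."""
    totals = Counter()
    for kmer, count in pairs:
        totals[kmer] += count
    return dict(totals)


def count_kmers(reads, k):
    """Run the map phase over all reads, then reduce the combined pairs."""
    # Assumption: the shuffle is implicit; all pairs flow into a single reducer.
    return reduce_counts(chain.from_iterable(map_kmers(r, k) for r in reads))


if __name__ == "__main__":
    # Hypothetical toy reads; real NGS inputs contain millions of reads.
    print(count_kmers(["ACGTAC", "GTACGT"], 3))
```

Because every read expands into roughly one key-value pair per base, the intermediate data between map and reduce can be much larger than the input, which is one plausible reason this kind of workload is sensitive to MapReduce settings.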

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC); National Science Foundation (NSF)

Grant/Contract Number: AC02-05CH11231; 1561041; 1564647

OSTI ID: 1393100

Alternate ID(s): OSTI ID: 1398720

Journal Information: Parallel Computing, Vol. 61; ISSN 0167-8191

Publisher: Elsevier

Country of Publication: United States

Language: English

Citation Metrics: Cited by 9 works (citation information provided by Web of Science)

Related Subjects

97 MATHEMATICS AND COMPUTING
Hadoop
YARN
Parameter optimization
K-mer counting
NGS