DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A case study of tuning MapReduce for efficient Bioinformatics in the cloud

Abstract

The combination of the Hadoop MapReduce programming model and cloud computing allows biological scientists to analyze next-generation sequencing (NGS) data in a timely and cost-effective manner. Cloud computing platforms remove the burden of IT facility procurement and management from end users and provide easy access to Hadoop clusters. However, biological scientists are still expected to choose appropriate Hadoop parameters for running their jobs. Moreover, the available Hadoop tuning guidelines are either obsolete or too general to capture the particular characteristics of bioinformatics applications. When MapReduce-based bioinformatics tools run in the cloud, the default settings often lead to resource underutilization and wasted expense. In this paper, we aim to minimize the cloud computing cost of bioinformatics data analysis by identifying the significant Hadoop parameters and optimizing them. We choose k-mer counting, a representative application used in a large number of NGS data analysis tools, as our case study. Experimental results show that, with the fine-tuned parameters, we achieve an overall 4× speedup over the original performance (using the default settings). Finally, this paper presents an exemplary case for tuning MapReduce-based bioinformatics applications in the cloud, and documents the key parameters that could lead to significant performance benefits.
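
For readers unfamiliar with the benchmark application: k-mer counting tallies every length-k substring across a set of sequencing reads, which maps naturally onto the MapReduce model. The following is a minimal single-process Python sketch of that map and reduce structure, not the Hadoop implementation evaluated in the paper. The function names (map_kmers, reduce_counts, count_kmers) and the sample reads are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import chain


def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for each length-k substring of one read."""
    read = read.upper()
    # A read shorter than k yields an empty range and therefore no pairs.
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct k-mer."""
    counts = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts


def count_kmers(reads, k):
    """Run the map step over every read, then reduce all emitted pairs."""
    emitted = chain.from_iterable(map_kmers(read, k) for read in reads)
    return reduce_counts(emitted)


# Hypothetical toy input: two short reads that share all of their 3-mers.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads, 3)
for kmer in sorted(counts):
    print(kmer, counts[kmer])
```

In a real Hadoop job the map outputs would be shuffled across the cluster by key before reduction, which is where parameters such as memory limits and task counts start to matter; this sketch deliberately leaves that part out.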

Authors:
Shi, Lizhen [1]; Wang, Zhong [2]; Yu, Weikuan [1]; Meng, Xiandong [2]
  1. Florida State Univ., Tallahassee, FL (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
2016-10-06
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC); National Science Foundation (NSF)
OSTI Identifier:
1393100
Alternate Identifier(s):
OSTI ID: 1398720
Grant/Contract Number:  
AC02-05CH11231; 1561041; 1564647
Resource Type:
Accepted Manuscript
Journal Name:
Parallel Computing
Additional Journal Information:
Journal Volume: 61; Journal ID: ISSN 0167-8191
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Hadoop; YARN; Parameter optimization; K-mer counting; NGS

Citation Formats

Shi, Lizhen, Wang, Zhong, Yu, Weikuan, and Meng, Xiandong. A case study of tuning MapReduce for efficient Bioinformatics in the cloud. United States: N. p., 2016. Web. doi:10.1016/j.parco.2016.10.002.
Shi, Lizhen, Wang, Zhong, Yu, Weikuan, & Meng, Xiandong. A case study of tuning MapReduce for efficient Bioinformatics in the cloud. United States. https://doi.org/10.1016/j.parco.2016.10.002
Shi, Lizhen, Wang, Zhong, Yu, Weikuan, and Meng, Xiandong. 2016. "A case study of tuning MapReduce for efficient Bioinformatics in the cloud". United States. https://doi.org/10.1016/j.parco.2016.10.002. https://www.osti.gov/servlets/purl/1393100.
@article{osti_1393100,
title = {A case study of tuning MapReduce for efficient Bioinformatics in the cloud},
author = {Shi, Lizhen and Wang, Zhong and Yu, Weikuan and Meng, Xiandong},
doi = {10.1016/j.parco.2016.10.002},
journal = {Parallel Computing},
volume = 61,
place = {United States},
year = {2016},
month = {10}
}

Citation Metrics:
Cited by: 9 works
Citation information provided by Web of Science
