A case study of tuning MapReduce for efficient Bioinformatics in the cloud
- Florida State Univ., Tallahassee, FL (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
The combination of the Hadoop MapReduce programming model and cloud computing allows biological scientists to analyze next-generation sequencing (NGS) data in a timely and cost-effective manner. Cloud computing platforms remove the burden of IT facility procurement and management from end users and provide easy access to Hadoop clusters. However, biological scientists are still expected to choose appropriate Hadoop parameters for running their jobs. More importantly, the available Hadoop tuning guidelines are either obsolete or too general to capture the particular characteristics of bioinformatics applications. In this paper, we aim to minimize the cloud computing cost of bioinformatics data analysis by identifying the most significant Hadoop parameters and optimizing them. When MapReduce-based bioinformatics tools are run in the cloud, the default settings often lead to resource underutilization and wasted expense. We choose k-mer counting, a representative application used in a large number of NGS data analysis tools, as our case study. Experimental results show that, with the fine-tuned parameters, we achieve a total speedup of 4× over the performance obtained with the default settings. Finally, this paper presents an example case for tuning MapReduce-based bioinformatics applications in the cloud, and documents the key parameters that could lead to significant performance benefits.
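For readers unfamiliar with the case-study workload, the following is a minimal, single-process sketch of k-mer counting expressed in map and reduce steps. It is an illustration only, not the paper's implementation and not Hadoop code; the function names (`map_kmers`, `reduce_counts`, `count_kmers`) and the sample reads are assumptions made for this example.

```python
from collections import Counter
from itertools import chain


def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for every length-k window in one read."""
    read = read.upper()
    # A read shorter than k yields an empty range, so no pairs are emitted.
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each k-mer key."""
    counts = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts


def count_kmers(reads, k):
    """Run the map step over all reads, then reduce the emitted pairs."""
    return reduce_counts(chain.from_iterable(map_kmers(r, k) for r in reads))


# Hypothetical sample reads, used only to exercise the sketch.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
```

In a real Hadoop job the pairs emitted by the map step are shuffled across the cluster before being reduced, so the volume of intermediate (k-mer, 1) pairs is one plausible reason why parameter choices can matter for this kind of workload.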
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC); National Science Foundation (NSF)
- Grant/Contract Number:
- AC02-05CH11231; 1561041; 1564647
- OSTI ID:
- 1393100
- Alternate ID(s):
- OSTI ID: 1398720
- Journal Information:
- Parallel Computing, Vol. 61; ISSN 0167-8191
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English