An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
Bioinformatics researchers are increasingly confronted with the analysis of ultra-large-scale data sets, a problem that will only grow in the coming years. Recent developments in open source software, namely the Hadoop project and associated software, provide a foundation for scaling to petabyte-scale data warehouses on Linux clusters and for fault-tolerant, parallelized analysis of such data using a programming style named MapReduce. An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase project are defined, and current bioinformatics software that employs Hadoop is described. The focus is on next-generation sequencing, the leading application area to date.
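As a rough illustration of the MapReduce programming style mentioned above, the following is a minimal single-process sketch in plain Python that counts k-mers in sequencing reads. It does not use Hadoop or any Hadoop API; the function names (`map_read`, `shuffle`, `reduce_counts`, `kmer_count`) and the sample reads are hypothetical and chosen only for illustration. On a real cluster, the map and reduce steps would run in parallel across nodes and the framework would handle the grouping step and fault tolerance.

```python
from collections import defaultdict


def map_read(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def kmer_count(reads, k):
    """Run map, shuffle, and reduce in sequence within one process."""
    pairs = (pair for read in reads for pair in map_read(read, k))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical sample reads, for illustration only.
reads = ["ACGTAC", "CGTACG"]
counts = kmer_count(reads, 3)
```

Because each read is mapped independently and each k-mer is reduced independently, both steps can in principle be distributed across machines, which is the property that frameworks in this style exploit.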
- Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 1019222
- Report Number(s): PNNL-SA-74925; KP1601030; TRN: US1103576
- Journal Information: BMC Bioinformatics, 11(Suppl 12):S1, Vol. 11, Issue 12
- Country of Publication: United States
- Language: English