Computational Strategies for Scalable Genomics Analysis
Abstract
The revolution in next-generation DNA sequencing technologies is driving explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms used for genomics analysis. Various big data technologies have been explored to scale current bioinformatics solutions up or out so that they can mine large genomics datasets. In this review, we survey some of these developments in the application of parallel distributed computing and specialized hardware to genomics. We comment on the pros and cons of each strategy in terms of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for computer scientists interested in genomics applications.
- Authors:
- Shi, Lizhen; Wang, Zhong
- Florida State Univ., Tallahassee, FL (United States)
- USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Merced, CA (United States)
- Publication Date:
- December 2019
- Research Org.:
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC)
- OSTI Identifier:
- 1599823
- Grant/Contract Number:
- AC02-05CH11231
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Genes
- Additional Journal Information:
- Journal Volume: 10; Journal Issue: 12; Journal ID: ISSN 2073-4425
- Publisher:
- MDPI
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; scalable genomics analysis; big data; high performance computing; cloud computing
Citation Formats
Shi, Lizhen, and Wang, Zhong. Computational Strategies for Scalable Genomics Analysis. United States: N. p., 2019. Web. doi:10.3390/genes10121017.
Shi, Lizhen, & Wang, Zhong. Computational Strategies for Scalable Genomics Analysis. United States. https://doi.org/10.3390/genes10121017
Shi, Lizhen, and Wang, Zhong. 2019. "Computational Strategies for Scalable Genomics Analysis". United States. https://doi.org/10.3390/genes10121017. https://www.osti.gov/servlets/purl/1599823.
@article{osti_1599823,
title = {Computational Strategies for Scalable Genomics Analysis},
author = {Shi, Lizhen and Wang, Zhong},
abstractNote = {The revolution in next-generation DNA sequencing technologies is leading to explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms for genomics analysis. Various big data technologies have been explored to scale up/out current bioinformatics solutions to mine the big genomics data. In this review, we survey some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics. We comment on the pros and cons of each strategy in the context of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for the audience of computer science with interests in genomics applications.},
doi = {10.3390/genes10121017},
journal = {Genes},
number = 12,
volume = 10,
place = {United States},
year = {2019},
month = {12}
}