The parallelism motifs of genomic data analysis
Abstract
Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
- Authors:
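- Yelick, Katherine; Buluç, Aydın; Awan, Muaaz; Azad, Ariful; Brock, Benjamin; Egan, Rob; Ekanayake, Saliya; Ellis, Marquita; Georganas, Evangelos; Guidi, Giulia; Hofmeyr, Steven; Selvitopi, Oguz; Teodoropol, Cristina; Oliker, Leonid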
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Indiana Univ., Bloomington, IN (United States)
- USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States)
- Intel Labs, Santa Clara, CA (United States)
- Publication Date:
- Research Org.:
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF)
- OSTI Identifier:
- 1598527
- Grant/Contract Number:
- AC02-05CH11231; SC0008700; 1823034; AC05-00OR22725
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Philosophical Transactions of the Royal Society. A, Mathematical, Physical and Engineering Sciences
- Additional Journal Information:
- Journal Volume: 378; Journal Issue: 2166; Journal ID: ISSN 1364-503X
- Publisher:
- The Royal Society Publishing
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; bioinformatics; high-performance data analytics; parallel computing
Citation Formats
Yelick, Katherine, Buluç, Aydın, Awan, Muaaz, Azad, Ariful, Brock, Benjamin, Egan, Rob, Ekanayake, Saliya, Ellis, Marquita, Georganas, Evangelos, Guidi, Giulia, Hofmeyr, Steven, Selvitopi, Oguz, Teodoropol, Cristina, and Oliker, Leonid. The parallelism motifs of genomic data analysis. United States: N. p., 2020. Web. doi:10.1098/rsta.2019.0394.
Yelick, Katherine, Buluç, Aydın, Awan, Muaaz, Azad, Ariful, Brock, Benjamin, Egan, Rob, Ekanayake, Saliya, Ellis, Marquita, Georganas, Evangelos, Guidi, Giulia, Hofmeyr, Steven, Selvitopi, Oguz, Teodoropol, Cristina, & Oliker, Leonid. The parallelism motifs of genomic data analysis. United States. https://doi.org/10.1098/rsta.2019.0394
Yelick, Katherine, Buluç, Aydın, Awan, Muaaz, Azad, Ariful, Brock, Benjamin, Egan, Rob, Ekanayake, Saliya, Ellis, Marquita, Georganas, Evangelos, Guidi, Giulia, Hofmeyr, Steven, Selvitopi, Oguz, Teodoropol, Cristina, and Oliker, Leonid. Mon . "The parallelism motifs of genomic data analysis". United States. https://doi.org/10.1098/rsta.2019.0394. https://www.osti.gov/servlets/purl/1598527.
@article{osti_1598527,
title = {The parallelism motifs of genomic data analysis},
author = {Yelick, Katherine and Buluç, Aydın and Awan, Muaaz and Azad, Ariful and Brock, Benjamin and Egan, Rob and Ekanayake, Saliya and Ellis, Marquita and Georganas, Evangelos and Guidi, Giulia and Hofmeyr, Steven and Selvitopi, Oguz and Teodoropol, Cristina and Oliker, Leonid},
abstractNote = {Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.},
doi = {10.1098/rsta.2019.0394},
journal = {Philosophical Transactions of the Royal Society. A, Mathematical, Physical and Engineering Sciences},
number = 2166,
volume = 378,
place = {United States},
year = {2020},
month = {1}
}