DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: The parallelism motifs of genomic data analysis

Journal Article · Philosophical Transactions of the Royal Society. A, Mathematical, Physical and Engineering Sciences
[1];  [1];  [2];  [3];  [1];  [4];  [2];  [1];  [5];  [1];  [2];  [2];  [1];  [2]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  3. Indiana Univ., Bloomington, IN (United States)
  4. USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States)
  5. Intel Labs, Santa Clara, CA (United States)

Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF)
Grant/Contract Number:
AC02-05CH11231; SC0008700; 1823034; AC05-00OR22725
OSTI ID:
1598527
Journal Information:
Philosophical Transactions of the Royal Society. A, Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166; ISSN 1364-503X
Publisher:
The Royal Society Publishing
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 11 works
Citation information provided by
Web of Science
