DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: The parallelism motifs of genomic data analysis

Abstract

Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.

Authors:
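Yelick, Katherine [1]; Buluç, Aydın [1]; Awan, Muaaz [2]; Azad, Ariful [3]; Brock, Benjamin [1]; Egan, Rob [4]; Ekanayake, Saliya [2]; Ellis, Marquita [1]; Georganas, Evangelos [5]; Guidi, Giulia [1]; Hofmeyr, Steven [2]; Selvitopi, Oguz [2]; Teodoropol, Cristina [1]; Oliker, Leonid [2]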
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  3. Indiana Univ., Bloomington, IN (United States)
  4. USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States)
  5. Intel Labs, Santa Clara, CA (United States)
Publication Date:
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21); National Science Foundation (NSF)
OSTI Identifier:
1598527
Grant/Contract Number:
AC02-05CH11231; SC0008700; 1823034; AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
Philosophical Transactions of the Royal Society. A, Mathematical, Physical and Engineering Sciences
Additional Journal Information:
Journal Volume: 378; Journal Issue: 2166; Journal ID: ISSN 1364-503X
Publisher:
The Royal Society Publishing
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; bioinformatics; high-performance data analytics; parallel computing

Citation Formats

Yelick, Katherine, Buluç, Aydın, Awan, Muaaz, Azad, Ariful, Brock, Benjamin, Egan, Rob, Ekanayake, Saliya, Ellis, Marquita, Georganas, Evangelos, Guidi, Giulia, Hofmeyr, Steven, Selvitopi, Oguz, Teodoropol, Cristina, and Oliker, Leonid. The parallelism motifs of genomic data analysis. United States: N. p., 2020. Web. doi:10.1098/rsta.2019.0394.
Yelick, Katherine, Buluç, Aydın, Awan, Muaaz, Azad, Ariful, Brock, Benjamin, Egan, Rob, Ekanayake, Saliya, Ellis, Marquita, Georganas, Evangelos, Guidi, Giulia, Hofmeyr, Steven, Selvitopi, Oguz, Teodoropol, Cristina, & Oliker, Leonid. The parallelism motifs of genomic data analysis. United States. doi:10.1098/rsta.2019.0394.
Yelick, Katherine, Buluç, Aydın, Awan, Muaaz, Azad, Ariful, Brock, Benjamin, Egan, Rob, Ekanayake, Saliya, Ellis, Marquita, Georganas, Evangelos, Guidi, Giulia, Hofmeyr, Steven, Selvitopi, Oguz, Teodoropol, Cristina, and Oliker, Leonid. 2020. "The parallelism motifs of genomic data analysis". United States. doi:10.1098/rsta.2019.0394. https://www.osti.gov/servlets/purl/1598527.
@article{osti_1598527,
title = {The parallelism motifs of genomic data analysis},
author = {Yelick, Katherine and Buluç, Aydın and Awan, Muaaz and Azad, Ariful and Brock, Benjamin and Egan, Rob and Ekanayake, Saliya and Ellis, Marquita and Georganas, Evangelos and Guidi, Giulia and Hofmeyr, Steven and Selvitopi, Oguz and Teodoropol, Cristina and Oliker, Leonid},
doi = {10.1098/rsta.2019.0394},
journal = {Philosophical Transactions of the Royal Society. A, Mathematical, Physical and Engineering Sciences},
number = {2166},
volume = {378},
place = {United States},
year = {2020},
month = {1}
}
