Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

Journal Article · Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)
 [1];  [2];  [1];  [2];  [2];  [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates a consensus. Despite the many algorithms in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for the overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed-memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near-linear scaling with over 80% parallel efficiency for the human genome, reducing the runtime for overlap detection by 1.2-1.3× for the human genome and 1.5-1.9× for C. elegans compared to the state of the art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by 10.5-13.3× for the human genome and 18-29× for C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.
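
To make the idea of transitive reduction as a semiring operation concrete, the following is a minimal single-process sketch, not the diBELLA 2D algorithm itself. It assumes a simplified directed overlap graph stored as a sparse dict-of-dicts matrix whose edge weights stand in for overhang lengths, and it ignores read orientation (bidirected edges) and any distribution across processes. The function names, the `fuzz` tolerance, and the toy data are all hypothetical. One sparse product over the (min, +) semiring yields the shortest two-hop path between each pair of reads; a direct edge is dropped when such a path is no longer than the edge itself.

```python
def min_plus_square(A):
    """Return A (x) A over the (min, +) semiring; A is {row: {col: weight}}."""
    N = {}
    for i, row in A.items():
        out = {}
        for j, w_ij in row.items():
            for k, w_jk in A.get(j, {}).items():
                if k == i:
                    continue  # ignore two-hop paths that return to the start
                cand = w_ij + w_jk  # semiring "multiply" is ordinary addition
                if cand < out.get(k, float("inf")):
                    out[k] = cand  # semiring "add" is min
        if out:
            N[i] = out
    return N


def transitive_reduction(A, fuzz=0):
    """Drop edge (i, k) when some two-hop path i->j->k is no longer than it (+fuzz)."""
    N = min_plus_square(A)
    R = {}
    for i, row in A.items():
        kept = {k: w for k, w in row.items()
                if N.get(i, {}).get(k, float("inf")) > w + fuzz}
        if kept:
            R[i] = kept
    return R


# Toy overlap graph (assumed data): reads 0 -> 1 -> 2 overlap consecutively,
# so the direct edge 0 -> 2 is implied by the two-hop path and can be removed.
overlaps = {0: {1: 3, 2: 7}, 1: {2: 4}}
string_graph = transitive_reduction(overlaps)
print(string_graph)
```

This sketch performs a single pass, which removes only edges implied by two-hop paths; a full pipeline may need to repeat the pass until no further edges are removed.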

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1818231
Journal Information:
Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS); ISSN 1530-2075
Publisher:
IEEE
Country of Publication:
United States
Language:
English
