Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly
Abstract
One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates consensus. Despite many algorithms developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for the overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed-memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near-linear scaling with over 80% parallel efficiency for the human genome, reducing the runtime for overlap detection by 1.2-1.3× for the human genome and 1.5-1.9× for C. elegans compared to the state-of-the-art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by 10.5-13.3× for the human genome and 18-29× for C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.
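As a rough illustration of the semiring formulation the abstract describes, the following single-process Python sketch shows the two ideas on toy data. It is not the diBELLA 2D implementation: the function names, the three toy reads, the hand-written overhang weights, and the `fuzz` tolerance are all assumptions made for this example, and plain dictionaries stand in for 2D distributed sparse matrices. Candidate overlaps are counted as the nonzeros of a reads-by-k-mers matrix multiplied by its transpose, and an overlap edge is treated as transitive when a two-hop path, found by squaring the graph over the (min, +) semiring, is no longer than the edge itself.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(reads, k):
    """Count shared k-mers per read pair.

    Equivalent to the off-diagonal nonzeros of A * A^T, where A is a sparse
    0/1 reads-by-k-mers matrix (toy formulation, assumed for illustration).
    """
    columns = defaultdict(set)  # k-mer -> ids of the reads that contain it
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            columns[seq[i:i + k]].add(rid)
    shared = defaultdict(int)
    for rids in columns.values():
        for i, j in combinations(sorted(rids), 2):
            shared[(i, j)] += 1
    return dict(shared)


def min_plus_square(R):
    """N[i][t] = min over j of R[i][j] + R[j][t], i.e. R * R over (min, +)."""
    N = {}
    for i, row in R.items():
        out = N.setdefault(i, {})
        for j, w_ij in row.items():
            for t, w_jt in R.get(j, {}).items():
                cand = w_ij + w_jt
                if t not in out or cand < out[t]:
                    out[t] = cand
    return N


def transitive_reduction(R, fuzz=0):
    """Drop edge i->t when some two-hop path i->j->t costs at most w(i,t) + fuzz.

    Simplified single-pass rule assumed for this sketch; weights are
    positive overhang lengths and the graph has no self-loops.
    """
    N = min_plus_square(R)
    return {
        i: {t: w for t, w in row.items()
            if not (t in N[i] and N[i][t] <= w + fuzz)}
        for i, row in R.items()
    }


# Toy reads: length-10 windows taken every 3 bases from one short sequence.
reads = ["ACGTTGCATG", "TTGCATGGAC", "CATGGACCTA"]
shared = candidate_overlaps(reads, 4)

# Hand-specified overlap graph: edge weight = overhang of the target read.
overlap_graph = {0: {1: 3, 2: 6}, 1: {2: 3}}
string_graph = transitive_reduction(overlap_graph)
print(shared)
print(string_graph)
```

In this toy case the edge from read 0 to read 2 is removed because the path through read 1 has the same total overhang, leaving a simple chain. A real pipeline would derive the overhangs from pairwise alignments rather than specify them by hand.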
- Authors:
- Guidi, Giulia; Selvitopi, Oguz; Ellis, Marquita; Oliker, Leonid; Yelick, Katherine; Buluc, Aydin
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Publication Date:
- May 2021
- Research Org.:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- OSTI Identifier:
- 1818231
- Grant/Contract Number:
- AC02-05CH11231
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- Additional Journal Information:
- Journal Name: Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS); Conference: 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Portland, OR (United States), 17-21 May 2021; Journal ID: ISSN 1530-2075
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES
Citation Formats
Guidi, Giulia, Selvitopi, Oguz, Ellis, Marquita, Oliker, Leonid, Yelick, Katherine, and Buluc, Aydin. Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly. United States: N. p., 2021. Web. doi:10.1109/ipdps49936.2021.00060.
Guidi, Giulia, Selvitopi, Oguz, Ellis, Marquita, Oliker, Leonid, Yelick, Katherine, & Buluc, Aydin. Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly. United States. https://doi.org/10.1109/ipdps49936.2021.00060
Guidi, Giulia, Selvitopi, Oguz, Ellis, Marquita, Oliker, Leonid, Yelick, Katherine, and Buluc, Aydin. 2021. "Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly". United States. https://doi.org/10.1109/ipdps49936.2021.00060. https://www.osti.gov/servlets/purl/1818231.
@article{osti_1818231,
title = {Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly},
author = {Guidi, Giulia and Selvitopi, Oguz and Ellis, Marquita and Oliker, Leonid and Yelick, Katherine and Buluc, Aydin},
abstractNote = {One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates consensus. Despite many algorithms developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for the overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed-memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near-linear scaling with over 80% parallel efficiency for the human genome, reducing the runtime for overlap detection by 1.2-1.3× for the human genome and 1.5-1.9× for C. elegans compared to the state-of-the-art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by 10.5-13.3× for the human genome and 18-29× for C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.},
doi = {10.1109/ipdps49936.2021.00060},
journal = {Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
place = {United States},
year = {2021},
month = {5}
}
Works referenced in this record:
Efficient counting of k-mers in DNA sequences using a bloom filter
journal, August 2011
- Melsted, Páll; Pritchard, Jonathan K.
- BMC Bioinformatics, Vol. 12, Issue 1
PacBio Sequencing and Its Applications
journal, October 2015
- Rhoads, Anthony; Au, Kin Fai
- Genomics, Proteomics & Bioinformatics, Vol. 13, Issue 5
FSG: Fast String Graph Construction for De Novo Assembly
journal, October 2017
- Bonizzoni, Paola; Della Vedova, Gianluca; Pirola, Yuri
- Journal of Computational Biology, Vol. 24, Issue 10
SeqAn An efficient, generic C++ library for sequence analysis
journal, January 2008
- Döring, Andreas; Weese, David; Rausch, Tobias
- BMC Bioinformatics, Vol. 9, Issue 1
Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome
collection, January 2019
- Wenger, Aaron M.; Peluso, Paul; Rowell, William J.
- Universität des Saarlandes
Communication optimal parallel multiplication of sparse random matrices
conference, January 2013
- Ballard, Grey; Buluc, Aydin; Demmel, James
- Proceedings of the 25th ACM symposium on Parallelism in algorithms and architectures - SPAA '13
Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication
conference, September 2008
- Buluc, Aydin; Gilbert, John R.
- 2008 37th International Conference on Parallel Processing (ICPP)
Genetic variation and the de novo assembly of human genomes
journal, October 2015
- Chaisson, Mark J. P.; Wilson, Richard K.; Eichler, Evan E.
- Nature Reviews Genetics, Vol. 16, Issue 11
The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community
journal, November 2016
- Jain, Miten; Olsen, Hugh E.; Paten, Benedict
- Genome Biology, Vol. 17, Issue 1
Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome
journal, August 2019
- Wenger, Aaron M.; Peluso, Paul; Rowell, William J.
- Nature Biotechnology, Vol. 37, Issue 10
The fragment assembly string graph
journal, September 2005
- Myers, E. W.
- Bioinformatics, Vol. 21, Issue Suppl 2
Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences
journal, March 2016
- Li, Heng
- Bioinformatics, Vol. 32, Issue 14
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
journal, March 2017
- Koren, Sergey; Walenz, Brian P.; Berlin, Konstantin
- Genome Research, Vol. 27, Issue 5
Parallel Construction of Bidirected String Graphs for Genome Assembly
conference, September 2008
- Jackson, Benjamin G.; Aluru, Srinivas
- 2008 37th International Conference on Parallel Processing
Minimap2: pairwise alignment for nucleotide sequences
journal, May 2018
- Li, Heng
- Bioinformatics, Vol. 34, Issue 18
Indexing compressed text
journal, July 2005
- Ferragina, Paolo; Manzini, Giovanni
- Journal of the ACM, Vol. 52, Issue 4
An External-Memory Algorithm for String Graph Construction
journal, May 2016
- Bonizzoni, Paola; Della Vedova, Gianluca; Pirola, Yuri
- Algorithmica, Vol. 78, Issue 2
Genome assembly forensics: finding the elusive mis-assembly
journal, January 2008
- Phillippy, Adam M.; Schatz, Michael C.; Pop, Mihai
- Genome Biology, Vol. 9, Issue 3
Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome
journal, October 2015
- Goodwin, Sara; Gurtowski, James; Ethe-Sayers, Scott
- Genome Research, Vol. 25, Issue 11
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
journal, May 2015
- Berlin, Konstantin; Koren, Sergey; Chin, Chen-Shan
- Nature Biotechnology, Vol. 33, Issue 6
The Combinatorial BLAS: design, implementation, and applications
journal, May 2011
- Buluç, Aydın; Gilbert, John R.
- The International Journal of High Performance Computing Applications, Vol. 25, Issue 4
Parametric Complexity of Sequence Assembly: Theory and Applications to Next Generation Sequencing
journal, July 2009
- Nagarajan, Niranjan; Pop, Mihai
- Journal of Computational Biology, Vol. 16, Issue 7
Efficient de novo assembly of large genomes using compressed data structures
journal, December 2011
- Simpson, J. T.; Durbin, R.
- Genome Research, Vol. 22, Issue 3
Real-Time DNA Sequencing from Single Polymerase Molecules
journal, January 2009
- Eid, John; Fehr, Adrian; Gray, Jeremy
- Science, Vol. 323, Issue 5910, p. 133-138
diBELLA: Distributed Long Read to Long Read Alignment
conference, January 2019
- Ellis, Marquita; Guidi, Giulia; Buluç, Aydın
- Proceedings of the 48th International Conference on Parallel Processing - ICPP 2019
A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies
journal, March 2011
- Zhang, Wenyu; Chen, Jiajia; Yang, Yang
- PLoS ONE, Vol. 6, Issue 3
Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices
conference, November 2020
- Selvitopi, Oguz; Ekanayake, Saliya; Guidi, Giulia
- SC20: International Conference for High Performance Computing, Networking, Storage and Analysis
Communication-Efficient Jaccard similarity for High-Performance Distributed Genome Comparisons
conference, May 2020
- Besta, Maciej; Kanakagiri, Raghavendra; Mustafa, Harun
- 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Apache Spark: a unified engine for big data processing
journal, October 2016
- Zaharia, Matei; Franklin, Michael J.; Ghodsi, Ali
- Communications of the ACM, Vol. 59, Issue 11
SORA: Scalable Overlap-graph Reduction Algorithms for Genome Assembly using Apache Spark in the Cloud
conference, December 2018
- Paul, Alexander J.; Lawrence, Dylan; Song, Myoungkyu
- 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
HipMer: an extreme-scale de novo genome assembler
conference, January 2015
- Georganas, Evangelos; Buluç, Aydın; Chapman, Jarrod
- Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15