OSTI.GOV: U.S. Department of Energy
Office of Scientific and Technical Information

Title: PaKman: A Scalable Algorithm for Generating Genomic Contigs on Distributed Memory Machines

Abstract

De novo genome assembly is a fundamental problem in the field of bioinformatics, that aims to assemble the DNA sequence of an unknown genome from numerous short DNA fragments (aka reads) obtained from it. With the advent of high-throughput sequencing technologies, billions of reads can be generated in a matter of hours, necessitating efficient parallelization of the assembly process. While multiple parallel solutions have been proposed in the past, conducting a large-scale assembly at scale remains a challenging problem because of the inherent complexities associated with data movement, and irregular access footprints of memory and I/O operations. In this article, we present a novel algorithm, called PaKman, to address the problem of performing large-scale genome assemblies on a distributed memory parallel computer. Our approach focuses on improving performance through a combination of novel data structures and algorithmic strategies for reducing the communication and I/O footprint during the assembly process. PaKman presents a solution for the two most time-consuming phases in the full genome assembly pipeline, namely, k-mer counting and contig generation. A key aspect of our algorithm is its graph data structure (PaK-Graph), which comprises fat nodes (or what we call “macro-nodes”) that reduce the communication burden during contig generation. We present an extensive performance and qualitative evaluation of our algorithm across a wide range of genomes (varying in both size and species group), including comparisons to other state-of-the-art parallel assemblers. Our results demonstrate the ability to achieve near-linear speedups on up to 16K cores (tested) on the NERSC Cori supercomputer; perform better than or comparable to other state-of-the-art distributed memory and shared memory tools in terms of performance while delivering comparable (if not better) quality; and reduce time to solution significantly. For instance, PaKman is able to generate a high-quality set of assembled contigs for complex genomes such as the human and bread wheat genomes in under a minute on 16K cores. In addition, PaKman was able to successfully process a 3.1 TB simulated dataset of one of the largest known genomes (to date), Ambystoma mexicanum (the axolotl), in just over 200 seconds on 16K cores.

Authors:
Ghosh, Priyanka [1]; Krishnamoorthy, Sriram [1]; Kalyanaraman, Ananth [2]
  1. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
  2. Washington State Univ., Pullman, WA (United States)
Publication Date:
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF)
OSTI Identifier:
1756647
Report Number(s):
PNNL-SA-157408
Journal ID: ISSN 1045-9219
Grant/Contract Number:  
AC05-76RL01830; AC02-05CH11231; 63823; CCF-1815467; OAC-1910213; CCF-1919122
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
Additional Journal Information:
Journal Volume: 32; Journal Issue: 5; Journal ID: ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
42 ENGINEERING; genome assembly; distributed memory; de bruijn graphs; k-mer counting

Citation Formats

Ghosh, Priyanka, Krishnamoorthy, Sriram, and Kalyanaraman, Ananth. PaKman: A Scalable Algorithm for Generating Genomic Contigs on Distributed Memory Machines. United States: N. p., 2021. Web. doi:10.1109/tpds.2020.3043241.
Ghosh, Priyanka, Krishnamoorthy, Sriram, & Kalyanaraman, Ananth. PaKman: A Scalable Algorithm for Generating Genomic Contigs on Distributed Memory Machines. United States. https://doi.org/10.1109/tpds.2020.3043241
Ghosh, Priyanka, Krishnamoorthy, Sriram, and Kalyanaraman, Ananth. Sat . "PaKman: A Scalable Algorithm for Generating Genomic Contigs on Distributed Memory Machines". United States. https://doi.org/10.1109/tpds.2020.3043241.
@article{osti_1756647,
title = {PaKman: A Scalable Algorithm for Generating Genomic Contigs on Distributed Memory Machines},
author = {Ghosh, Priyanka and Krishnamoorthy, Sriram and Kalyanaraman, Ananth},
abstractNote = {De novo genome assembly is a fundamental problem in the field of bioinformatics, that aims to assemble the DNA sequence of an unknown genome from numerous short DNA fragments (aka reads) obtained from it. With the advent of high-throughput sequencing technologies, billions of reads can be generated in a matter of hours, necessitating efficient parallelization of the assembly process. While multiple parallel solutions have been proposed in the past, conducting a large-scale assembly at scale remains a challenging problem because of the inherent complexities associated with data movement, and irregular access footprints of memory and I/O operations. In this article, we present a novel algorithm, called PaKman, to address the problem of performing large-scale genome assemblies on a distributed memory parallel computer. Our approach focuses on improving performance through a combination of novel data structures and algorithmic strategies for reducing the communication and I/O footprint during the assembly process. PaKman presents a solution for the two most time-consuming phases in the full genome assembly pipeline, namely, k-mer counting and contig generation. A key aspect of our algorithm is its graph data structure (PaK-Graph), which comprises fat nodes (or what we call “macro-nodes”) that reduce the communication burden during contig generation. We present an extensive performance and qualitative evaluation of our algorithm across a wide range of genomes (varying in both size and species group), including comparisons to other state-of-the-art parallel assemblers. Our results demonstrate the ability to achieve near-linear speedups on up to 16K cores (tested) on the NERSC Cori supercomputer; perform better than or comparable to other state-of-the-art distributed memory and shared memory tools in terms of performance while delivering comparable (if not better) quality; and reduce time to solution significantly. For instance, PaKman is able to generate a high-quality set of assembled contigs for complex genomes such as the human and bread wheat genomes in under a minute on 16K cores. In addition, PaKman was able to successfully process a 3.1 TB simulated dataset of one of the largest known genomes (to date), Ambystoma mexicanum (the axolotl), in just over 200 seconds on 16K cores.},
doi = {10.1109/tpds.2020.3043241},
url = {https://www.osti.gov/biblio/1756647},
journal = {IEEE Transactions on Parallel and Distributed Systems},
issn = {1045-9219},
number = 5,
volume = 32,
place = {United States},
year = {2021},
month = {5}
}

Journal Article:
Free Publicly Available Full Text
This content will become publicly available on May 1, 2021
Publisher's Version of Record
