
Title: PaKman: Scalable Assembly of Large Genomes on Distributed Memory Machines

Abstract

De novo genome assembly is a fundamental problem in the field of bioinformatics, that aims to assemble the DNA sequence of an unknown genome from numerous short DNA fragments (aka reads) obtained from it. With the advent of high-throughput sequencing technologies, billions of reads can be generated in a matter of hours, necessitating efficient parallelization of the assembly process. While multiple parallel solutions have been proposed in the past, conducting a large-scale assembly at scale remains a challenging problem because of the inherent complexities associated with data movement, and irregular access footprints of memory and I/O operations. In this paper, we present a novel algorithm, called PaKman, to address the problem of performing large-scale genome assemblies on a distributed memory parallel computer. Our approach focuses on improving performance through a combination of novel data structures and algorithmic strategies for reducing the communication and I/O footprint during the assembly process. A key aspect of our algorithm is its graph data structure, which comprises fat nodes (or what we call "macro-nodes") that reduce the communication burden during contig generation. We present an extensive performance and qualitative evaluation of our algorithm, including comparisons to other state-of-the-art parallel assemblers. Our results demonstrate the ability to achieve near-linear speedups on up to 8K cores (tested); outperform state-of-the-art distributed memory and shared memory tools in performance while delivering comparable (if not better) quality; and reduce time to solution significantly. For instance, PaKman is able to complete an assembly of the full human genome in just over a minute on 8K cores.

Authors:
 Ghosh, Priyanka [1]; Krishnamoorthy, Sriram [2]; Kalyanaraman, Anantharaman [3]
  1. WASHINGTON STATE UNIV
  2. BATTELLE (PACIFIC NW LAB)
  3. Washington State University
Publication Date:
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1617871
Report Number(s):
PNNL-SA-138919
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: IEEE International Parallel & Distributed Processing Symposium (IPDPS 2019), May 20-24, 2019, Rio de Janeiro, Brazil
Country of Publication:
United States
Language:
English

Citation Formats

Ghosh, Priyanka, Krishnamoorthy, Sriram, and Kalyanaraman, Anantharaman. PaKman: Scalable Assembly of Large Genomes on Distributed Memory Machines. United States: N. p., 2019. Web. doi:10.1109/IPDPS.2019.00067.
Ghosh, Priyanka, Krishnamoorthy, Sriram, & Kalyanaraman, Anantharaman. PaKman: Scalable Assembly of Large Genomes on Distributed Memory Machines. United States. doi:10.1109/IPDPS.2019.00067.
Ghosh, Priyanka, Krishnamoorthy, Sriram, and Kalyanaraman, Anantharaman. Mon . "PaKman: Scalable Assembly of Large Genomes on Distributed Memory Machines". United States. doi:10.1109/IPDPS.2019.00067.
@article{osti_1617871,
title = {PaKman: Scalable Assembly of Large Genomes on Distributed Memory Machines},
author = {Ghosh, Priyanka and Krishnamoorthy, Sriram and Kalyanaraman, Anantharaman},
abstractNote = {De novo genome assembly is a fundamental problem in the field of bioinformatics, that aims to assemble the DNA sequence of an unknown genome from numerous short DNA fragments (aka reads) obtained from it. With the advent of high-throughput sequencing technologies, billions of reads can be generated in a matter of hours, necessitating efficient parallelization of the assembly process. While multiple parallel solutions have been proposed in the past, conducting a large-scale assembly at scale remains a challenging problem because of the inherent complexities associated with data movement, and irregular access footprints of memory and I/O operations. In this paper, we present a novel algorithm, called PaKman, to address the problem of performing large-scale genome assemblies on a distributed memory parallel computer. Our approach focuses on improving performance through a combination of novel data structures and algorithmic strategies for reducing the communication and I/O footprint during the assembly process. A key aspect of our algorithm is its graph data structure, which comprises fat nodes (or what we call "macro-nodes") that reduce the communication burden during contig generation. We present an extensive performance and qualitative evaluation of our algorithm, including comparisons to other state-of-the-art parallel assemblers. Our results demonstrate the ability to achieve near-linear speedups on up to 8K cores (tested); outperform state-of-the-art distributed memory and shared memory tools in performance while delivering comparable (if not better) quality; and reduce time to solution significantly. For instance, PaKman is able to complete an assembly of the full human genome in just over a minute on 8K cores.},
doi = {10.1109/IPDPS.2019.00067},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2019},
month = {5}
}

