OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: SWAP-Assembler 2: Optimization of De Novo Genome Assembler at Large Scale

Conference · DOI: https://doi.org/10.1109/ICPP.2016.29 · OSTI ID: 1336466

In this paper, we analyze and optimize the most time-consuming steps of the SWAP-Assembler, a parallel genome assembler, so that it can scale to a large number of cores for huge genomes with sequencing data ranging in size from terabytes to petabytes. According to the performance analysis, the most time-consuming steps are input parallelization, k-mer graph construction, and graph simplification (edge merging). For input parallelization, the input data is divided into virtual fragments of nearly equal size, and the start and end positions of each fragment are automatically adjusted to fall at the beginning of a read. In k-mer graph construction, to improve communication efficiency, the message size between any two processes is kept constant by increasing the number of nucleotides read in each round of the input parallelization step in proportion to the number of processes. Memory usage also decreases because only a small part of the input data is processed in each round. In graph simplification, the communication protocol reduces the number of communication loops from four to two and decreases idle communication time. The optimized assembler is denoted SWAP-Assembler 2 (SWAP2). In our experiments using a 4-terabyte 1000 Genomes Project dataset (the largest dataset ever used for assembly) on the supercomputer Mira, SWAP2 scales to 131,072 cores with an efficiency of 40%. We also compared our work with both the HipMer assembler and the SWAP-Assembler. On the 300-gigabyte Yanhuang dataset, SWAP2 shows a 3X speedup and 4X better scalability compared with the HipMer assembler and is 45 times faster than the SWAP-Assembler. The SWAP2 software is available at https://sourceforge.net/projects/swapassembler.
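
The input parallelization step described in the abstract can be illustrated with a small sketch. This is not the SWAP2 implementation, and the abstract does not specify file formats or function names. The code below assumes FASTA input held in memory as bytes, where every read begins with a `>` at the start of a line; `read_start_at_or_after` and `virtual_fragments` are hypothetical helper names. It cuts the data into nearly equal pieces and then moves each cut forward to the next read boundary, so no read is split between fragments.

```python
def read_start_at_or_after(data: bytes, pos: int) -> int:
    """Offset of the first FASTA header ('>' at a line start) at or after pos.

    Assumption: data begins with a header, so offset 0 is a read start.
    """
    if pos <= 0:
        return 0
    if pos >= len(data):
        return len(data)
    # Searching from pos - 1 also catches a header that starts exactly at pos.
    hit = data.find(b"\n>", pos - 1)
    return len(data) if hit == -1 else hit + 1


def virtual_fragments(data: bytes, n: int) -> list:
    """Split data into n fragments of nearly equal size, aligned to read starts.

    Returns a list of (start, end) byte offsets, end exclusive. A fragment can
    be empty when n exceeds the number of reads.
    """
    # Nominal cuts at i * size / n, each pushed forward to a read boundary.
    cuts = [read_start_at_or_after(data, i * len(data) // n) for i in range(n)]
    cuts.append(len(data))
    return [(cuts[i], cuts[i + 1]) for i in range(n)]


if __name__ == "__main__":
    sample = b">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n>r4\nCCGG\n"
    for start, end in virtual_fragments(sample, 3):
        print(start, end, sample[start:end])
```

In a distributed setting each process would compute only its own pair of offsets and read that byte range from the shared file, so the fragments stay "virtual" and the input is never physically split.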

Research Organization:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Basic Energy Sciences (BES); National Natural Science Foundation of China (NSFC)
DOE Contract Number:
AC02-06CH11357
OSTI ID:
1336466
Resource Relation:
Conference: 45th International Conference on Parallel Processing, 08/16/16 - 08/19/16, Philadelphia, PA, US
Country of Publication:
United States
Language:
English

Similar Records

SWAP-Assembler: scalable and efficient genome assembly towards thousands of cores
Journal Article · Sep 10, 2014 · BMC Bioinformatics

PaKman: A Scalable Algorithm for Generating Genomic Contigs on Distributed Memory Machines
Journal Article · May 1, 2021 · IEEE Transactions on Parallel and Distributed Systems

Optimizing de novo genome assembly from PCR-amplified metagenomes
Journal Article · Dec 28, 2018
