Title: Optimization of SAMtools sorting using OpenMP tasks

Abstract

SAMtools is a widely used genomics application for post-processing high-throughput sequence alignment data. Such sequence alignment data are commonly sorted to make downstream analysis more efficient. However, this sorting process itself can be computationally- and I/O-intensive: high-throughput sequence alignment files in the de facto standard binary alignment/map (BAM) format can be many gigabytes in size, and may need to be decompressed before sorting and compressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper describes a case study on the performance analysis and optimization of SAMtools for sorting large BAM files. OpenMP task parallelism and memory optimization techniques resulted in a speedup of 5.9X versus the upstream SAMtools 1.3.1 for an internal (in-memory) sort of 24.6 GiB of compressed BAM data (102.6 GiB uncompressed) with 32 processor cores, while a 1.98X speedup was achieved for an external (out-of-core) sort of a 271.4 GiB BAM file.

Authors:
Weeks, Nathan T. [1]; Luecke, Glenn R. [2]
  1. Iowa State University, Ames, IA (United States). Department of Computer Science
  2. Iowa State University, Ames, IA (United States). Department of Mathematics
Publication Date:
Research Org.:
Lawrence Berkeley National Laboratory, Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC).
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1461875
DOE Contract Number:  
AC02-05CH11231
Resource Type:
Journal Article
Journal Name:
Cluster Computing
Additional Journal Information:
Journal Volume: 20; Journal Issue: 3; Journal ID: ISSN 1386-7857
Publisher:
Springer
Country of Publication:
United States
Language:
English

Citation Formats

Weeks, Nathan T., and Luecke, Glenn R. Optimization of SAMtools sorting using OpenMP tasks. United States: N. p., 2017. Web. doi:10.1007/s10586-017-0874-8.
Weeks, Nathan T., & Luecke, Glenn R. Optimization of SAMtools sorting using OpenMP tasks. United States. doi:10.1007/s10586-017-0874-8.
Weeks, Nathan T., and Luecke, Glenn R. 2017. "Optimization of SAMtools sorting using OpenMP tasks". United States. doi:10.1007/s10586-017-0874-8.
@article{osti_1461875,
title = {Optimization of SAMtools sorting using OpenMP tasks},
author = {Weeks, Nathan T. and Luecke, Glenn R.},
abstractNote = {SAMtools is a widely used genomics application for post-processing high-throughput sequence alignment data. Such sequence alignment data are commonly sorted to make downstream analysis more efficient. However, this sorting process itself can be computationally- and I/O-intensive: high-throughput sequence alignment files in the de facto standard binary alignment/map (BAM) format can be many gigabytes in size, and may need to be decompressed before sorting and compressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper describes a case study on the performance analysis and optimization of SAMtools for sorting large BAM files. OpenMP task parallelism and memory optimization techniques resulted in a speedup of 5.9X versus the upstream SAMtools 1.3.1 for an internal (in-memory) sort of 24.6 GiB of compressed BAM data (102.6 GiB uncompressed) with 32 processor cores, while a 1.98X speedup was achieved for an external (out-of-core) sort of a 271.4 GiB BAM file.},
doi = {10.1007/s10586-017-0874-8},
journal = {Cluster Computing},
issn = {1386-7857},
number = 3,
volume = 20,
place = {United States},
year = {2017},
month = {4}
}
