U.S. Department of Energy
Office of Scientific and Technical Information

DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark

Journal Article · BMC Bioinformatics

Abstract

Background

XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results.

Results

DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS’ Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster.
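The reported speedups can be turned back into approximate XHMM baseline runtimes by multiplying each DECA runtime by its speedup factor. The sketch below does only that arithmetic; the function name `implied_baseline_minutes` is hypothetical, and the derived baselines are estimates inferred from the rounded figures above, not values stated in the abstract. The two baselines differ, which is consistent with the comparisons having been run on different hardware, though the abstract does not say so explicitly.

```python
def implied_baseline_minutes(deca_minutes: float, speedup: float) -> float:
    """Return the baseline runtime implied by a runtime and its speedup.

    Speedup is taken here to mean baseline_time / deca_time, so the
    baseline is simply deca_time * speedup. This definition is an
    assumption; the abstract does not spell it out.
    """
    if deca_minutes <= 0 or speedup <= 0:
        raise ValueError("runtime and speedup must be positive")
    return deca_minutes * speedup


# Figures quoted in the abstract (runtime in minutes, speedup vs. XHMM).
reported = {
    "16-core workstation": (9.3, 35.3),
    "Spark cluster, 10 executor cores": (12.7, 18.8),
}

for platform, (minutes, speedup) in reported.items():
    baseline = implied_baseline_minutes(minutes, speedup)
    # Convert to hours for readability alongside "hours to days".
    print(f"{platform}: ~{baseline:.0f} min (~{baseline / 60:.1f} h) implied for XHMM")
```

Both implied baselines fall in the range of several hours, in line with the "hours to days" description in the Background.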

Conclusions

We describe DECA’s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark’s configuration parameters.

Sponsoring Organization:
USDOE
OSTI ID:
1618535
Journal Information:
Journal Name: BMC Bioinformatics; Journal Volume: 20; Journal Issue: 1; ISSN 1471-2105
Publisher:
Springer Science + Business Media
Country of Publication:
United Kingdom
Language:
English

