DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark

Linderman, Michael D.; Chia, Davin; Wallace, Forrest; Nothaft, Frank A.

doi:10.1186/s12859-019-3108-7

Journal Article · October 11, 2019 · BMC Bioinformatics

DOI: https://doi.org/10.1186/s12859-019-3108-7 · OSTI ID: 1618535

Abstract

Background

XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results.

Results

DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS’ Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster.
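The abstract reports DECA runtimes and speedups but not the XHMM runtimes they were measured against. As a rough illustration only, the sketch below back-calculates those baselines by multiplying each reported DECA runtime by its reported speedup. The function name is hypothetical, and the two results differ, presumably because XHMM was timed on different hardware in each setting; that reading is an assumption, not something the abstract states. The inputs are rounded figures, so the outputs are approximate.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Assumption: speedup = (XHMM runtime) / (DECA runtime), so the XHMM
# baseline can be estimated as DECA runtime * speedup.

def implied_baseline_minutes(deca_minutes: float, speedup: float) -> float:
    """Estimate the XHMM runtime implied by a DECA runtime and its speedup."""
    return deca_minutes * speedup

# Values taken directly from the Results paragraph.
workstation = implied_baseline_minutes(9.3, 35.3)   # 16-core workstation
cluster = implied_baseline_minutes(12.7, 18.8)      # 10 executor cores on Spark

for label, minutes in (("workstation", workstation), ("cluster", cluster)):
    print(f"{label}: about {minutes:.0f} min ({minutes / 60:.1f} h) implied for XHMM")
```

Both estimates fall in the range of several hours, which is consistent with the Background statement that XHMM can take hours to days on large cohorts.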

Conclusions

We describe DECA’s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark’s configuration parameters.

Sponsoring Organization: USDOE

OSTI ID: 1618535

Journal Information: BMC Bioinformatics, Vol. 20, Issue 1; ISSN 1471-2105

Publisher: Springer Science + Business Media

Country of Publication: United Kingdom

Language: English
