DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A hybrid computational strategy to address WGS variant analysis in >5000 samples

Abstract

The decreasing cost of sequencing is driving the need for cost-effective, real-time variant calling of whole genome sequencing (WGS) data. The scale of these projects is far beyond the capacity of the computing resources typically available to most research labs. Other infrastructures, such as the AWS cloud environment and supercomputers, have limitations of their own that make large-scale joint variant calling infeasible, and infrastructure-specific variant calling strategies either fail to scale up to large datasets or abandon joint calling altogether. We present a high-throughput framework that includes multiple variant callers for single nucleotide variant (SNV) calling and leverages a hybrid computing infrastructure consisting of the AWS cloud, supercomputers and local high performance computing infrastructures. We present a novel binning approach for large-scale joint variant calling and imputation which can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on the Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset, in which joint calling, imputation and phasing of over 5300 whole genome samples was completed in under 6 weeks using four state-of-the-art callers. The callers used were SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180 TB of BAM files, and the ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and transferred a total of only 6 TB of data across the platforms.
Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low-coverage data can be accomplished in a scalable, cost-effective and fast manner by using heterogeneous computing platforms, without compromising the quality of the variants.
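
The abstract names a binning approach for joint calling but does not describe it. As a rough illustration only, the sketch below assumes the simplest possible scheme: each chromosome is cut into fixed-width windows, so that every window can be jointly called across all samples as an independent job and the jobs can be spread over whatever workers a platform offers. The function names, the toy chromosome lengths, the 1 Mb bin width and the round-robin assignment are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterator, List, Tuple

# A bin is (chromosome, 1-based start, inclusive end). Illustrative type only.
Bin = Tuple[str, int, int]


def make_bins(chrom_lengths: Dict[str, int], bin_size: int) -> Iterator[Bin]:
    """Yield non-overlapping fixed-width windows that tile each chromosome.

    Assumption: fixed-width windows. The paper's actual binning rule is not
    given in the abstract.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, bin_size):
            # The last window of a chromosome is truncated at its end.
            yield chrom, start, min(start + bin_size - 1, length)


def assign_round_robin(bins: List[Bin], n_workers: int) -> List[List[Bin]]:
    """Distribute bins over workers in round-robin order (hypothetical policy)."""
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    return [bins[i::n_workers] for i in range(n_workers)]


# Toy chromosome lengths, not real genome coordinates.
toy_genome = {"chrA": 2_500_000, "chrB": 1_000_000}
toy_bins = list(make_bins(toy_genome, bin_size=1_000_000))
for chrom, start, end in toy_bins:
    # Region strings in the common chrom:start-end form.
    print(f"{chrom}:{start}-{end}")
```

Because the windows tile each chromosome without gaps or overlaps, per-bin callsets can in principle be concatenated in coordinate order afterwards. Real pipelines often add padding around each window to limit edge effects in imputation, which this sketch leaves out.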

Authors:
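Huang, Zhuoyi [1]; Rustagi, Navin [1]; Veeraraghavan, Narayanan [1]; Carroll, Andrew [2]; Gibbs, Richard [1]; Boerwinkle, Eric [3]; Venkata, Manjunath Gorentla [4]; Yu, Fuli [1]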
  1. Baylor College of Medicine, Houston, TX (United States). Human Genome Sequencing Center
  2. DNAnexus, Mountain View, CA (United States)
  3. Baylor College of Medicine, Houston, TX (United States). Human Genome Sequencing Center; Univ. of Texas Health Science Center, Houston, TX (United States). Human Genetics Center
  4. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1565524
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
BMC Bioinformatics
Additional Journal Information:
Journal Volume: 17; Journal Issue: 1; Journal ID: ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; biochemistry & molecular biology; biotechnology & applied microbiology; mathematical & computational biology; WGS; SNV; variant calling; joint calling; supercomputer; cloud AWS; scalable; big data; ensemble calling

Citation Formats

Huang, Zhuoyi, Rustagi, Navin, Veeraraghavan, Narayanan, Carroll, Andrew, Gibbs, Richard, Boerwinkle, Eric, Venkata, Manjunath Gorentla, and Yu, Fuli. A hybrid computational strategy to address WGS variant analysis in >5000 samples. United States: N. p., 2016. Web. doi:10.1186/s12859-016-1211-6.
Huang, Zhuoyi, Rustagi, Navin, Veeraraghavan, Narayanan, Carroll, Andrew, Gibbs, Richard, Boerwinkle, Eric, Venkata, Manjunath Gorentla, & Yu, Fuli. A hybrid computational strategy to address WGS variant analysis in >5000 samples. United States. doi:10.1186/s12859-016-1211-6.
Huang, Zhuoyi, Rustagi, Navin, Veeraraghavan, Narayanan, Carroll, Andrew, Gibbs, Richard, Boerwinkle, Eric, Venkata, Manjunath Gorentla, and Yu, Fuli. Sat . "A hybrid computational strategy to address WGS variant analysis in >5000 samples". United States. doi:10.1186/s12859-016-1211-6. https://www.osti.gov/servlets/purl/1565524.
@article{osti_1565524,
title = {A hybrid computational strategy to address WGS variant analysis in >5000 samples},
author = {Huang, Zhuoyi and Rustagi, Navin and Veeraraghavan, Narayanan and Carroll, Andrew and Gibbs, Richard and Boerwinkle, Eric and Venkata, Manjunath Gorentla and Yu, Fuli},
doi = {10.1186/s12859-016-1211-6},
journal = {BMC Bioinformatics},
number = 1,
volume = 17,
place = {United States},
year = {2016},
month = {9}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 1 work
Citation information provided by
Web of Science

