U.S. Department of Energy
Office of Scientific and Technical Information

A hybrid computational strategy to address WGS variant analysis in >5000 samples

Journal Article · BMC Bioinformatics
Abstract

Background

The decreasing cost of sequencing is driving the need for cost-effective, real-time variant calling of whole genome sequencing data. The scale of these projects is far beyond the capacity of the computing resources typically available to most research labs. Other infrastructures, such as the AWS cloud environment and supercomputers, also have limitations that make large-scale joint variant calling infeasible, and infrastructure-specific variant calling strategies either fail to scale up to large datasets or abandon joint calling altogether.

Results

We present a high-throughput framework that includes multiple variant callers for single nucleotide variant (SNV) calling and leverages a hybrid computing infrastructure consisting of the AWS cloud, supercomputers and local high-performance computing resources. We present a novel binning approach for large-scale joint variant calling and imputation that can scale to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results from the analysis of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset, in which joint calling, imputation and phasing of over 5300 whole genome samples were completed in under 6 weeks using four state-of-the-art callers: SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. For the computation we used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, the IBM PowerPC system Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL). AWS was used for joint calling of 180 TB of BAM files, and the ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and transferred only 6 TB of data in total across the platforms.
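
The abstract does not say how the bins are defined, so the following is only a minimal sketch of one plausible reading: the genome is cut into fixed-size windows, and each window can then be joint-called across all samples as an independent job on whichever platform is available. The function name `make_bins`, the window size and the toy contig lengths are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterator, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive


def make_bins(contig_lengths: Dict[str, int], bin_size: int) -> Iterator[Region]:
    """Yield non-overlapping windows that together cover every contig.

    Assumption: fixed-size windows. The paper's actual binning rule may differ.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, bin_size):
            # The last window on a contig is truncated at the contig end.
            yield contig, start, min(start + bin_size - 1, length)


# Hypothetical toy contigs, far smaller than real chromosomes.
toy_lengths = {"chrA": 2500, "chrB": 1000}
bins = list(make_bins(toy_lengths, bin_size=1000))

for contig, start, end in bins:
    # Each region string could be handed to a separate joint-calling job.
    print(f"{contig}:{start}-{end}")
```

Because the windows do not overlap and cover each contig end to end, per-bin callsets can be concatenated in order to recover a whole-genome callset, which is what makes this kind of partitioning convenient for spreading work across heterogeneous platforms.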

Conclusions

Even as whole genome datasets grow, ensemble joint calling of SNVs for low-coverage data can be accomplished in a scalable, cost-effective and fast manner by using heterogeneous computing platforms, without compromising variant quality.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE; USDOE Office of Science (SC)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1618528
Alternate ID(s):
OSTI ID: 1565524
Journal Information:
Journal Name: BMC Bioinformatics; Journal Volume: 17; Journal Issue: 1; ISSN 1471-2105
Publisher:
Springer Science + Business Media
Country of Publication:
United Kingdom
Language:
English
