DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Recommendations for Uniform Variant Calling of SARS-CoV-2 Genome Sequence across Bioinformatic Workflows

Journal Article · Viruses
DOI: https://doi.org/10.3390/v16030430 · OSTI ID:2470558
[1]; [2]; [3]; [4]; [5]; [3]; [1]; [2]; [6]; [7]; [2]; [8]; [1]; [1]; [1]; [1]; [6]; [5]; [2]; [2]; [9]; [5]; [1]; [1]; [7]; [1]
  1. National Institutes of Health (NIH), Bethesda, MD (United States). National Library of Medicine (NLM), National Center for Biotechnology Information (NCBI)
  2. Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
  3. American Type Culture Collection, Manassas, VA (United States); BEI Resources, Manassas, VA (United States)
  4. University of Freiburg (Germany)
  5. Gilead Sciences, Foster City, CA (United States)
  6. Deloitte Consulting LLP, Rosslyn, VA (United States)
  7. Vir Biotechnology Inc., San Francisco, CA (United States)
  8. Eli Lilly and Company, Indianapolis, IN (United States)
  9. American Type Culture Collection, Manassas, VA (United States)

Genomic sequencing of clinical samples to identify emerging variants of SARS-CoV-2 has been a key public health tool for curbing the spread of the virus. As a result, an unprecedented number of SARS-CoV-2 genomes were sequenced during the COVID-19 pandemic, which allowed for rapid identification of genetic variants, enabling the timely design and testing of therapies and deployment of new vaccine formulations to combat the new variants. However, despite the technological advances of deep sequencing, the analysis of the raw sequence data generated globally is neither standardized nor consistent, leading to vastly disparate sequences that may impact identification of variants. Here, we show that for both Illumina and Oxford Nanopore sequencing platforms, downstream bioinformatic protocols used by industry, government, and academic groups resulted in different virus sequences from the same sample. These bioinformatic workflows produced consensus genomes that differed in single nucleotide polymorphisms and in the inclusion or exclusion of insertions and/or deletions, despite using the same raw sequence datasets as input. We compare and characterize these discrepancies and propose a specific suite of parameters and protocols that should be adopted across the field. Consistent results from bioinformatic workflows are fundamental to SARS-CoV-2 and future pathogen surveillance efforts, including pandemic preparation, to allow for a data-driven and timely public health response.
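
As an illustration of the kind of discrepancy the abstract describes, the following minimal Python sketch classifies position-by-position differences between two consensus genomes produced from the same sample. It is not the method used in the article. The function names (`compare_consensus`, `summarize`), the toy sequences, and the conventions that both sequences are already aligned to the same reference coordinates, that `-` marks a gap, and that `N` marks a masked base are all assumptions made for this example.

```python
from collections import Counter


def compare_consensus(seq_a: str, seq_b: str):
    """Classify per-column differences between two aligned consensus sequences.

    Assumption: both sequences are already aligned to the same coordinates
    (equal length), '-' marks a gap, and 'N' marks a masked base.
    Returns a list of (1-based position, base_a, base_b, kind) tuples.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    diffs = []
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1):
        if a == b:
            continue
        if a == "N" or b == "N":
            kind = "masked"   # one workflow masked the site, the other called it
        elif a == "-" or b == "-":
            kind = "indel"    # a gap is present in only one consensus
        else:
            kind = "snp"      # two different called bases
        diffs.append((pos, a, b, kind))
    return diffs


def summarize(diffs):
    """Count differences by kind."""
    return Counter(kind for _, _, _, kind in diffs)


# Hypothetical toy consensus sequences from two workflows (not real data).
workflow_a = "ACGTACGTAC"
workflow_b = "ACTTA-GTNC"

diffs = compare_consensus(workflow_a, workflow_b)
summary = summarize(diffs)
for pos, a, b, kind in diffs:
    print(f"position {pos}: {a} vs {b} ({kind})")
```

Treating masked sites as their own category keeps low-coverage masking from being counted as a true base disagreement, which is one reason two workflows can appear to differ more than they actually do.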

Research Organization:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
European Union Horizon 2020; National Institutes of Health (NIH); USDOE Laboratory Directed Research and Development (LDRD) Program
Grant/Contract Number:
89233218CNA000001
OSTI ID:
2470558
Journal Information:
Journal Name: Viruses; Journal Issue: 3; Vol. 16; ISSN 1999-4915
Publisher:
MDPI
Country of Publication:
United States
Language:
English
