DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of k-mer and Site-Based Methods for Inferring the Genetic Distances among Tens of Thousands of Salmonella Samples

Abstract

The adoption of whole-genome sequencing within the public health realm for molecular characterization of bacterial pathogens has been followed by an increased emphasis on real-time detection of emerging outbreaks (e.g., food-borne salmonellosis). In turn, large databases of whole-genome sequence data are being populated. These databases currently contain tens of thousands of samples and are expected to grow to hundreds of thousands within a few years. For these databases to be of optimal use, one must be able to interrogate them quickly and determine accurately the genetic distances among a set of samples. Doing so is challenging because of both biological issues (evolutionarily diverse samples) and computational issues (petabytes of sequence data). We evaluated seven measures of genetic distance, which were estimated from either k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) or nucleotide sites (NUCmer and an extended multi-locus sequence typing (MLST) scheme). Finally, when analyzing empirical data (whole-genome sequence data from 18,997 Salmonella isolates), there are features (e.g., genomic, assembly, and contamination) that cause distances inferred from k-mer profiles, which treat absent data as informative, to fail to accurately capture the distance between samples when compared to distances inferred from differences in nucleotide sites. Thus, site-based distances, like NUCmer and extended MLST, are superior in performance, but accessing the computing resources necessary to perform them may be challenging when analyzing large databases.
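
As a rough illustration of the k-mer-based measures named in the abstract, the following Python sketch computes Jaccard, Manhattan, and Euclidean distances from k-mer count profiles, and converts a Jaccard index into a Mash-style distance using the published Mash formula D = -(1/k) ln(2j / (1 + j)). It is not the authors' pipeline: the function names, the forward-strand-only counting, the exact (non-sketched) Jaccard index, and the toy sequences are all assumptions made for illustration. Mash itself estimates the Jaccard index from MinHash sketches rather than from full k-mer sets.

```python
import math
from collections import Counter


def kmer_profile(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (forward strand only, a simplification)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_distance(a: Counter, b: Counter) -> float:
    """1 - |A n B| / |A u B| over the sets of k-mers present in each profile."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def manhattan_distance(a: Counter, b: Counter) -> int:
    """Sum of absolute count differences; a Counter returns 0 for absent k-mers."""
    return sum(abs(a[x] - b[x]) for x in set(a) | set(b))


def euclidean_distance(a: Counter, b: Counter) -> float:
    """Square root of the summed squared count differences."""
    return math.sqrt(sum((a[x] - b[x]) ** 2 for x in set(a) | set(b)))


def mash_distance(jaccard_index: float, k: int) -> float:
    """Mash-style distance from a Jaccard index j: D = -(1/k) * ln(2j / (1 + j))."""
    if jaccard_index <= 0.0:
        return 1.0  # no shared k-mers: cap the distance at 1
    return -math.log(2.0 * jaccard_index / (1.0 + jaccard_index)) / k


if __name__ == "__main__":
    k = 3
    complete = "ACGTACGGTCA"      # hypothetical toy "genome"
    truncated = complete[:7]      # same sequence with its tail missing
    p, q = kmer_profile(complete, k), kmer_profile(truncated, k)
    j_dist = jaccard_distance(p, q)
    print("Jaccard distance:  ", j_dist)
    print("Manhattan distance:", manhattan_distance(p, q))
    print("Euclidean distance:", euclidean_distance(p, q))
    print("Mash-style distance:", mash_distance(1.0 - j_dist, k))
```

In the toy example the truncated sequence has no nucleotide differences from the complete one, yet every k-mer-based distance is greater than zero because the missing tail is counted as a difference. This is the sense in which k-mer profiles "treat absent data as informative"; a site-based comparison restricted to positions present in both samples would report zero differences here.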

Authors:
Pettengill, James B. [1]; Pightling, Arthur W. [1]; Baugher, Joseph D. [1]; Rand, Hugh [1]; Strain, Errol [1]
  1. U.S. Food and Drug Administration, College Park, MD (United States). Center for Food Safety and Applied Nutrition
Publication Date:
2016-11
Research Org.:
U.S. Food and Drug Administration, College Park, MD (United States). Center for Food Safety and Applied Nutrition, Office of Foods and Veterinary Medicine
Sponsoring Org.:
USDOE; US Food and Drug Administration (FDA)
OSTI Identifier:
1378468
Resource Type:
Accepted Manuscript
Journal Name:
PLoS ONE
Additional Journal Information:
Journal Volume: 11; Journal Issue: 11; Journal ID: ISSN 1932-6203
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES

Citation Formats

Pettengill, James B., Pightling, Arthur W., Baugher, Joseph D., Rand, Hugh, and Strain, Errol. Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of k-mer and Site-Based Methods for Inferring the Genetic Distances among Tens of Thousands of Salmonella Samples. United States: N. p., 2016. Web. doi:10.1371/journal.pone.0166162.
Pettengill, James B., Pightling, Arthur W., Baugher, Joseph D., Rand, Hugh, & Strain, Errol. Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of k-mer and Site-Based Methods for Inferring the Genetic Distances among Tens of Thousands of Salmonella Samples. United States. doi:10.1371/journal.pone.0166162.
Pettengill, James B., Pightling, Arthur W., Baugher, Joseph D., Rand, Hugh, and Strain, Errol. 2016. "Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of k-mer and Site-Based Methods for Inferring the Genetic Distances among Tens of Thousands of Salmonella Samples". United States. doi:10.1371/journal.pone.0166162. https://www.osti.gov/servlets/purl/1378468.
@article{osti_1378468,
title = {Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of k-mer and Site-Based Methods for Inferring the Genetic Distances among Tens of Thousands of Salmonella Samples},
author = {Pettengill, James B. and Pightling, Arthur W. and Baugher, Joseph D. and Rand, Hugh and Strain, Errol},
abstractNote = {The adoption of whole-genome sequencing within the public health realm for molecular characterization of bacterial pathogens has been followed by an increased emphasis on real-time detection of emerging outbreaks (e.g., food-borne salmonellosis). In turn, large databases of whole-genome sequence data are being populated. These databases currently contain tens of thousands of samples and are expected to grow to hundreds of thousands within a few years. For these databases to be of optimal use, one must be able to interrogate them quickly and determine accurately the genetic distances among a set of samples. Doing so is challenging because of both biological issues (evolutionarily diverse samples) and computational issues (petabytes of sequence data). We evaluated seven measures of genetic distance, which were estimated from either k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) or nucleotide sites (NUCmer and an extended multi-locus sequence typing (MLST) scheme). Finally, when analyzing empirical data (whole-genome sequence data from 18,997 Salmonella isolates), there are features (e.g., genomic, assembly, and contamination) that cause distances inferred from k-mer profiles, which treat absent data as informative, to fail to accurately capture the distance between samples when compared to distances inferred from differences in nucleotide sites. Thus, site-based distances, like NUCmer and extended MLST, are superior in performance, but accessing the computing resources necessary to perform them may be challenging when analyzing large databases.},
doi = {10.1371/journal.pone.0166162},
journal = {PLoS ONE},
number = 11,
volume = 11,
place = {United States},
year = {2016},
month = {11}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 2 works
Citation information provided by
Web of Science
