Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of k-mer and Site-Based Methods for Inferring the Genetic Distances among Tens of Thousands of Salmonella Samples

Pettengill, James B.; Pightling, Arthur W.; Baugher, Joseph D.; Rand, Hugh; Strain, Errol

doi:10.1371/journal.pone.0166162

Journal Article · November 10, 2016 · PLoS ONE

DOI: https://doi.org/10.1371/journal.pone.0166162 · OSTI ID: 1378468

Pettengill, James B. [1]; Pightling, Arthur W. [1]; Baugher, Joseph D. [1]; Rand, Hugh [1]; Strain, Errol [1]

[1] U.S. Food and Drug Administration, College Park, MD (United States). Center for Food Safety and Applied Nutrition

The adoption of whole-genome sequencing within the public health realm for molecular characterization of bacterial pathogens has been followed by an increased emphasis on real-time detection of emerging outbreaks (e.g., food-borne salmonellosis). In turn, large databases of whole-genome sequence data are being populated. These databases currently contain tens of thousands of samples and are expected to grow to hundreds of thousands within a few years. For these databases to be of optimal use, one must be able to interrogate them quickly and determine accurately the genetic distances among a set of samples. Doing so is challenging because of both biological issues (evolutionarily diverse samples) and computational issues (petabytes of sequence data). We evaluated seven measures of genetic distance, which were estimated from either k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) or nucleotide sites (NUCmer and an extended multi-locus sequence typing (MLST) scheme). Finally, when analyzing empirical data (whole-genome sequence data from 18,997 Salmonella isolates), there are features (e.g., genomic, assembly, and contamination) that cause distances inferred from k-mer profiles, which treat absent data as informative, to fail to accurately capture the distance between samples when compared to distances inferred from differences in nucleotide sites. Thus, site-based distances, like NUCmer and extended MLST, are superior in performance, but accessing the computing resources necessary to compute them may be challenging when analyzing large databases.
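
To make the k-mer profile distances named in the abstract concrete, the following is a minimal Python sketch, not the authors' pipeline. It assumes exact k-mer counting on the forward strand only, computes the Jaccard similarity exactly rather than from MinHash sketches as Mash does, and uses toy sequences and a small k chosen purely for illustration. The function names are hypothetical. The Mash distance uses the published formula D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard similarity.

```python
import math
from collections import Counter


def kmer_profile(seq, k):
    """Count every overlapping k-mer in seq (forward strand only; an assumption)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_distance(a, b):
    """1 - |shared k-mers| / |all k-mers|, using presence/absence only."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def manhattan_distance(a, b):
    """Sum of absolute count differences over all k-mers seen in either profile."""
    return sum(abs(a[m] - b[m]) for m in set(a) | set(b))


def euclidean_distance(a, b):
    """Square root of the summed squared count differences."""
    return math.sqrt(sum((a[m] - b[m]) ** 2 for m in set(a) | set(b)))


def mash_distance(jaccard_similarity, k):
    """Mash distance from a Jaccard similarity j: -(1/k) * ln(2j / (1 + j))."""
    if jaccard_similarity <= 0:
        return 1.0  # no shared k-mers: treated as maximal distance here
    return max(0.0, -math.log(2 * jaccard_similarity / (1 + jaccard_similarity)) / k)


if __name__ == "__main__":
    k = 3  # toy value; real analyses use much longer k-mers
    full = kmer_profile("ACGTAC", k)
    variant = kmer_profile("ACGTAG", k)    # one substitution at the last site
    truncated = kmer_profile("ACGT", k)    # same sequence with missing data

    print(jaccard_distance(full, variant))
    print(manhattan_distance(full, variant))
    print(euclidean_distance(full, variant))
    # No nucleotide site differs here, yet the k-mer distance is nonzero:
    print(jaccard_distance(full, truncated))
    print(mash_distance(1.0 - jaccard_distance(full, variant), k))
```

The truncated sequence illustrates the abstract's point that k-mer profiles treat absent data as informative: k-mers that are simply missing from one sample (for example, because of an incomplete assembly) count as differences, whereas a site-based comparison would only score positions present in both samples.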

Research Organization: U.S. Food and Drug Administration, College Park, MD (United States). Center for Food Safety and Applied Nutrition, Office of Foods and Veterinary Medicine

Sponsoring Organization: USDOE; US Food and Drug Administration (FDA)

OSTI ID: 1378468

Journal Information: PLoS ONE, Vol. 11, Issue 11; ISSN 1932-6203

Publisher: Public Library of Science

Country of Publication: United States

Language: English

Citation Metrics: Cited by 9 works (citation information provided by Web of Science)

Cited By (5)

Whole genome sequencing for investigations of meningococcal outbreaks in the United States: a retrospective analysis. Whaley, Melissa J.; Joseph, Sandeep J.; Retchless, Adam C. Scientific Reports, Vol. 8, Issue 1. https://doi.org/10.1038/s41598-018-33622-5 (journal, October 2018)
Within-species contamination of bacterial whole-genome sequence data has a greater influence on clustering analyses than between-species contamination. Pightling, Arthur W.; Pettengill, James B.; Wang, Yu. Genome Biology, Vol. 20, Issue 1. https://doi.org/10.1186/s13059-019-1914-x (journal, December 2019)
Pan-genome Analyses of the Species Salmonella enterica, and Identification of Genomic Markers Predictive for Species, Subspecies, and Serovar. Laing, Chad R.; Whiteside, Matthew D.; Gannon, Victor P. J. Frontiers in Microbiology, Vol. 8. https://doi.org/10.3389/fmicb.2017.01345 (journal, July 2017)
Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel. Labbé, Geneviève; Kruczkiewicz, Peter; Robertson, James. Microbial Genomics, Vol. 7, Issue 9. https://doi.org/10.1099/mgen.0.000651 (journal, September 2021)
Building large updatable colored de Bruijn graphs via merging. Muggli, Martin D.; Alipanahi, Bahar; Boucher, Christina. Bioinformatics, Vol. 35, Issue 14. https://doi.org/10.1093/bioinformatics/btz350 (journal, July 2019)

Related Subjects

59 BASIC BIOLOGICAL SCIENCES
