OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer

Journal Article · Scientific Reports
DOI: https://doi.org/10.1038/srep40712 · OSTI ID: 1351783
 [1];  [2];  [1];  [2];  [2]
  1. Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Arkansas for Medical Sciences, Little Rock, AR (United States)

The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral tree of life. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. Lastly, the resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.
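As a rough illustration of two of the three measures named in the abstract, the sketch below counts k-mers in a few toy sequences and reports, for each k, a Shannon diversity index over the pooled k-mer frequencies and the average number of k-mers shared between pairs of sequences. This record does not give the paper's exact formulas, so the definitions here are assumptions: the textbook form H = -Σ p ln p for the Shannon index, and a plain set intersection of distinct k-mers for common features. The function names and the toy sequences are hypothetical. Cumulative relative entropy is left out because it depends on a background model that the abstract does not describe.

```python
from collections import Counter
from itertools import combinations
from math import log


def kmer_counts(seq, k):
    """Count overlapping k-mers in one sequence (single strand only)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_index(counts):
    """Textbook Shannon diversity index H = -sum(p * ln p).

    Assumption: p is the relative frequency of each distinct k-mer.
    """
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())


def mean_common_features(profiles):
    """Average number of distinct k-mers shared by each pair of profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(len(a.keys() & b.keys()) for a, b in pairs) / len(pairs)


# Toy sequences for illustration only, not real viral genomes.
genomes = ["ACGTACGTGA", "ACGTTTGACA", "GGGACGTACC"]

for k in range(1, 5):
    profiles = [kmer_counts(g, k) for g in genomes]
    pooled = sum(profiles, Counter())  # k-mer counts across all sequences
    print(k, round(shannon_index(pooled), 3),
          round(mean_common_features(profiles), 2))
```

Scanning a range of k values in this way shows how the two quantities move in opposite directions: longer k-mers are more distinctive but are shared by fewer genomes. How the study combines these signals to pick a single k is not specified in this record.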

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1351783
Journal Information:
Scientific Reports, Vol. 7; ISSN 2045-2322
Publisher:
Nature Publishing Group
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 25 works
Citation information provided by
Web of Science

Cited By (4)

Defining a Core Genome for the Herpesvirales and Exploring their Evolutionary Relationship with the Caudovirales journal August 2019
Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes journal May 2018
Metagenomic Composition Analysis of an Ancient Sequenced Polar Bear Jawbone from Svalbard journal September 2018
SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform journal May 2018