DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer

Abstract

The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral tree of life. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. Lastly, the resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.
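
As a rough illustration of two of the quantities named in the abstract, the sketch below counts overlapping k-mers in toy sequences, then reports the Shannon diversity index of each k-mer profile and the average number of k-mers shared between pairs of sequences for a few values of k. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy sequences, the single-strand counting, and the use of the natural logarithm are all choices made here for illustration. The cumulative relative entropy step is left out because it depends on modelling details that this record does not give.

```python
from collections import Counter
from itertools import combinations
from math import log


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (single strand only; an assumption)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_index(counts: Counter) -> float:
    """Shannon diversity H = -sum(p * ln p) over the observed k-mer frequencies."""
    total = sum(counts.values())
    return -sum((n / total) * log(n / total) for n in counts.values())


def mean_common_features(profiles: list) -> float:
    """Average number of distinct k-mers shared by each pair of profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(len(a.keys() & b.keys()) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy sequences, not real viral genomes.
    genomes = ["ACGTACGTAC", "ACGTTTGTAC", "GGGGCCCCAA"]
    for k in (2, 3, 4):
        profiles = [kmer_counts(g, k) for g in genomes]
        diversity = [round(shannon_index(p), 3) for p in profiles]
        shared = round(mean_common_features(profiles), 3)
        print(f"k={k} shannon={diversity} mean_common={shared}")
```

In a scan like this, short k-mers tend to be shared by almost every sequence while long k-mers are shared by almost none, which is one reason a study would compare several criteria across a range of k before settling on a feature length.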

Authors:
Zhang, Qian [1]; Jun, Se-Ran [2]; Leuze, Michael [1]; Ussery, David [2]; Nookaew, Intawat [2]
  1. Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Arkansas for Medical Sciences, Little Rock, AR (United States)
Publication Date:
2017-01-19
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1351783
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
Scientific Reports
Additional Journal Information:
Journal Volume: 7; Journal ID: ISSN 2045-2322
Publisher:
Nature Publishing Group
Country of Publication:
United States
Language:
English
Subject:
60 APPLIED LIFE SCIENCES; 96 KNOWLEDGE MANAGEMENT AND PRESERVATION; classification and taxonomy; genome informatics

Citation Formats

Zhang, Qian, Jun, Se-Ran, Leuze, Michael, Ussery, David, and Nookaew, Intawat. Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer. United States: N. p., 2017. Web. doi:10.1038/srep40712.
Zhang, Qian, Jun, Se-Ran, Leuze, Michael, Ussery, David, & Nookaew, Intawat. (2017). Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer. United States. https://doi.org/10.1038/srep40712
Zhang, Qian, Jun, Se-Ran, Leuze, Michael, Ussery, David, and Nookaew, Intawat. 2017. "Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer". United States. https://doi.org/10.1038/srep40712. https://www.osti.gov/servlets/purl/1351783.
@article{osti_1351783,
title = {Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer},
author = {Zhang, Qian and Jun, Se-Ran and Leuze, Michael and Ussery, David and Nookaew, Intawat},
abstractNote = {The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral tree of life. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. Lastly, the resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.},
doi = {10.1038/srep40712},
journal = {Scientific Reports},
volume = 7,
place = {United States},
year = {2017},
month = jan
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 25 works
Citation information provided by
Web of Science

Works referencing / citing this record:

Defining a Core Genome for the Herpesvirales and Exploring their Evolutionary Relationship with the Caudovirales
journal, August 2019

  • Andrade-Martínez, Juan S.; Moreno-Gallego, J. Leonardo; Reyes, Alejandro
  • Scientific Reports, Vol. 9, Issue 1
  • DOI: 10.1038/s41598-019-47742-z

Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes
journal, May 2018

  • Pratas, Diogo; Silva, Raquel; Pinho, Armando
  • Entropy, Vol. 20, Issue 6
  • DOI: 10.3390/e20060393

Metagenomic Composition Analysis of an Ancient Sequenced Polar Bear Jawbone from Svalbard
journal, September 2018

  • Pratas, Diogo; Hosseini, Morteza; Grilo, Gonçalo
  • Genes, Vol. 9, Issue 9
  • DOI: 10.3390/genes9090445

SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform
journal, May 2018