Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer
Abstract
The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral tree of life. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. Lastly, the resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.
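As a rough illustration of the k-mer features mentioned in the abstract, the sketch below counts overlapping k-mers in a nucleotide sequence and computes two of the named quantities in their textbook forms: the Shannon diversity index of a k-mer profile and the number of k-mers shared by two genomes. It is a minimal sketch under stated assumptions, not the authors' implementation. The function names (`kmer_counts`, `shannon_index`, `common_features`) are hypothetical, the natural logarithm is an arbitrary choice, and the paper's exact definitions (including cumulative relative entropy, which is not attempted here) may differ.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq.

    Assumption: windows containing symbols other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= alphabet:
            counts[kmer] += 1
    return counts


def shannon_index(counts):
    """Textbook Shannon diversity H = -sum(p * ln p) over k-mer frequencies."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def common_features(counts_a, counts_b):
    """Number of distinct k-mers present in both profiles."""
    return len(counts_a.keys() & counts_b.keys())


if __name__ == "__main__":
    # Toy sequences for illustration only; these are not real viral genomes.
    genome_a = "ACGTACGTTAGC"
    genome_b = "CGTACGATAGCA"
    for k in range(2, 6):
        a = kmer_counts(genome_a, k)
        b = kmer_counts(genome_b, k)
        print(k, round(shannon_index(a), 3), common_features(a, b))
```

Scanning a range of k values like this shows the trade-off the abstract alludes to: as k grows, each profile becomes more specific to its genome while the number of features shared between genomes falls.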
- Authors:
- Zhang, Qian; Jun, Se-Ran; Leuze, Michael; Ussery, David; Nookaew, Intawat
- Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Arkansas for Medical Sciences, Little Rock, AR (United States)
- Publication Date:
- 2017-01-19
- Research Org.:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
- Sponsoring Org.:
- USDOE Office of Science (SC)
- OSTI Identifier:
- 1351783
- Grant/Contract Number:
- AC05-00OR22725
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Scientific Reports
- Additional Journal Information:
- Journal Volume: 7; Journal ID: ISSN 2045-2322
- Publisher:
- Nature Publishing Group
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 60 APPLIED LIFE SCIENCES; 96 KNOWLEDGE MANAGEMENT AND PRESERVATION; classification and taxonomy; genome informatics
Citation Formats
Zhang, Qian, Jun, Se-Ran, Leuze, Michael, Ussery, David, and Nookaew, Intawat. Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer. United States: N. p., 2017. Web. doi:10.1038/srep40712.
Zhang, Qian, Jun, Se-Ran, Leuze, Michael, Ussery, David, & Nookaew, Intawat. Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer. United States. https://doi.org/10.1038/srep40712
Zhang, Qian, Jun, Se-Ran, Leuze, Michael, Ussery, David, and Nookaew, Intawat. 2017. "Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer". United States. https://doi.org/10.1038/srep40712. https://www.osti.gov/servlets/purl/1351783.
@article{osti_1351783,
title = {Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer},
author = {Zhang, Qian and Jun, Se-Ran and Leuze, Michael and Ussery, David and Nookaew, Intawat},
abstractNote = {The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral tree of life. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. Lastly, the resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.},
doi = {10.1038/srep40712},
journal = {Scientific Reports},
volume = 7,
place = {United States},
year = {2017},
month = jan
}
Works referencing / citing this record:
Defining a Core Genome for the Herpesvirales and Exploring their Evolutionary Relationship with the Caudovirales
journal, August 2019
- Andrade-Martínez, Juan S.; Moreno-Gallego, J. Leonardo; Reyes, Alejandro
- Scientific Reports, Vol. 9, Issue 1
Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes
journal, May 2018
- Pratas, Diogo; Silva, Raquel; Pinho, Armando
- Entropy, Vol. 20, Issue 6
Metagenomic Composition Analysis of an Ancient Sequenced Polar Bear Jawbone from Svalbard
journal, September 2018
- Pratas, Diogo; Hosseini, Morteza; Grilo, Gonçalo
- Genes, Vol. 9, Issue 9
SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform
journal, May 2018
- Lin, Jie; Wei, Jing; Adjeroh, Donald
- BMC Bioinformatics, Vol. 19, Issue 1