DOE PAGES

Title: Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer

The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral tree of life. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. Lastly, the resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.
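The abstract names k-mers as genomic features, the number of common features among genomes, and the Shannon diversity index, but this record does not give the paper's exact definitions. The following is a minimal sketch under generic textbook assumptions (overlapping sliding-window k-mers, natural-log Shannon index); the function names kmer_counts, shannon_diversity, and common_features are hypothetical and are not taken from the authors' code. Cumulative relative entropy is omitted because its formulation is not stated here.

```python
from collections import Counter
from math import log


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence (generic sliding window)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_diversity(counts: Counter) -> float:
    """Textbook Shannon index H = -sum(p * ln p) over observed k-mer frequencies.

    Assumption: natural logarithm; the paper may normalize differently.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * log(n / total) for n in counts.values())


def common_features(a: Counter, b: Counter) -> int:
    """Number of distinct k-mers that occur in both profiles."""
    return len(a.keys() & b.keys())
```

Scanning a range of k values with helpers like these, and watching how the diversity and shared-feature counts change, is one plausible way to explore the choice of k; the selection criteria actually used are described in the full article.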
Authors:
Zhang, Qian [1]; Jun, Se-Ran [2]; Leuze, Michael [1]; Ussery, David [2]; Nookaew, Intawat [2]
  1. Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Arkansas for Medical Sciences, Little Rock, AR (United States)
Publication Date:
Grant/Contract Number:
AC05-00OR22725
Type:
Accepted Manuscript
Journal Name:
Scientific Reports
Additional Journal Information:
Journal Volume: 7; Journal ID: ISSN 2045-2322
Publisher:
Nature Publishing Group
Research Org:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org:
USDOE Office of Science (SC)
Country of Publication:
United States
Language:
English
Subject:
60 APPLIED LIFE SCIENCES; 96 KNOWLEDGE MANAGEMENT AND PRESERVATION; classification and taxonomy; genome informatics
OSTI Identifier:
1351783

Zhang, Qian, Jun, Se-Ran, Leuze, Michael, Ussery, David, and Nookaew, Intawat. Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer. United States: N. p., Web. doi:10.1038/srep40712.
Zhang, Qian, Jun, Se-Ran, Leuze, Michael, Ussery, David, & Nookaew, Intawat. Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer. United States. doi:10.1038/srep40712.
Zhang, Qian, Jun, Se-Ran, Leuze, Michael, Ussery, David, and Nookaew, Intawat. 2017. "Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer". United States. doi:10.1038/srep40712. https://www.osti.gov/servlets/purl/1351783.
@article{osti_1351783,
title = {Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer},
author = {Zhang, Qian and Jun, Se-Ran and Leuze, Michael and Ussery, David and Nookaew, Intawat},
abstractNote = {The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral tree of life. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. Lastly, the resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.},
doi = {10.1038/srep40712},
journal = {Scientific Reports},
number = {},
volume = 7,
place = {United States},
year = {2017},
month = {1}
}