skip to main content


Title: Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer

The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral tree of life . However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. Lastly, the resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classificationmore » of viruses.« less
 [1] ;  [2] ;  [1] ;  [2] ;  [2]
  1. Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Arkansas for Medical Sciences, Little Rock, AR (United States)
Publication Date:
Grant/Contract Number:
Accepted Manuscript
Journal Name:
Scientific Reports
Additional Journal Information:
Journal Volume: 7; Journal ID: ISSN 2045-2322
Nature Publishing Group
Research Org:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org:
USDOE Office of Science (SC)
Country of Publication:
United States
60 APPLIED LIFE SCIENCES; 96 KNOWLEDGE MANAGEMENT AND PRESERVATION; classification and taxonomy; genome informatics
OSTI Identifier: