DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads

Abstract

High-throughput sequencing technologies enable metagenome profiling, the simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data include sequence fragments (“reads”) from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between “known” and “unknown” taxa and that can be used with composition-based methods as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an “unknown” class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially at the species level and for ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.
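
The idea of pairing a composition-based classifier with a "known"/"unknown" detector can be illustrated with a minimal sketch. The abstract does not describe how the detector works, so everything below is an assumption made for illustration only: the k-mer log-likelihood scoring, the fixed score threshold, the toy reference sequences, and all function names (`train_profile`, `score`, `classify`) are hypothetical and are not the authors' method.

```python
import math
from collections import Counter


def train_profile(genome, k=3):
    """Build Laplace-smoothed k-mer log-probabilities for one reference taxon."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    vocab = 4 ** k  # number of possible DNA k-mers
    logp = {kmer: math.log((c + 1) / (total + vocab)) for kmer, c in counts.items()}
    unseen = math.log(1 / (total + vocab))  # score for k-mers absent from the reference
    return logp, unseen


def score(read, profile, k=3):
    """Average per-k-mer log-likelihood of a read under one taxon profile."""
    logp, unseen = profile
    n = len(read) - k + 1
    return sum(logp.get(read[i:i + k], unseen) for i in range(n)) / n


def classify(read, profiles, threshold, k=3):
    """Return the best-scoring taxon, or "unknown" if that score is below threshold.

    The threshold is an assumed, hand-picked value; a real detector would need
    to be tuned and evaluated on held-out novel taxa.
    """
    best_label, best_score = max(
        ((label, score(read, p, k)) for label, p in profiles.items()),
        key=lambda pair: pair[1],
    )
    return best_label if best_score >= threshold else "unknown"


# Toy reference "genomes" (hypothetical data, far shorter than real genomes).
profiles = {
    "taxonA": train_profile("ACGT" * 50),
    "taxonB": train_profile("AATT" * 50),
}

print(classify("ACGTACGTACGT", profiles, threshold=-3.0))  # taxonA
print(classify("GGGGCCCCGGGG", profiles, threshold=-3.0))  # unknown
```

In this sketch a read is assigned to "unknown" whenever even its best-matching reference explains it poorly, which is one simple way to add an "unknown" class to a classifier that would otherwise always return some known taxon.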

Authors:
Rosen, Gail L. [1]; Polikar, Robi [2]; Caseiro, Diamantino A. [3]; Essinger, Steven D. [1]; Sokhansanj, Bahrad A. [4]
  1. Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104, USA
  2. Department of Electrical and Computer Engineering, Rowan University, Glassboro, NJ 08028, USA
  3. Spoken Language Systems Laboratory, Instituto Superior Técnico, 1049-001 Lisbon, Portugal
  4. School of Biomedical Engineering, Science, and Health Systems, Drexel University, Philadelphia, PA 19104, USA
Publication Date:
Sponsoring Org.:
USDOE
OSTI Identifier:
1197845
Grant/Contract Number:  
SC0004335
Resource Type:
Published Article
Journal Name:
Journal of Biomedicine and Biotechnology
Additional Journal Information:
Journal Name: Journal of Biomedicine and Biotechnology; Journal Volume: 2011; Journal ID: ISSN 1110-7243
Publisher:
Hindawi Publishing Corporation
Country of Publication:
Country unknown/Code not available
Language:
English

Citation Formats

Rosen, Gail L., Polikar, Robi, Caseiro, Diamantino A., Essinger, Steven D., and Sokhansanj, Bahrad A. Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads. Country unknown/Code not available: N. p., 2011. Web. doi:10.1155/2011/495849.
Rosen, Gail L., Polikar, Robi, Caseiro, Diamantino A., Essinger, Steven D., & Sokhansanj, Bahrad A. Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads. Country unknown/Code not available. doi:10.1155/2011/495849.
Rosen, Gail L., Polikar, Robi, Caseiro, Diamantino A., Essinger, Steven D., and Sokhansanj, Bahrad A. Sat . "Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads". Country unknown/Code not available. doi:10.1155/2011/495849.
@article{osti_1197845,
title = {Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads},
author = {Rosen, Gail L. and Polikar, Robi and Caseiro, Diamantino A. and Essinger, Steven D. and Sokhansanj, Bahrad A.},
abstractNote = {High-throughput sequencing technologies enable metagenome profiling, the simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data include sequence fragments (“reads”) from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between “known” and “unknown” taxa and that can be used with composition-based methods as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an “unknown” class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially at the species level and for ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.},
doi = {10.1155/2011/495849},
journal = {Journal of Biomedicine and Biotechnology},
number = {},
volume = 2011,
place = {Country unknown/Code not available},
year = {2011},
month = {1}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
DOI: 10.1155/2011/495849

Citation Metrics:
Cited by: 7 works
Citation information provided by
Web of Science
