OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Final Report PIPELINING RDP DATA TO THE "TAXOMATIC"

Technical Report · DOI: https://doi.org/10.2172/1053503 · OSTI ID: 1053503

This project builds on the results of previously funded research by integrating data and software that had been used in building resources for the preparation of Bergey's Manual of Systematic Bacteriology, 2nd Edition (Volumes 1 & 2A-C) and the Ribosomal Database Project-II (RDP-II), so as to both enhance the value of the data and create a pipeline approach to keeping the data current. Earlier, we demonstrated the value of using exploratory data analysis (EDA) to visualize large sets of sequence data (notably SSU rRNA gene sequences) used in constructing a comprehensive phylogeny of prokaryotes. While the Self-Organizing Self-Correcting Classification (SOSCC) algorithms we developed were computationally efficient and useful for unraveling problems within the underlying data (e.g., identification of annotation errors, detection of unresolved synonymies, taxonomic and nomenclatural errors), bottlenecks at the preprocessing stage limited deployment of our applications as tools for end-users. To overcome the bottlenecks (which included hand alignment and computation of large matrices of pair-wise evolutionary distances), we proposed building a data pipeline between the "Taxomatic" application and RDP-II. The objectives were to accelerate the production and distribution of updated versions of the prokaryotic taxonomy in lock-step with publication of new taxa and rearrangement of existing taxa, and to share these data more readily with RDP-II and other stakeholders in the community. A related goal of the current project is to deploy our visualization techniques as an interactive web application in which end-users can view, manipulate, and select datasets of particular interest based upon phylogenetic and genomic information, and can access sequence data and, ultimately, the scientific literature in which the original observations were made, along with later work that builds on those observations.
The Taxomatic is a web-based tool for visualizing distance matrices. The tool accepts either raw distance matrices or aligned sequence information as data sources. When sequence information is provided, the distance matrix is computed using the uncorrected distance model. Users can upload files to the Taxomatic website, or sequences can be submitted through a SOAP service. This SOAP service is used by RDP to streamline Taxomatic use with RDP data. In addition to supplying source information, users can supply their own taxonomic information by uploading it in XML; retrieve taxonomic information from the RDP using either RDP or GenBank identifiers as source data, with or without classification by the RDP Classifier web service; or omit taxonomic data completely. In the last case, the input distance matrix is viewed in the order in which it was loaded.
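
As an illustration of the uncorrected distance model mentioned above, the following is a minimal Python sketch of how a pairwise distance matrix might be derived from aligned sequences. It is not the Taxomatic's own code. The function names (`p_distance`, `distance_matrix`), the toy sequences, and the choice to skip alignment columns in which either sequence has a gap are all assumptions made for this example; the report does not state how gaps or ambiguous bases are handled.

```python
from itertools import combinations

# Characters treated as alignment gaps (an assumption for this sketch).
GAP_CHARS = {"-", "."}


def p_distance(seq_a, seq_b):
    """Uncorrected (p) distance: fraction of compared sites that differ.

    Columns where either sequence has a gap are skipped (assumed policy).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return mismatches / compared


def distance_matrix(alignment):
    """Build a symmetric distance matrix from a {name: aligned_sequence} dict.

    Rows and columns follow the order in which sequences were supplied.
    """
    names = list(alignment)
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = p_distance(alignment[names[i]], alignment[names[j]])
        matrix[i][j] = d
        matrix[j][i] = d
    return names, matrix


# Hypothetical toy alignment, for illustration only.
toy_alignment = {
    "seq_a": "ACGU-ACGU",
    "seq_b": "ACGUAACGA",
    "seq_c": "ACCU-AGGU",
}

if __name__ == "__main__":
    names, matrix = distance_matrix(toy_alignment)
    for name, row in zip(names, matrix):
        print(name, " ".join(f"{value:.3f}" for value in row))
```

Because the matrix is kept in input order, this sketch also mirrors the case described above in which no taxonomic data are supplied and the matrix is viewed in the order in which it was loaded.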

Research Organization:
Michigan State Univ., East Lansing, MI (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Biological and Environmental Research (BER)
DOE Contract Number:
FG02-04ER63933
OSTI ID:
1053503
Report Number(s):
DOE -MICHIGAN STATE- 63933
Country of Publication:
United States
Language:
English