OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Challenges in Microbial Database Interoperability Interagency Microbe Project Working Group

Technical Report
DOI: https://doi.org/10.2172/15005933 · OSTI ID: 15005933

Currently, data of interest to microbial researchers is spread across hundreds of web-accessible data sources, each with a unique interface and data format. Researchers interact with a few of these sites when they analyze their data, but cannot use the majority of them on a regular basis. Two significant challenges must be overcome to integrate this environment and allow researchers to efficiently perform data analysis across the entire set of relevant data, or at least a significant portion of it. The first is to provide consistent access to the large number of heterogeneous data sets that are currently distributed over the web. The second is to define the semantics of the data provided by the individual sites in such a way that semantic conflicts can be identified and, ideally, resolved. The first step in establishing any integrated environment, from a data warehouse to a multi-database system, is to provide consistent access to all of the relevant sources. The type of access required will vary with the integration strategy chosen (for example, federated systems use query-based access, while warehouses may prefer access to the underlying database), but the essence of this challenge remains the same. Thus, without loss of generality, the remainder of this discussion focuses on query-based access. Each data source independently determines the queries that it supports, how it will answer them, and the interface through which it will make them available. Even when different sources provide the same query capability, the details of the interface usually differ. For example, while many sequence data sources support BLAST searches, they differ in parameter names, available options, script locations, and so on. 
These differences are not restricted to input parameters; the query results returned by different sources also vary dramatically, with some sources returning XML, others preformatted text, and still others a variety of formats. This set of disparate interfaces makes developing an integrated environment extremely challenging because a specialized wrapper needs to be created for each data source. Once consistent data access has been provided, the next challenge is to provide a semantically and syntactically consistent environment for scientists to use, which would allow them to transfer data smoothly between different query interfaces and applications. Unfortunately, this is an even more daunting task than providing data access, because resolving semantic differences between data sources first requires understanding the semantics each source uses. Currently, a source's semantic description of its data is usually buried in its documentation, if it is provided at all. As a result, scientists have become adept at simply looking at the data provided and divining a first-order approximation of the semantics used by the source. Often, this approximation is sufficient for the types of queries being asked. However, when precise semantics are needed, a tedious and time-consuming search must be undertaken. Fortunately, some communities are becoming aware of this problem and are developing ontologies that overcome it by precisely defining the semantics of commonly used terms. While this simplifies data integration for those sources that adhere to a specific ontology, the definition of a single ontology for the entire domain of genomics remains a (probably unachievable) dream. Resolving syntactic differences is relatively straightforward once the semantic ones have been resolved.
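
The per-source wrapper idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the report: the two sources, their parameter names (`seq`/`max_hits`, `QUERY`/`N`), the response layouts, and every class and function name below are hypothetical, and canned strings stand in for real network responses. Each wrapper translates one common query into its source's own parameters and parses that source's output (XML in one case, tab-separated text in the other) into a shared record type.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Common record that every wrapper returns (hypothetical schema)."""
    accession: str
    score: float


class SourceAWrapper:
    """Hypothetical source: parameters 'seq'/'max_hits', XML results."""

    def build_params(self, sequence, limit):
        return {"seq": sequence, "max_hits": str(limit)}

    def parse(self, raw):
        root = ET.fromstring(raw)
        return [Hit(h.get("acc"), float(h.get("score"))) for h in root.iter("hit")]


class SourceBWrapper:
    """Hypothetical source: parameters 'QUERY'/'N', tab-separated text results."""

    def build_params(self, sequence, limit):
        return {"QUERY": sequence, "N": str(limit)}

    def parse(self, raw):
        hits = []
        for line in raw.strip().splitlines():
            if line.startswith("#"):  # skip the header comment line
                continue
            accession, score = line.split("\t")
            hits.append(Hit(accession, float(score)))
        return hits


def merge_hits(wrappers_and_responses):
    """Parse each raw response with its wrapper; return all hits, best score first."""
    merged = []
    for wrapper, raw in wrappers_and_responses:
        merged.extend(wrapper.parse(raw))
    return sorted(merged, key=lambda h: h.score, reverse=True)


# Canned responses stand in for real network calls (assumed formats).
XML_RESPONSE = '<hits><hit acc="A1" score="52.0"/><hit acc="A2" score="40.5"/></hits>'
TEXT_RESPONSE = "# accession\tscore\nB7\t61.2\nB9\t38.0\n"

results = merge_hits([
    (SourceAWrapper(), XML_RESPONSE),
    (SourceBWrapper(), TEXT_RESPONSE),
])
```

Note that sorting the merged hits assumes the two sources' scores mean the same thing and share a scale. The wrappers only remove the syntactic differences; whether the scores are actually comparable is the semantic question the abstract identifies as the harder problem.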

Research Organization:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Organization:
US Department of Energy (US)
DOE Contract Number:
W-7405-ENG-48
OSTI ID:
15005933
Report Number(s):
UCRL-ID-146327; TRN: US200402%%220
Resource Relation:
Other Information: PBD: 21 Nov 2001
Country of Publication:
United States
Language:
English

Similar Records

Design of the National Bioforensics Library Infrastructure
Technical Report · February 2, 2004 · OSTI ID: 15005933

Automatic generation of warehouse mediators using an ontology engine
Conference · March 4, 1998 · OSTI ID: 15005933

Meta-data based mediator generation
Conference · June 28, 1998 · OSTI ID: 15005933