Title: BioWarehouse: a bioinformatics database warehouse toolkit

Journal Article · BMC Bioinformatics
 [1];  [1];  [1];  [1];  [2];  [3];  [1]
  1. SRI International, Menlo Park, CA (United States). Bioinformatics Research Group
  2. SRI International, Menlo Park, CA (United States). Computer Science Lab.
  3. Stanford Univ., CA (United States). Stanford Medical Informatics

Background: This article addresses the problem of interoperation of heterogeneous bioinformatics databases.
Results: We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system. This enables multi-database queries using the Structured Query Language (SQL) and also facilitates a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and Java languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. We found that no sequence exists for 36% of the enzyme activities to which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier to metabolic engineering. Complex queries of this type illustrate the value of the data warehousing approach to bioinformatics research.
Conclusion: BioWarehouse embodies significant progress on the database integration problem for bioinformatics.
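
The kind of multi-database query described in the abstract (which EC-numbered enzyme activities have no associated sequence) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the table and column names (enzyme_activity, protein, protein_activity, source_db) are hypothetical and are not the BioWarehouse schema, the rows are made-up toy data, and SQLite stands in for the MySQL and Oracle managers named in the abstract so that the example is self-contained.

```python
import sqlite3


def build_toy_warehouse():
    """Create an in-memory database with a hypothetical, simplified schema.

    Each table carries a source_db column to mimic rows loaded from
    different source databases into one common schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE enzyme_activity (ec_number TEXT PRIMARY KEY, source_db TEXT);
        CREATE TABLE protein (id INTEGER PRIMARY KEY, source_db TEXT, sequence TEXT);
        CREATE TABLE protein_activity (protein_id INTEGER, ec_number TEXT);
        """
    )
    # Toy activities with placeholder identifiers (not real EC numbers).
    conn.executemany(
        "INSERT INTO enzyme_activity VALUES (?, ?)",
        [("toy.1", "ENZYME"), ("toy.2", "ENZYME"),
         ("toy.3", "ENZYME"), ("toy.4", "ENZYME")],
    )
    # Toy proteins; protein 3 has no sequence recorded.
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [(1, "UniProt", "MSTAG"), (2, "GenBank", "MKVLA"), (3, "UniProt", None)],
    )
    # Links between proteins and activities; toy.4 has no protein at all.
    conn.executemany(
        "INSERT INTO protein_activity VALUES (?, ?)",
        [(1, "toy.1"), (2, "toy.2"), (3, "toy.3")],
    )
    return conn


def fraction_without_sequence(conn):
    """Return the fraction of activities with no linked, sequenced protein."""
    total, orphan = conn.execute(
        """
        SELECT COUNT(*),
               SUM(CASE WHEN NOT EXISTS (
                       SELECT 1
                       FROM protein_activity pa
                       JOIN protein p ON p.id = pa.protein_id
                       WHERE pa.ec_number = a.ec_number
                         AND p.sequence IS NOT NULL
                   ) THEN 1 ELSE 0 END)
        FROM enzyme_activity a
        """
    ).fetchone()
    return orphan / total if total else 0.0


if __name__ == "__main__":
    warehouse = build_toy_warehouse()
    print(fraction_without_sequence(warehouse))
```

Because all sources sit in one database under one schema, the question reduces to a single SQL statement with a correlated subquery. On the toy data, two of the four activities lack a sequence, so the result is 0.5; this figure is a property of the toy rows only and has no relation to the 36% reported in the abstract.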

Research Organization:
Stanford Univ., CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Biological and Environmental Research (BER). Biological Systems Science Division; Defense Advanced Research Projects Agency (DARPA)
Grant/Contract Number:
FG03-01ER63219; F30602-01-C-0153
OSTI ID:
1626315
Journal Information:
BMC Bioinformatics, Vol. 7, Issue 1; ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
