BioWarehouse: a bioinformatics database warehouse toolkit

Lee, Thomas J.; Pouliot, Yannick; Wagner, Valerie; Gupta, Priyanka; Stringer-Calvert, David W. J.; Tenenbaum, Jessica D.; Karp, Peter D.

doi:10.1186/1471-2105-7-170

Journal Article · 23 March 2006 · BMC Bioinformatics

DOI: https://doi.org/10.1186/1471-2105-7-170 · OSTI ID: 1626315

Lee, Thomas J. [1]; Pouliot, Yannick [1]; Wagner, Valerie [1]; Gupta, Priyanka [1]; Stringer-Calvert, David W. J. [2]; Tenenbaum, Jessica D. [3]; Karp, Peter D. [1]

[1] SRI International, Menlo Park, CA (United States). Bioinformatics Research Group
[2] SRI International, Menlo Park, CA (United States). Computer Science Lab.
[3] Stanford Univ., CA (United States). Stanford Medical Informatics

Background: This article addresses the problem of interoperation of heterogeneous bioinformatics databases.

Results: We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, enabling multi-database queries using the Structured Query Language (SQL) and facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, as well as the UniProt, GenBank, NCBI Taxonomy, and CMR databases and the Gene Ontology. Loader tools, written in the C and Java languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. We found that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier to metabolic engineering. Complex queries of this type illustrate the value of the data warehousing approach to bioinformatics research.

Conclusion: BioWarehouse embodies significant progress on the database integration problem for bioinformatics.
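
The application example in the abstract (the fraction of enzyme activities with no known sequence) can be illustrated with a minimal sketch of a warehouse-style query. Everything here is an assumption made for illustration: the table names `enzyme_activity` and `protein`, their columns, and the placeholder identifiers are hypothetical and are not the BioWarehouse schema, and an in-memory SQLite database stands in for the MySQL or Oracle servers named above. The point is only that once two sources sit in one relational database, the question reduces to a single SQL join.

```python
import sqlite3


def build_toy_warehouse():
    """Create an in-memory database with two hypothetical tables.

    enzyme_activity stands in for data loaded from an enzyme nomenclature
    source; protein stands in for data loaded from a sequence source.
    All rows are made-up placeholders, not real EC numbers or accessions.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE enzyme_activity (ec_number TEXT PRIMARY KEY);
        CREATE TABLE protein (accession TEXT PRIMARY KEY, ec_number TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO enzyme_activity VALUES (?)",
        [("EC-A",), ("EC-B",), ("EC-C",), ("EC-D",)],
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [("P1", "EC-A"), ("P2", "EC-A"), ("P3", "EC-B"), ("P4", "EC-C")],
    )
    return conn


def fraction_without_sequence(conn):
    """Return the fraction of enzyme activities with no protein row."""
    total, orphans = conn.execute(
        """
        SELECT COUNT(DISTINCT e.ec_number),
               COUNT(DISTINCT CASE WHEN p.ec_number IS NULL
                                   THEN e.ec_number END)
        FROM enzyme_activity AS e
        LEFT JOIN protein AS p ON p.ec_number = e.ec_number
        """
    ).fetchone()
    # Guard against an empty enzyme table to avoid division by zero.
    return orphans / total if total else 0.0


if __name__ == "__main__":
    warehouse = build_toy_warehouse()
    print(f"{fraction_without_sequence(warehouse):.0%} of toy activities lack a sequence")
```

The `LEFT JOIN` keeps every enzyme activity, and the `CASE` expression counts only those with no matching protein row; `COUNT(DISTINCT ...)` prevents activities with several proteins from being counted more than once.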

Research Organization: Stanford Univ., CA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Biological and Environmental Research (BER). Biological Systems Science Division; Defense Advanced Research Projects Agency (DARPA)

Grant/Contract Number: FG03-01ER63219; F30602-01-C-0153

OSTI ID: 1626315

Journal Information: BMC Bioinformatics, Vol. 7, Issue 1; ISSN 1471-2105

Publisher: BioMed Central

Country of Publication: United States

Language: English

Related Subjects

59 BASIC BIOLOGICAL SCIENCES
97 MATHEMATICS AND COMPUTING
Biochemistry & Molecular Biology
Biotechnology & Applied Microbiology
Mathematical & Computational Biology
