OSTI.GOV: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Sequence modelling and an extensible data model for genomic database

Abstract

The Human Genome Project (HGP) plans to sequence the human genome by the beginning of the next century. It will generate DNA sequences of more than 10 billion bases and complex marker sequences (maps) of more than 100 million markers. All of this information will be stored in database management systems (DBMSs). However, existing data models lack an abstraction mechanism for modelling sequences, and existing DBMSs lack operations on complex sequences. This work addresses the problem of sequence modelling in the context of the HGP and the more general problem of an extensible object data model that can incorporate the sequence model as well as existing and future data constructs and operators. First, we proposed a general sequence model that is application- and implementation-independent. This model is used to capture the sequence information found in the HGP at the conceptual level. In addition, abstract and biological sequence operators are defined for manipulating the modelled sequences. Second, we combined many features of semantic and object-oriented data models into an extensible framework, which we called the "Extensible Object Model", to address the need for a modelling framework that incorporates the sequence data model with other types of data constructs and operators. This framework is based on the conceptual separation between constructors and constraints. We then used this modelling framework to integrate the constructs for the conceptual sequence model. The Extensible Object Model is also defined with a graphical representation, which is useful as a tool for database designers. Finally, we defined a query language to support this model and implemented a query processor to demonstrate the feasibility of the extensible framework and the usefulness of the conceptual sequence model.

Authors:
 Li, Peter Wei-Der [1]
  1. California Univ., San Francisco, CA (United States); Univ. of California, Berkeley, CA (United States)
Publication Date:
January 1992
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
10159439
Report Number(s):
LBL-31935
ON: DE92017107
DOE Contract Number:
AC03-76SF00098
Resource Type:
Thesis/Dissertation
Resource Relation:
Other Information: TH: Thesis (Ph.D.); PBD: Jan 1992
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; GENETIC MAPPING; INFORMATION THEORY; DATA PROCESSING; MATHEMATICAL MODELS; MAN; PATTERN RECOGNITION; SET THEORY; DNA; 550200; 550400; BIOCHEMISTRY; GENETICS

Citation Formats

Li, Peter Wei-Der. Sequence modelling and an extensible data model for genomic database. United States: N. p., 1992. Web. doi:10.2172/10159439.
Li, Peter Wei-Der. Sequence modelling and an extensible data model for genomic database. United States. doi:10.2172/10159439.
Li, Peter Wei-Der. 1992. "Sequence modelling and an extensible data model for genomic database". United States. doi:10.2172/10159439. https://www.osti.gov/servlets/purl/10159439.
@article{osti_10159439,
title = {Sequence modelling and an extensible data model for genomic database},
author = {Li, Peter Wei-Der},
doi = {10.2172/10159439},
place = {United States},
year = {1992},
month = jan
}

Thesis/Dissertation:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this thesis or dissertation.

  • This study develops an approach to distributed query optimization which uses detailed database statistics regarding database instances. A Detailed Database Statistics Model (DDSM) is presented which (1) portrays statistics about a relational database in matrix notation, and (2) presents matrix algebra on the statistics, suitable for query optimization. The model is applicable to centralized as well as distributed database systems that support the relational model of data. The typical assumptions of uniformly distributed attribute values and independence among attributes are relaxed. Since computed statistics about the database are used, the model is expected to enable accurate evaluation of query processing alternatives and thus better query processing strategies. Results of a simulation study to evaluate the performance of the model and the matrix operations are also presented. The DDSM can be used in conjunction with existing query optimization algorithms, and existing local processing and/or data transfer cost models. A discussion of such interfacing is presented in the appendix.
  • Database machines are fundamentally distributed processing systems; however, they are based on several different approaches: associative disks, specialized CPUs, and conventional multiprocessor technology. It is not clear which of these approaches, or which particular designs, will be successful. This dissertation presents a step toward evaluating the performance of some of these database machines. An abstract model based on a distributed database machine, the MUFFIN system, was developed. The Simulation Program for the Analysis of Database Machines and Environments (SPADE) was then implemented to evaluate MUFFIN, but could be used to model other database machines. SPADE comprises two basic parts: the node model and the network model. The node model was designed to simulate the activity of distributed database software. The network model was designed to support interprocess (and node-to-node) communications. The simulation experiments examined the performance implications of distributed query processing and network support. Results indicate that some performance improvement could be gained by using the fragmented query processing technique and by decreasing the cost of messages and disk I/O. However, the greatest improvement in system throughput was observed when the cost of page-level database operations was decreased.
  • The search for computer architectures utilizing large numbers of processing elements and their application to suitable problems has been a continual quest for many researchers. This dissertation presents results of an analysis of a multiprocessor architecture applied to the problem of data base management. The data base problem is first re-stated in terms of a new data model, the Active Graph Model, which employs a graphical representation for data (nodes) and relationships (arcs) in addition to concepts from the data flow model of computation to exploit the parallel processing power of the architecture. The nodes of the graph are active elements that respond to requests in the form of tokens traveling along the arcs. This data model and its query language are shown to be relationally complete and therefore equivalent in expressive power to the Relational Model. A mesh-connected array of processing elements forms the basis for the architecture. The nodes and arcs of the model are mapped onto the architecture and practical algorithms are defined for distributing requests, data manipulation, and for sorting and reporting of results. Results of the experiments demonstrate that large numbers of processors can be used effectively, given a sufficiently large problem.