OSTI.GOV: U.S. Department of Energy, Office of Scientific and Technical Information

Title: The value of prior knowledge in discovering motifs with MEME

Abstract

MEME is a tool for discovering motifs in sets of protein or DNA sequences. This paper describes several extensions to MEME which increase its ability to find motifs in a totally unsupervised fashion, but which also allow it to benefit when prior knowledge is available. When no background knowledge is asserted, MEME obtains increased robustness from a method for determining motif widths automatically, and from probabilistic models that allow motifs to be absent in some input sequences. On the other hand, MEME can exploit prior knowledge about a motif being present in all input sequences, about the length of a motif and whether it is a palindrome, and (using Dirichlet mixtures) about expected patterns in individual motif positions. Extensive experiments are reported which support the claim that MEME benefits from, but does not require, background knowledge. The experiments use seven previously studied DNA and protein sequence families and 75 of the protein families documented in the Prosite database of sites and patterns, Release 11.1.
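The abstract mentions Dirichlet mixtures as a way to encode prior knowledge about individual motif positions. The sketch below shows the standard posterior-mean estimate under a Dirichlet mixture prior for a single motif column; it is an illustration under stated assumptions, not the code used in MEME. The function names, the two-component DNA mixture, and all numeric parameters are hypothetical.

```python
from math import exp, lgamma, log


def log_beta(v):
    """Log of the multivariate beta function B(v)."""
    return sum(lgamma(x) for x in v) - lgamma(sum(v))


def dirichlet_mixture_estimate(counts, weights, alphas):
    """Posterior-mean letter probabilities for one motif column.

    counts  : observed letter counts in the column
    weights : mixture coefficients q_j (summing to 1)
    alphas  : one Dirichlet parameter vector per mixture component
    """
    # Unnormalized log posterior of each component:
    # log q_j + log B(counts + alpha_j) - log B(alpha_j)
    log_post = []
    for q, alpha in zip(weights, alphas):
        shifted = [n + a for n, a in zip(counts, alpha)]
        log_post.append(log(q) + log_beta(shifted) - log_beta(alpha))

    # Normalize in a numerically stable way.
    m = max(log_post)
    post = [exp(lp - m) for lp in log_post]
    z = sum(post)
    post = [p / z for p in post]

    # Mix the per-component pseudocount estimates.
    total = sum(counts)
    probs = [0.0] * len(counts)
    for pj, alpha in zip(post, alphas):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            probs[i] += pj * (n + a) / denom
    return probs


# Hypothetical two-component prior over (A, C, G, T):
# one AT-rich component and one GC-rich component.
WEIGHTS = [0.5, 0.5]
ALPHAS = [[2.0, 0.5, 0.5, 2.0], [0.5, 2.0, 2.0, 0.5]]

if __name__ == "__main__":
    print(dirichlet_mixture_estimate([8, 0, 0, 2], WEIGHTS, ALPHAS))
```

With no observations the estimate reduces to the prior mean. As counts accumulate, the component that best explains the column dominates, so letters never seen in the column still receive small nonzero probability.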

Authors:
Bailey, T.L.; Elkan, C. [1]
  1. Univ. of California at San Diego, La Jolla, CA (United States)
Publication Date:
Research Org.:
Stanford Univ., CA (United States)
OSTI Identifier:
401825
Report Number(s):
CONF-9507246-
TRN: 96:005602-0003
Resource Type:
Technical Report
Resource Relation:
Conference: Intelligent Systems for Molecular Biology (ISMB) conference, Cambridge (United Kingdom), 16-19 Jul 1995; Other Information: PBD: 1995; Related Information: Is Part Of ISMB-95 -- Third international conference on intelligent systems for molecular biology: Proceedings; Rawlings, C.; Clark, D.; Altman, R.; Hunter, L.; Lengauer, T.; Wodak, S. [eds.]; PB: 427 p.
Country of Publication:
United States
Language:
English
Subject:
55 BIOLOGY AND MEDICINE, BASIC STUDIES; 99 MATHEMATICS, COMPUTERS, INFORMATION SCIENCE, MANAGEMENT, LAW, MISCELLANEOUS; DNA; M CODES; PROTEIN STRUCTURE; MIXTURES; MOLECULAR BIOLOGY; DNA SEQUENCING; PROBABILITY

Citation Formats

Bailey, T.L., and Elkan, C. The value of prior knowledge in discovering motifs with MEME. United States: N. p., 1995. Web.
Bailey, T.L., & Elkan, C. The value of prior knowledge in discovering motifs with MEME. United States.
Bailey, T.L., and Elkan, C. 1995. "The value of prior knowledge in discovering motifs with MEME". United States.
@article{osti_401825,
title = {The value of prior knowledge in discovering motifs with MEME},
author = {Bailey, T.L. and Elkan, C.},
place = {United States},
year = 1995
}

  • Advanced Natural Language Processing Tools for Web Information Retrieval, Content Analysis, and Synthesis. The goal of this SBIR was to implement and evaluate several advanced Natural Language Processing (NLP) tools and techniques to enhance the precision and relevance of search results by analyzing and augmenting search queries and by helping to organize the search output obtained from heterogeneous databases and web pages containing textual information of interest to DOE and the scientific-technical user communities in general. The SBIR investigated 1) the incorporation of spelling checkers in search applications, 2) identification of significant phrases and concepts using a combination of linguistic and statistical techniques, and 3) enhancement of the query interface and search retrieval results through the use of semantic resources, such as thesauri. A search program with a flexible query interface was developed to search reference databases with the objective of enhancing search results from web queries or queries of specialized search systems such as DOE's Information Bridge. The DOE ETDE/INIS Joint Thesaurus was processed to create a searchable database. Term frequencies and term co-occurrences were used to enhance the web information retrieval by providing algorithmically-derived objective criteria to organize relevant documents into clusters containing significant terms. A thesaurus provides an authoritative overview and classification of a field of knowledge. By organizing the results of a search using the thesaurus terminology, the output is more meaningful than when the results are just organized based on the terms that co-occur in the retrieved documents, some of which may not be significant. An attempt was made to take advantage of the hierarchy provided by broader and narrower terms, as well as other field-specific information in the thesauri.
The search program uses linguistic morphological routines to find relevant entries regardless of whether terms are stored in singular or plural form. Implementation of additional inflectional morphology processes for verbs can enhance retrieval further, but this has to be balanced by the possibility of broadening the results too much. In addition to the DOE energy thesaurus, other sources of specialized organized knowledge such as the Medical Subject Headings (MeSH), the Unified Medical Language System (UMLS), and Wikipedia were investigated. The supporting role of the NLP thesaurus search program was enhanced by incorporating spelling aid and a part-of-speech tagger to cope with misspellings in the queries and to determine the grammatical roles of the query words and identify nouns for special processing. To improve precision, multiple modes of searching were implemented, including Boolean operators and field-specific searches. Programs to convert a thesaurus or reference file into searchable support files can be deployed easily, and the resulting files are immediately searchable to produce relevance-ranked results with built-in spelling aid, morphological processing, and advanced search logic. Demonstration systems were built for several databases, including the DOE energy thesaurus.
  • A new category of protein motif is introduced. This type of motif captures, in addition to global structure, the nested structure of its component parts. A dataset of four proteins is represented using this scheme. A structured machine discovery procedure is used to discover recurrent amino acid motifs and this knowledge is utilized for the expression of subsequent protein motif discoveries. Examples of discovered multilevel motifs are presented.
  • Advances in component life prediction techniques have prompted increased interest in quantitative nondestructive characterization of flaws in engineering materials. Flaw characterization techniques utilize a signature from the flaw. In ultrasonics, the signature is estimated from noise-corrupted experimental measurements of the scattered acoustic wave field resulting from insonification of the flaw. Estimating the flaw's signature involves removing the effects of the measurement system in the presence of noise. In the frequency domain, the flaw's signature is called a scattering amplitude. The purpose of this work is to evaluate an optimal Wiener filtering approach to scattering amplitude estimation.
  • The Value Prior to Pulping (VPP) project goal was to demonstrate the technical and commercial feasibility of introducing a new value stream into existing pulp and paper mills. Essentially the intent was to transfer the energy content of extracted hemicellulose from electricity and steam generated in the recovery boiler to a liquid transportation fuel. The hemicellulose fraction was extracted prior to pulping, fractionated, or conditioned if necessary, and fermented to ethanol. Commercial adaptation of the process to wood hemicelluloses was a prerequisite for using this less currently valued component available from biomass and wood. These hemicelluloses are predominately glucurono-xylan in hardwoods and galactoglucomannan in softwoods (with a significant softwood component of an arabino-xylan) and will yield fermentation substrates different from cellulose. NREL provided its expertise in the area of fermentation host evaluation using its Zymomonas strains on the CleanTech Partner's (CTP) VPP project. The project was focused on the production of fuel ethanol and acetic acid from hemicellulose streams generated from wood chips of industrially important hardwood and softwood species. NREL was one of four partners whose ethanologen was tested on the hydrolyzed extracts. The use of commercially available enzymes to treat oligomeric sugar extracts was also investigated and coupled with fermentation. Fermentations by NREL were conducted with the Zymomonas mobilis organism with most of the work being performed with the 8b strain. The wood extracts hydrolyzed and/or fermented by NREL were those derived from maple, mixed southern hardwoods, and loblolly pine. An unhydrolyzed variant of the mixed southern hardwood extract possessed a large concentration of oligomeric sugars and enzymatic hydrolysis was performed with a number of enzymes, followed by fermentation.
The fermentation of the wood extracts was carried out at bench scale in flasks or small bioreactors, with a maximum volume of 500 mL.
  • The objective of this project was to evaluate the technical and economic viability of producing biofuels from hemicellulose extracted from wood chips prior to pulp and paper production.