MeSH key terms for validation and annotation of gene expression clusters

Rechtsteiner, A; Rocha, L M

Conference · Thu Jan 01 00:00:00 EST 2004

OSTI ID:977432

Rechtsteiner, Andreas; Rocha, Luis Mateus

Integration of different sources of information is a great challenge for the analysis of gene expression data, and for the field of Functional Genomics in general. As the availability of numerical data from high-throughput methods increases, so does the need for technologies that assist in the validation and evaluation of the biological significance of results extracted from these data. In mRNA assaying with microarrays, for example, numerical analysis often attempts to identify clusters of co-expressed genes. The important task of finding the biological significance of the results and validating them has so far mostly fallen to the biological expert, who has had to perform this task manually. One of the most promising avenues for developing automated and integrative technology for such tasks lies in the application of modern Information Retrieval (IR) and Knowledge Management (KM) algorithms to databases of biomedical publications and data. Examples of databases available for the field are bibliographic databases containing scientific publications (e.g. MEDLINE/PUBMED), databases containing sequence data (e.g. GenBank) and databases of semantic annotations (e.g. the Gene Ontology Consortium and Medical Subject Headings (MeSH)). We present here an approach that uses the MeSH terms and their concept hierarchies to validate and obtain functional information for gene expression clusters. The controlled and hierarchical MeSH vocabulary is used by the National Library of Medicine (NLM) to index all the articles cited in MEDLINE. Such indexing with a controlled vocabulary eliminates some of the ambiguity due to polysemy (terms that have multiple meanings) and synonymy (multiple terms that have similar meanings) that would be encountered if terms were extracted directly from the articles, where word choice varies with article context and with author preferences and background.
Further, the hierarchical organization of the MeSH terms can illustrate the conceptual/functional relationships of genes associated with MeSH terms. MeSH terms can be associated with genes through their co-occurrence in MEDLINE citations, i.e. the genes occur in titles or abstracts and the MeSH terms are assigned by experts. To identify MeSH terms associated with a group of genes we used the tool MESHGENE, developed at the Information Dynamics Lab at HP Labs (http://www-idl.hpl.hp.com/meshgene/). When presented with a list of human genes, MESHGENE searches for these gene symbols in the titles and abstracts of all MEDLINE citations. The MeSH terms and the number of co-occurrences can then be retrieved. Gene symbols that are aliases of each other are pooled from several databases. This addresses the problem of synonymy, the fact that several symbols can refer to the same gene. MESHGENE also employs algorithms that disregard symbols that are likely to be acronyms for concepts other than a gene. This addresses the problem of polysemy, i.e. possible multiple meanings of a gene symbol. We applied our approach to gene expression data from herpes virus infected human fibroblast cells. The data contain 12 time points, between 0.5 hours and 48 hours after infection. Singular Value Decomposition (SVD) was used to identify the dominant modes of expression. 75% of the variance in the expression data was captured by the first two modes, the first exhibiting a monotonically increasing expression pattern and the second a more transient pattern. Projection of the gene expression vectors onto these first two modes identified 3 statistically significant clusters of co-expressed genes. 500 genes from cluster 1 and 300 genes each from clusters 2 and 3 were uploaded to MESHGENE, and the MeSH terms and co-occurrence values were retrieved. MeSH terms were also obtained for 5 groups of randomly selected genes with similar numbers of genes.
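The SVD step described above (dominant modes, fraction of variance captured, projection of gene vectors onto the leading modes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the toy matrix, the function names `top_modes`, `variance_fraction` and `project`, and the choice not to center the data are all hypothetical. To keep the sketch dependency-free it uses power iteration with deflation in place of a library SVD routine, which would be the usual choice in practice.

```python
import math
import random


def top_modes(X, k=2, iters=500, seed=0):
    """Estimate the k largest singular values of X and the matching right
    singular vectors ("modes") by power iteration on X^T X with deflation."""
    rng = random.Random(seed)
    n = len(X[0])
    # Gram matrix G = X^T X; its eigenvalues are the squared singular values.
    G = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    sigmas, modes = [], []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-300:  # nothing left to extract
                break
            v = [x / norm for x in w]
        Gv = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * Gv[i] for i in range(n))  # Rayleigh quotient
        sigmas.append(math.sqrt(max(lam, 0.0)))
        modes.append(v)
        # Deflate: remove the mode just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return sigmas, modes


def variance_fraction(X, sigmas):
    """Share of the total sum of squares carried by the given singular values."""
    total = sum(x * x for row in X for x in row)
    return sum(s * s for s in sigmas) / total


def project(X, modes):
    """Coordinates of each row (gene) of X along each mode."""
    return [[sum(a * b for a, b in zip(row, m)) for m in modes] for row in X]


# Made-up data: 6 genes x 4 time points, mixing a rising and a transient pattern.
rising = [1.0, 2.0, 3.0, 4.0]
transient = [0.0, 2.0, 2.0, 0.0]
weights = [(1.0, 0.0), (0.9, 0.1), (0.1, 1.0), (0.0, 0.8), (0.5, 0.5), (1.2, 0.2)]
X = [[a * r + b * t for r, t in zip(rising, transient)] for a, b in weights]

sigmas, modes = top_modes(X, k=2)
frac = variance_fraction(X, sigmas)
coords = project(X, modes)
print(f"fraction of variance in first two modes: {frac:.3f}")
```

Because the toy matrix is built from exactly two patterns, the first two modes account for essentially all of its variance; real expression data would give a smaller share, such as the 75% reported here. The two-column `coords` table is the kind of low-dimensional representation in which clusters of co-expressed genes could then be sought.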
The logarithm of the co-occurrence values was taken, and for each MeSH term these log co-occurrence values were summed over the genes in each group. This yielded a matrix with 8 columns, one for each of the 8 groups of genes, and 14,000 rows, one for each MeSH term. To analyze this association matrix we used a Latent Semantic Analysis (LSA) approach, applying SVD to the gene-group vs. MeSH term association matrix. The first 2 modes, which capture most of the variation (and therefore usually also most of the information) in the association matrix, were highly associated with MeSH terms that occurred uniquely or disproportionately in the 3 gene clusters. MeSH terms highly associated with the 5 groups of randomly selected genes were associated with the lower modes. These modes seem to capture only 'noise' in the association matrix. This result by itself is of great interest for gene expression analysis. We were able to show that the 3 clusters of genes are separated not only in 'expression space' but also in the space of MeSH terms with which they are associated through the literature.
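The construction of the gene-group vs. MeSH term association matrix can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the function name `association_matrix`, the placeholder gene symbols, the co-occurrence counts, the use of the natural logarithm, and the convention that gene-term pairs with no co-occurrence contribute nothing. The abstract does not specify the base of the logarithm or how zero counts were treated.

```python
import math


def association_matrix(groups, cooccurrence):
    """Build a MeSH-term x gene-group matrix of summed log co-occurrences.

    groups:       dict mapping group name -> list of gene symbols
    cooccurrence: dict mapping (gene, mesh_term) -> positive co-occurrence count
    """
    terms = sorted({term for (_, term) in cooccurrence})
    names = list(groups)
    matrix = []
    for term in terms:
        row = []
        for name in names:
            total = 0.0
            for gene in groups[name]:
                count = cooccurrence.get((gene, term), 0)
                if count > 0:  # assumption: absent pairs are simply skipped
                    total += math.log(count)
            row.append(total)
        matrix.append(row)
    return terms, names, matrix


# Made-up example: placeholder genes, illustrative counts.
groups = {"cluster_1": ["GENE_A", "GENE_B"], "random_1": ["GENE_C"]}
cooccurrence = {
    ("GENE_A", "Apoptosis"): 4,
    ("GENE_B", "Apoptosis"): 2,
    ("GENE_C", "Apoptosis"): 1,
    ("GENE_C", "Cell Cycle"): 3,
}

terms, names, matrix = association_matrix(groups, cooccurrence)
for term, row in zip(terms, matrix):
    print(term, [round(v, 3) for v in row])
```

Under this reading, a single co-occurrence contributes log(1) = 0, so only repeated co-occurrences add weight. A matrix of this shape (one row per MeSH term, one column per gene group) is what the LSA step would then decompose with SVD.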


Research Organization: Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)

Sponsoring Organization: USDOE

OSTI ID: 977432

Report Number(s): LA-UR-04-0545; LA-UR-04-545; TRN: US201009%%760

Resource Relation: Conference: Submitted to: 8th annual International Conference on Research in Computational Molecular Biology

Country of Publication: United States

Language: English

Similar Records

41. DISCOVERY, SEARCH, AND COMMUNICATION OF TEXTUAL KNOWLEDGE RESOURCES IN DISTRIBUTED SYSTEMS a. Discovering and Utilizing Knowledge Sources for Metasearch Knowledge Systems

Technical Report · Tue Mar 18 00:00:00 EDT 2008 · OSTI ID:977432

Zamora, Antonio

Automatic image analysis for gene expression patterns of fly embryos

Journal Article · Sun Jul 01 00:00:00 EDT 2007 · BMC Cell Biology · OSTI ID:977432

Peng, Hanchuan; Long, Fuhui; Zhou, Jie; +3 more

On the universal structure of human lexical semantics

Journal Article · Mon Feb 01 00:00:00 EST 2016 · Proceedings of the National Academy of Sciences of the United States of America · OSTI ID:977432

Youn, Hyejin; Sutton, Logan; Smith, Eric; +5 more

Related Subjects

59 BASIC BIOLOGICAL SCIENCES
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
ALGORITHMS
AVAILABILITY
EVALUATION
FIBROBLASTS
FUNCTIONALS
GENES
INFORMATION RETRIEVAL
KNOWLEDGE MANAGEMENT
MEDICINE
MOLECULAR BIOLOGY
NUMERICAL ANALYSIS
NUMERICAL DATA
TRANSIENTS
VALIDATION
VECTORS
