U.S. Department of Energy
Office of Scientific and Technical Information

Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span

Journal Article · BMC Bioinformatics
 [1];  [2];  [3];  [2]
  1. Princeton Univ., NJ (United States). Computer Science Dept.; DOE/OSTI
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Life Sciences Division
  3. Univ. of California, Berkeley, CA (United States). Dept. of Statistics; Univ. of California, Berkeley, CA (United States). Dept. of EECS

Background: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model that represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words.

Results: An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model and to suggest additional candidates for age-related genes. A novel pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs facilitated the generation of hypotheses about the function and role of clk-2.

Conclusion: Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
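The two modeling ideas in the abstract can be illustrated with a minimal sketch. The first part follows the standard LDA generative process: a document draws topic proportions from a Dirichlet distribution, and each word draws a topic and then a word from that topic. The second part ranks documents by how close their topic proportions are to those of a query document. The abstract does not specify the similarity measure, so the Jensen-Shannon divergence used here is a stand-in assumption and not the authors' measure. The toy vocabulary, topic distributions, and all function names are hypothetical.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, topics, vocab, n_words, rng):
    """Standard LDA generative process for a single document.

    theta ~ Dirichlet(alpha); for each word position,
    z ~ Categorical(theta) and w ~ Categorical(topics[z]).
    """
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)      # latent topic for this word
        w = sample_categorical(topics[z], rng)  # word index drawn from that topic
        words.append(vocab[w])
    return theta, words


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions.

    Assumption: used here only as an illustrative distance on the topic
    simplex; the record does not state which measure the paper uses.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_by_similarity(query_theta, corpus_thetas):
    """Return document indices ordered from most to least similar to the query."""
    scored = [(js_divergence(query_theta, t), i) for i, t in enumerate(corpus_thetas)]
    return [i for _, i in sorted(scored)]


# Toy example: two hypothetical topics over a four-word vocabulary.
rng = random.Random(0)
vocab = ["aging", "insulin", "telomere", "neuron"]
topics = [
    [0.50, 0.40, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
]
theta, words = generate_document([0.5, 0.5], topics, vocab, 8, rng)
ranking = rank_by_similarity([0.9, 0.1], [[0.1, 0.9], [0.8, 0.2], [0.9, 0.1]])
```

In a real setting the per-document topic proportions would come from posterior inference in a fitted LDA model, not from forward sampling as in this toy example.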

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Biological and Environmental Research (BER). Biological Systems Science Division
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1626320
Journal Information:
Journal Name: BMC Bioinformatics; Journal Volume: 7; Journal Issue: 1; ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English

Cited By (7)

Mining Relational Paths in Integrated Biomedical Data journal December 2011
Getting started in probabilistic graphical models text January 2007
Discovering topic structures of a temporally evolving document corpus journal August 2017
Opioid Discussion in the Twittersphere journal April 2018
Semantic Breakthrough in Drug Discovery journal October 2014
Opioid Discussion in the Twittersphere text January 2018

Similar Records

Word prediction
Technical Report · Mon May 01 00:00:00 EDT 1995 · OSTI ID:123254

Genes that regulate both development and longevity in Caenorhabditis elegans
Journal Article · Fri Mar 31 23:00:00 EST 1995 · Genetics · OSTI ID:91162

A novel procedure on next generation sequencing data analysis using text mining algorithm
Journal Article · Fri May 13 00:00:00 EDT 2016 · BMC Bioinformatics · OSTI ID:1626761