Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span

Blei, D. M.; Franks, K.; Jordan, M. I.; Mian, I. S.

doi:10.1186/1471-2105-7-250

Journal Article · Mon May 08 00:00:00 EDT 2006 · BMC Bioinformatics

DOI: https://doi.org/10.1186/1471-2105-7-250 · OSTI ID: 1626320

Blei, D. M. [1]; Franks, K. [2]; Jordan, M. I. [3]; Mian, I. S. [2]

[1] Princeton Univ., NJ (United States). Computer Science Dept.; DOE/OSTI
[2] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Life Sciences Division
[3] Univ. of California, Berkeley, CA (United States). Dept. of Statistics; Univ. of California, Berkeley, CA (United States). Dept. of EECS

Background: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words.

Results: An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2.

Conclusion: Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
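
The generative view of LDA described in the Background (a document as a random mixture over latent topics, each topic a distribution over words) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy topics, the vocabulary, the hyperparameter values, and the function names `sample_dirichlet` and `generate_document` are all assumptions made for illustration, and the sketch only generates documents; it does not estimate a model from the CGC corpus.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one document under the LDA generative process.

    topics: list of dicts mapping word -> probability (one dict per topic).
    alpha:  Dirichlet hyperparameter, one positive value per topic.
    """
    # Per-document topic proportions: a random mixture over latent topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Choose a topic for this word position according to theta.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Choose a word from that topic's distribution over words.
        vocab = list(topics[z])
        probs = [topics[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical toy topics; the real topics would be estimated from the corpus.
TOY_TOPICS = [
    {"lifespan": 0.5, "aging": 0.3, "clk-2": 0.2},
    {"telomere": 0.6, "yeast": 0.4},
]

rng = random.Random(0)
theta, words = generate_document(TOY_TOPICS, alpha=[0.5, 0.5], n_words=8, rng=rng)
print("topic proportions:", theta)
print("words:", words)
```

In this picture, the per-document vector `theta` is a point on the topic simplex, which is the space in which the abstract's pairwise document similarity measure is defined.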

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Biological and Environmental Research (BER). Biological Systems Science Division

Grant/Contract Number: AC02-05CH11231

OSTI ID: 1626320

Journal Information: BMC Bioinformatics, Vol. 7, Issue 1; ISSN 1471-2105

Publisher: BioMed Central

Country of Publication: United States

Language: English

Related Subjects

59 BASIC BIOLOGICAL SCIENCES
97 MATHEMATICS AND COMPUTING
Biochemistry & Molecular Biology
Biotechnology & Applied Microbiology
Mathematical & Computational Biology