U.S. Department of Energy
Office of Scientific and Technical Information

Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks

Journal Article · GenomeBiology.com
 [1];  [2];  [3];  [2];  [2];  [4];  [2];  [5]
  1. Indiana University, Bloomington, IN (United States); Instituto Gulbenkian de Ciência, Oeiras (Portugal); DOE/OSTI
  2. Indiana University, Bloomington, IN (United States)
  3. Universidad Nacional del Sur, Buenos Aires (Argentina)
  4. Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
  5. Indiana University, Bloomington, IN (United States); Instituto Gulbenkian de Ciência, Oeiras (Portugal)

We participated in three protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (interaction article subtask [IAS]), discovery of protein pairs (interaction pair subtask [IPS]), and identification of text passages characterizing protein interaction (interaction sentences subtask [ISS]) in full-text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam detection techniques, together with an uncertainty-based integration scheme. For comparison, we also applied a support vector machine and singular value decomposition to the same features. Our approach to the full-text subtasks (protein pair and passage identification) includes a feature expansion method based on word proximity networks. On the abstract classification task (IAS), our approach was among the top submissions by the performance measures used in the challenge evaluation (accuracy, F-score, and area under the receiver operating characteristic curve). We also report on a web tool built with this approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). On the full-text tasks, our approach achieved one of the highest recall rates and one of the highest mean reciprocal ranks of correct passages. The abstract classification results show that a simple linear model using relatively few features can generalize and uncover the conceptual nature of protein-protein interactions from the bibliome. Because the model is lightweight and linear, it can easily be ported and applied to similar problems. For full-text problems, expanding word features with word proximity networks is shown to be useful, although the need for some improvements is discussed.
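The abstract does not give the form of the linear model, so the following is only a minimal sketch of the general idea of a spam-style linear classifier over a small word feature set, not the authors' method. The scoring rule (difference in document frequency between relevant and irrelevant abstracts), the function names, the `top_k` and `threshold` parameters, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; returns the set of distinct words."""
    return set(text.lower().split())


def train_word_weights(positive_docs, negative_docs, top_k=7):
    """Weight each word by how much more often it appears in relevant
    (positive) documents than in irrelevant (negative) ones, then keep
    only the top_k words with the largest absolute weight.
    This rule is an illustrative assumption, not the paper's model."""
    pos_df = Counter(w for doc in positive_docs for w in tokenize(doc))
    neg_df = Counter(w for doc in negative_docs for w in tokenize(doc))
    vocab = set(pos_df) | set(neg_df)
    weights = {
        w: pos_df[w] / len(positive_docs) - neg_df[w] / len(negative_docs)
        for w in vocab
    }
    # Sort by descending absolute weight; ties broken alphabetically.
    ranked = sorted(weights, key=lambda w: (-abs(weights[w]), w))
    return {w: weights[w] for w in ranked[:top_k]}


def score(text, weights):
    """Linear score: sum of the weights of the selected words present."""
    return sum(weights.get(w, 0.0) for w in tokenize(text))


def is_relevant(text, weights, threshold=0.0):
    """Classify a text as relevant when its linear score exceeds threshold."""
    return score(text, weights) > threshold


# Toy training data (hypothetical, for illustration only).
POSITIVE = [
    "protein a binds protein b",
    "kinase interacts with receptor",
    "complex binds receptor",
]
NEGATIVE = [
    "gene expression in liver tissue",
    "patients with liver disease",
    "expression of gene in patients",
]

WEIGHTS = train_word_weights(POSITIVE, NEGATIVE)
```

With these toy weights, `is_relevant("the enzyme binds its receptor", WEIGHTS)` evaluates to true and `is_relevant("liver gene expression study", WEIGHTS)` to false. Keeping only a handful of weighted words mirrors the abstract's point that relatively few features can suffice, though a real system would need proper tokenization, far more training data, and a tuned threshold.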

Research Organization:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Biological and Environmental Research (BER)
Grant/Contract Number:
AC52-06NA25396
OSTI ID:
1626732
Journal Information:
Journal Name: GenomeBiology.com; Journal Volume: 9; Journal Issue: Suppl 2; ISSN 1465-6906
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
