U.S. Department of Energy
Office of Scientific and Technical Information

The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

Journal Article · BMC Bioinformatics
 [1];  [2];  [2];  [3];  [4];  [2]
  1. US Army Medical Research and Materiel Command, Ft. Detrick, MD (United States). Telemedicine and Advanced Technology Research Center. Biotechnology HPC Software Applications Inst.; DOE/OSTI
  2. US Army Medical Research and Materiel Command, Ft. Detrick, MD (United States). Telemedicine and Advanced Technology Research Center. Biotechnology HPC Software Applications Inst.
  3. George Mason Univ., Fairfax, VA (United States)
  4. Argonne National Lab. (ANL), Argonne, IL (United States). Biosciences Division
Background: Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. Many programs and databases exist for inferring different protein functions, and a pipeline that properly integrates these resources can benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities.
Results: PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm, which reconciles annotations from the integrated programs and databases. PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers and is integrated with PIPA. Validation tests show that CatFam yields average recall and precision greater than 95.0%. We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it yields 40.0% more mappings from the Clusters of Orthologous Groups (COG) to EC numbers and a six-fold increase in mappings from COG to GO terms. The mappings to EC numbers show very high precision (99.8%) and recall (96.6%), while the mappings to GO terms show moderate precision (80.0%) and low recall (33.0%). Our consensus algorithm for GO annotation is based on the computation and propagation of likelihood scores associated with GO terms. The test results suggest that, for a given recall, applying the consensus algorithm yields higher precision than not using it.
Conclusion: The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources.
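The abstract does not give the details of the association rule mining step, so the following is only a minimal sketch of the general idea, not PIPA's implementation. It assumes a simple support and confidence rule over co-annotated sample proteins. The function names (`mine_mappings`, `expand_with_ancestors`), the thresholds, the `COG_A`/`COG_B` labels, and the toy annotations are all hypothetical and chosen for illustration. The sketch shows how adding parent terms from a hierarchy can let a mapping pass the thresholds when no single leaf term does.

```python
from collections import Counter
from itertools import product


def mine_mappings(annotations, min_support=2, min_confidence=0.8):
    """Return {(source_term, target_term): confidence} for rules that pass.

    annotations: iterable of (source_terms, target_terms), one pair per
    sample protein. Thresholds are illustrative assumptions.
    """
    source_counts = Counter()
    pair_counts = Counter()
    for source_terms, target_terms in annotations:
        source_terms, target_terms = set(source_terms), set(target_terms)
        source_counts.update(source_terms)
        pair_counts.update(product(source_terms, target_terms))

    rules = {}
    for (src, tgt), support in pair_counts.items():
        # Confidence: fraction of proteins with src that also carry tgt.
        confidence = support / source_counts[src]
        if support >= min_support and confidence >= min_confidence:
            rules[(src, tgt)] = confidence
    return rules


def expand_with_ancestors(terms, parents):
    """Add every ancestor of each term, given a child -> parents dict."""
    expanded = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in expanded:
            expanded.add(term)
            stack.extend(parents.get(term, ()))
    return expanded


# Toy data (made up): each protein has one source label and one EC number.
toy = [
    ({"COG_A"}, {"1.1.1.1"}),
    ({"COG_A"}, {"1.1.1.1"}),
    ({"COG_B"}, {"2.7.1.1"}),
    ({"COG_B"}, {"2.7.1.2"}),
]
# One level of the EC hierarchy: four-digit numbers roll up to a sub-subclass.
ec_parents = {
    "1.1.1.1": ["1.1.1.-"],
    "2.7.1.1": ["2.7.1.-"],
    "2.7.1.2": ["2.7.1.-"],
}

flat = mine_mappings(toy)
hier = mine_mappings(
    [(src, expand_with_ancestors(tgt, ec_parents)) for src, tgt in toy]
)
print(sorted(flat))
print(sorted(hier))
```

In this toy case the flat run finds a mapping only for `COG_A`, because the two `COG_B` proteins carry different four-digit EC numbers. After ancestor expansion, `COG_B` maps to the shared parent term, which is one plausible way that hierarchy information could increase the number of generated mappings.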
Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
OSTI ID:
1626362
Journal Information:
BMC Bioinformatics, Vol. 9, Issue 1; ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
