U.S. Department of Energy
Office of Scientific and Technical Information

Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms

Journal Article · Computational Biology and Chemistry, 31(2):138-142
Motivation: At the center of bioinformatics, genomics, and proteomics is the need for highly accurate genome annotations. Producing high-quality, reliable annotations depends on identifying sequences that are evolutionarily related (homologs) and from which function can be inferred. Homology detection is one of the oldest tasks in bioinformatics; however, most approaches still fail when presented with sequences that have low residue similarity despite a distant evolutionary relationship (remote homology). Recently, discriminative approaches such as support vector machines (SVMs) have demonstrated a vast improvement in sensitivity for remote homology detection. These methods, however, have focused on only one aspect of the sequence at a time, e.g., sequence similarity or motif-based scores. Supplementary information, such as the subcellular location of a protein within the cell, would give further clues as to possible homologous pairs and would additionally eliminate false relationships due to simple functional roles that cannot exist because of location. We have developed a method, SVM-SimLoc, that integrates subcellular location with sequence similarity information into a protein family classifier, and we compared it to one of the most accurate sequence-based SVM approaches, SVM-Pairwise.

Results: The SCOP 1.53 benchmark data set was used to assess the performance of SVM-SimLoc. Because cellular location prediction depends on the type of sequence, eukaryotic or prokaryotic, the analysis is restricted to the 2630 eukaryotic sequences in the benchmark data set, evaluating a total of 27 protein families. We demonstrate that integrating sequence similarity and subcellular location yields notably more accurate results than using sequence similarity alone, at a significance level of 0.006.
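The idea of combining the two information sources can be illustrated with a minimal sketch. The abstract does not say how SVM-SimLoc encodes location or combines it with similarity scores, so everything here is an assumption for illustration: the location vocabulary, the one-hot encoding, the concatenation step, and the function names `one_hot_location` and `combined_features` are all hypothetical.

```python
from typing import List, Sequence

# Hypothetical location vocabulary; the abstract does not list the classes used.
LOCATIONS = ["cytoplasm", "nucleus", "mitochondrion", "extracellular"]


def one_hot_location(location: str,
                     vocabulary: Sequence[str] = LOCATIONS) -> List[float]:
    """Encode a predicted subcellular location as a one-hot vector."""
    if location not in vocabulary:
        raise ValueError(f"unknown location: {location!r}")
    return [1.0 if location == name else 0.0 for name in vocabulary]


def combined_features(similarity_scores: Sequence[float],
                      location: str,
                      vocabulary: Sequence[str] = LOCATIONS) -> List[float]:
    """Append the location encoding to a vector of sequence similarity scores.

    Assumption: similarity_scores holds one score per reference sequence,
    in the spirit of a pairwise-similarity representation.
    """
    return list(similarity_scores) + one_hot_location(location, vocabulary)


# Three made-up similarity scores plus a made-up predicted location.
features = combined_features([0.82, 0.10, 0.05], "nucleus")
print(len(features))
```

A vector built this way could be passed to any standard SVM implementation as one training or test example per protein. Whether the published method uses concatenation, a combined kernel, or some other scheme is not stated in this record.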
Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (US)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
903458
Report Number(s):
PNNL-SA-48402
Journal Information:
Journal Name: Computational Biology and Chemistry; Vol. 31; Journal Issue: 2; pp. 138-142
Country of Publication:
United States
Language:
English
