Title: Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms

Abstract

Motivation: At the center of bioinformatics, genomics, and proteomics is the need for highly accurate genome annotations. Producing high-quality, reliable annotations depends on identifying evolutionarily related sequences (homologs) from which to infer function. Homology detection is one of the oldest tasks in bioinformatics; however, most approaches still fail when presented with sequences that have low residue similarity despite a distant evolutionary relationship (remote homology). Recently, discriminative approaches such as support vector machines (SVMs) have demonstrated a vast improvement in sensitivity for remote homology detection. These methods, however, have focused on only one aspect of the sequence at a time, e.g., sequence similarity or motif-based scores. Supplementary information, such as the sub-cellular location of a protein within the cell, would give further clues as to possible homologous pairs, additionally eliminating false relationships due to simple functional roles that cannot exist due to location. We have developed a method, SVM-SimLoc, that integrates sub-cellular location with sequence similarity information into a protein family classifier, and compared it to one of the most accurate sequence-based SVM approaches, SVM-Pairwise. Results: The SCOP 1.53 benchmark data set was utilized to assess the performance of SVM-SimLoc. As cellular location prediction is dependent upon the type of sequence, eukaryotic or prokaryotic, the analysis is restricted to the 2630 eukaryotic sequences in the benchmark dataset, evaluating a total of 27 protein families. We demonstrate that the integration of sequence similarity and sub-cellular location yields notably more accurate results than using sequence similarity independently at a significance level of 0.006.

Authors:
Shah, Anuj R.; Oehmen, Chris S.; Harper, Jill K.; Webb-Robertson, Bobbie-Jo M.
Publication Date:
2007-02-23
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
903458
Report Number(s):
PNNL-SA-48402
TRN: US200720%%370
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Journal Article
Resource Relation:
Journal Name: Computational Biology and Chemistry, 31(2):138-142; Journal Volume: 31; Journal Issue: 2
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; BENCHMARKS; DETECTION; FORECASTING; FUNCTIONALS; LEARNING; PERFORMANCE; PROTEINS; RESIDUES; SENSITIVITY; VECTORS

Citation Formats

Shah, Anuj R., Oehmen, Chris S., Harper, Jill K., and Webb-Robertson, Bobbie-Jo M. Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms. United States: N. p., 2007. Web. doi:10.1016/j.compbiolchem.2007.02.012.
Shah, Anuj R., Oehmen, Chris S., Harper, Jill K., & Webb-Robertson, Bobbie-Jo M. Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms. United States. doi:10.1016/j.compbiolchem.2007.02.012.
Shah, Anuj R., Oehmen, Chris S., Harper, Jill K., and Webb-Robertson, Bobbie-Jo M. 2007. "Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms". United States. doi:10.1016/j.compbiolchem.2007.02.012.
@article{osti_903458,
title = {Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms},
author = {Shah, Anuj R. and Oehmen, Chris S. and Harper, Jill K. and Webb-Robertson, Bobbie-Jo M.},
abstractNote = {Motivation: At the center of bioinformatics, genomics, and proteomics is the need for highly accurate genome annotations. Producing high-quality, reliable annotations depends on identifying evolutionarily related sequences (homologs) from which to infer function. Homology detection is one of the oldest tasks in bioinformatics; however, most approaches still fail when presented with sequences that have low residue similarity despite a distant evolutionary relationship (remote homology). Recently, discriminative approaches such as support vector machines (SVMs) have demonstrated a vast improvement in sensitivity for remote homology detection. These methods, however, have focused on only one aspect of the sequence at a time, e.g., sequence similarity or motif-based scores. Supplementary information, such as the sub-cellular location of a protein within the cell, would give further clues as to possible homologous pairs, additionally eliminating false relationships due to simple functional roles that cannot exist due to location. We have developed a method, SVM-SimLoc, that integrates sub-cellular location with sequence similarity information into a protein family classifier, and compared it to one of the most accurate sequence-based SVM approaches, SVM-Pairwise. Results: The SCOP 1.53 benchmark data set was utilized to assess the performance of SVM-SimLoc. As cellular location prediction is dependent upon the type of sequence, eukaryotic or prokaryotic, the analysis is restricted to the 2630 eukaryotic sequences in the benchmark dataset, evaluating a total of 27 protein families. We demonstrate that the integration of sequence similarity and sub-cellular location yields notably more accurate results than using sequence similarity independently at a significance level of 0.006.},
doi = {10.1016/j.compbiolchem.2007.02.012},
journal = {Computational Biology and Chemistry},
pages = {138--142},
number = 2,
volume = 31,
place = {United States},
year = {2007},
month = feb
}