
Title: Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms

Abstract

Motivation: At the center of bioinformatics, genomics, and proteomics is the need for highly accurate genome annotations. Producing high-quality, reliable annotations depends on identifying sequences that are related evolutionarily (homologs) from which to infer function. Homology detection is one of the oldest tasks in bioinformatics; however, most approaches still fail when presented with sequences that have low residue similarity despite a distant evolutionary relationship (remote homology). Recently, discriminative approaches, such as support vector machines (SVMs), have demonstrated a vast improvement in sensitivity for remote homology detection. These methods, however, have focused on only one aspect of the sequence at a time, e.g., sequence similarity or motif-based scores. Supplementary information, such as the sub-cellular location of a protein within the cell, would give further clues as to possible homologous pairs, additionally eliminating false relationships due to simple functional roles that cannot exist due to location. We have developed a method, SVM-SimLoc, that integrates sub-cellular location with sequence similarity information into a protein family classifier, and compared it to one of the most accurate sequence-based SVM approaches, SVM-Pairwise. Results: The SCOP 1.53 benchmark data set was utilized to assess the performance of SVM-SimLoc. As cellular location prediction is dependent upon the type of sequence, eukaryotic or prokaryotic, the analysis is restricted to the 2630 eukaryotic sequences in the benchmark dataset, evaluating a total of 27 protein families. We demonstrate that the integration of sequence similarity and sub-cellular location yields notably more accurate results than using sequence similarity independently at a significance level of 0.006.

Authors:
Shah, Anuj R.; Oehmen, Chris S.; Harper, Jill K.; Webb-Robertson, Bobbie-Jo M.
Publication Date:
2007-02-23
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
903458
Report Number(s):
PNNL-SA-48402
TRN: US200720%%370
DOE Contract Number:
AC05-76RL01830
Resource Type:
Journal Article
Resource Relation:
Journal Name: Computational Biology and Chemistry, 31(2):138-142; Journal Volume: 31; Journal Issue: 2
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; BENCHMARKS; DETECTION; FORECASTING; FUNCTIONALS; LEARNING; PERFORMANCE; PROTEINS; RESIDUES; SENSITIVITY; VECTORS

Citation Formats

Shah, Anuj R., Oehmen, Chris S., Harper, Jill K., and Webb-Robertson, Bobbie-Jo M. Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms. United States: N. p., 2007. Web. doi:10.1016/j.compbiolchem.2007.02.012.
Shah, Anuj R., Oehmen, Chris S., Harper, Jill K., & Webb-Robertson, Bobbie-Jo M. Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms. United States. doi:10.1016/j.compbiolchem.2007.02.012.
Shah, Anuj R., Oehmen, Chris S., Harper, Jill K., and Webb-Robertson, Bobbie-Jo M. 2007. "Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms". United States. doi:10.1016/j.compbiolchem.2007.02.012.
@article{osti_903458,
title = {Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms},
author = {Shah, Anuj R. and Oehmen, Chris S. and Harper, Jill K. and Webb-Robertson, Bobbie-Jo M.},
abstractNote = {Motivation: At the center of bioinformatics, genomics, and proteomics is the need for highly accurate genome annotations. Producing high-quality, reliable annotations depends on identifying sequences that are related evolutionarily (homologs) from which to infer function. Homology detection is one of the oldest tasks in bioinformatics; however, most approaches still fail when presented with sequences that have low residue similarity despite a distant evolutionary relationship (remote homology). Recently, discriminative approaches, such as support vector machines (SVMs), have demonstrated a vast improvement in sensitivity for remote homology detection. These methods, however, have focused on only one aspect of the sequence at a time, e.g., sequence similarity or motif-based scores. Supplementary information, such as the sub-cellular location of a protein within the cell, would give further clues as to possible homologous pairs, additionally eliminating false relationships due to simple functional roles that cannot exist due to location. We have developed a method, SVM-SimLoc, that integrates sub-cellular location with sequence similarity information into a protein family classifier, and compared it to one of the most accurate sequence-based SVM approaches, SVM-Pairwise. Results: The SCOP 1.53 benchmark data set was utilized to assess the performance of SVM-SimLoc. As cellular location prediction is dependent upon the type of sequence, eukaryotic or prokaryotic, the analysis is restricted to the 2630 eukaryotic sequences in the benchmark dataset, evaluating a total of 27 protein families. We demonstrate that the integration of sequence similarity and sub-cellular location yields notably more accurate results than using sequence similarity independently at a significance level of 0.006.},
doi = {10.1016/j.compbiolchem.2007.02.012},
journal = {Computational Biology and Chemistry},
pages = {138--142},
number = 2,
volume = 31,
place = {United States},
year = {2007},
month = feb
}
  • Motivation: As the amount of biological sequence data continues to grow exponentially, we face the increasing challenge of assigning function to this enormous molecular ‘parts list’. The most popular approaches to this challenge make use of the simplifying assumption that similar functional molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fail to identify remote homologs (proteins with similar function but dissimilar sequence), which are often a significant fraction of the total homolog collection for a given sequence. We introduce a Support Vector Machine (SVM)-based tool to detect Homology Using Semisupervised iTerative LEarning (SVM-HUSTLE) that detects significantly more remote homologs than current state-of-the-art sequence or cluster-based methods. As opposed to building profiles or position-specific scoring matrices, SVM-HUSTLE builds an SVM classifier for a query sequence by training on a collection of representative high-confidence training sets. SVM-HUSTLE combines principles of semi-supervised learning theory with statistical sampling to create many concurrent classifiers to iteratively detect and refine on-the-fly patterns indicating homology. Results: When compared against existing methods for identifying protein homologs (BLASTp, PSI-BLAST, RANKPROP, MOTIFPROP and their variants) on the SCOP 1.59 benchmark dataset consisting of 7329 protein sequences, SVM-HUSTLE significantly outperforms each of the above methods using the most stringent ROC1 statistic with p-values less than 1e-20.
  • Highlights: ABCD proteins are classified by whether or not they have an NH₂-terminal hydrophobic segment. ABCD proteins with the segment are targeted to peroxisomes. ABCD proteins without the segment are targeted to the endoplasmic reticulum. The role of the segment in organelle targeting is conserved in eukaryotic organisms. Abstract: In mammals, four ATP-binding cassette (ABC) proteins belonging to subfamily D have been identified. ABCD1–3 possess the NH₂-terminal hydrophobic region and are targeted to peroxisomes, while ABCD4, which lacks the region, is targeted to the endoplasmic reticulum (ER). Based on hydropathy plot analysis, we found that several eukaryotes have ABCD protein homologs lacking the NH₂-terminal hydrophobic segment (H0 motif). To investigate whether the role of the NH₂-terminal H0 motif in subcellular localization is conserved across species, we expressed ABCD proteins from several species (metazoan, plant, and fungal) in fusion with GFP in CHO cells and examined their subcellular localization. ABCD proteins possessing the NH₂-terminal H0 motif were localized to peroxisomes, while ABCD proteins lacking this region lost this capacity. In addition, deletion of the NH₂-terminal H0 motif of ABCD proteins resulted in their localization to the ER. These results suggest that the role of the NH₂-terminal H0 motif in organelle targeting is widely conserved in living organisms.
  • Many receptors involved in clathrin-mediated protein transport through the endocytic and secretory pathways of yeast and animal cells share common features. They are all type I integral membrane proteins containing cysteine-rich lumenal domains and cytoplasmic tails with tyrosine-containing sorting signals. The cysteine-rich domains are thought to be involved in ligand binding, whereas the cytoplasmic tyrosine motifs interact with clathrin-associated adaptor proteins during protein sorting along these pathways. In addition, tyrosine-containing signals are required for the retention and recycling of some of these membrane proteins to the trans-Golgi network. Here we report the characterization of an approximately 80-kD epidermal growth factor receptor-like type I integral membrane protein containing all of these functional motifs from Arabidopsis thaliana (called AtELP for A. thaliana Epidermal growth factor receptor-Like Protein). Biochemical analysis indicates that AtELP is a membrane protein found at high levels in the roots of both monocots and dicots. Subcellular fractionation studies indicate that the AtELP protein is present in two membrane fractions corresponding to a novel, undefined compartment and a fraction enriched in vesicles containing clathrin and its associated adaptor proteins. AtELP may therefore serve as a marker for compartments involved in intracellular protein trafficking in the plant cell. 87 refs., 7 figs.
  • This study investigates how machine learning methods can be used to improve hydraulic head predictions by integrating different types of data, including data from numerical models, in a hierarchical approach. A suite of four machine learning methods (decision trees, instance-based weighting, inverse distance weighting, and neural networks) is tested in several hierarchical configurations with different types of data from the 317/319 area at Argonne National Laboratory-East. The best machine learning model had a mean predicted head error 50% smaller than an existing MODFLOW numerical flow model, and a standard deviation of predicted head error 67% lower than the MODFLOW model, computed across all sampled locations used for calibrating the MODFLOW model. These predictions were obtained using decision trees trained with all historical quarterly data; the hourly head measurements were not as useful for prediction, most likely because of their poor spatial coverage. The results show promise for using hierarchical machine learning approaches to improve predictions and to identify the most essential types of data to guide future sampling efforts. Decision trees were also combined with an existing MODFLOW model to test their capabilities for updating numerical models to improve predictions as new data are collected. The combined model had a mean error 50% lower than the MODFLOW model alone. These results demonstrate that hierarchical machine learning approaches can be used to improve predictive performance of existing numerical models in areas with good data coverage. Further research is needed to compare this approach with methods such as Kalman filtering.
  • Using biopolymer sequence comparison methods to identify evolutionarily related proteins is one of the most common tasks in bioinformatics. Recently, support vector machines (SVMs) utilizing statistical learning theory have been employed in the problem of remote homology detection and shown to outperform iterative profile methods such as PSI-BLAST. In this study we demonstrate that utilizing a Bayesian alignment score, which accounts for the uncertainty of all possible alignments, in the SVM construction improves sensitivity compared to the traditional dynamic programming implementation.