U.S. Department of Energy
Office of Scientific and Technical Information

SVM-Hustle - An iterative semi-supervised machine learning approach for pairwise protein remote homology detection

Journal Article · Bioinformatics, 24(6):783-790
Motivation: As the amount of biological sequence data continues to grow exponentially, we face the increasing challenge of assigning function to this enormous molecular ‘parts list’. The most popular approaches to this challenge rely on the simplifying assumption that functionally similar molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fail to identify remote homologs (proteins with similar function but dissimilar sequence), which often make up a significant fraction of the total homolog collection for a given sequence. We introduce a Support Vector Machine (SVM)-based tool to detect Homology Using Semisupervised iTerative LEarning (SVM-HUSTLE) that detects significantly more remote homologs than current state-of-the-art sequence or cluster-based methods. As opposed to building profiles or position-specific scoring matrices, SVM-HUSTLE builds an SVM classifier for a query sequence by training on a collection of representative high-confidence training sets. SVM-HUSTLE combines principles of semi-supervised learning theory with statistical sampling to create many concurrent classifiers that iteratively detect and refine, on the fly, patterns indicating homology.
Results: When compared against existing methods for identifying protein homologs (BLASTp, PSI-BLAST, RANKPROP, MOTIFPROP and their variants) on the SCOP 1.59 benchmark dataset consisting of 7329 protein sequences, SVM-HUSTLE significantly outperforms each of these methods on the most stringent ROC1 statistic, with p-values less than 1e-20.
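The ROC1 statistic cited in the Results can be illustrated with a short sketch. It assumes the commonly used ROC-n definition: rank all hits by score, and for each of the first n false positives count the true homologs ranked above it, then normalize by n times the number of true homologs. For n = 1 this is the fraction of true homologs retrieved before the first false positive. The function name `roc_n`, its parameters, and the toy scores are hypothetical and are not taken from the article or from the SVM-HUSTLE software.

```python
def roc_n(scores, labels, n=1):
    """Sketch of the ROC-n score under the assumed standard definition.

    scores: one score per database hit (higher = more confident homolog).
    labels: truthy for true homologs, falsy for non-homologs.
    n:      number of false positives at which the ROC curve is truncated.
    """
    total_pos = sum(1 for y in labels if y)
    if total_pos == 0:
        raise ValueError("need at least one true homolog")

    # Rank hits from highest to lowest score (ties keep their input order).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    tp_seen = 0  # true homologs ranked so far
    fp_seen = 0  # false positives ranked so far
    area = 0     # sum of true-positive counts at each false positive

    for _, is_homolog in ranked:
        if is_homolog:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen
            if fp_seen == n:
                break

    # If the list holds fewer than n false positives, treat the missing ones
    # as ranked after every true homolog (tp_seen equals total_pos here).
    area += (n - fp_seen) * tp_seen
    return area / (n * total_pos)


# Toy ranking (made-up values): two homologs outrank the first false positive,
# and three homologs exist in total.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
roc1 = roc_n(toy_scores, toy_labels, n=1)  # 2 of 3 homologs precede the first false positive
```

A score of 1.0 means every true homolog outranks the first false positive, which is why ROC1 is regarded as a stringent measure: a single high-scoring false positive caps the score.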
Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (US), Environmental Molecular Sciences Laboratory (EMSL)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
985035
Report Number(s):
PNNL-SA-56589; 20905; KJ0101030
Journal Information:
Journal Name: Bioinformatics; Journal Volume: 24; Journal Issue: 6; Pages: 783-790; ISSN 1460-2059; ISSN 1367-4803
Country of Publication:
United States
Language:
English

Similar Records

SVM-BALSA: Remote Homology Detection based on Bayesian Sequence Alignment
Journal Article · Wed Nov 09 23:00:00 EST 2005 · Computational Biology and Chemistry, 29(6):440-3 · OSTI ID:878675

Physicochemical property distributions for accurate and rapid pairwise protein homology detection
Journal Article · Thu Mar 18 20:00:00 EDT 2010 · BMC Bioinformatics · OSTI ID:1626267

Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms
Journal Article · Thu Feb 22 23:00:00 EST 2007 · Computational Biology and Chemistry, 31(2):138-142 · OSTI ID:903458