Title: SVM-Hustle - An iterative semi-supervised machine learning approach for pairwise protein remote homology detection

Journal Article · Bioinformatics, 24(6):783-790

Motivation: As the amount of biological sequence data continues to grow exponentially, we face the increasing challenge of assigning function to this enormous molecular ‘parts list’. The most popular approaches to this challenge make use of the simplifying assumption that similar functional molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fail to identify remote homologs (proteins with similar function but dissimilar sequence), which are often a significant fraction of the total homolog collection for a given sequence. We introduce a Support Vector Machine (SVM)-based tool to detect Homology Using Semisupervised iTerative LEarning (SVM-HUSTLE) that detects significantly more remote homologs than current state-of-the-art sequence or cluster-based methods. As opposed to building profiles or position-specific scoring matrices, SVM-HUSTLE builds an SVM classifier for a query sequence by training on a collection of representative high-confidence training sets. SVM-HUSTLE combines principles of semi-supervised learning theory with statistical sampling to create many concurrent classifiers that iteratively detect and refine, on the fly, patterns indicating homology.

Results: When compared against existing methods for identifying protein homologs (BLASTp, PSI-BLAST, RANKPROP, MOTIFPROP and their variants) on the SCOP 1.59 benchmark dataset consisting of 7329 protein sequences, SVM-HUSTLE significantly outperforms each of the above methods on the most stringent ROC1 statistic, with p-values less than 1e-20.
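The ROC1 statistic cited in the Results can be illustrated with a short sketch. It assumes the commonly used ROC-n definition: the area under the ROC curve up to the n-th false positive, normalized to lie between 0 and 1, so that ROC1 is the fraction of true homologs ranked ahead of the first false positive. The function name `roc_n`, its input format, and the handling of rankings with fewer than n false positives are illustrative assumptions, not details taken from the article.

```python
def roc_n(ranked_labels, n=1):
    """Sketch of a ROC-n score for one query (assumed definition, see lead-in).

    ranked_labels: iterable of booleans ordered from best to worst score;
                   True marks a true homolog, False a non-homolog.
    n:             number of false positives at which the curve is truncated.
    """
    labels = list(ranked_labels)
    total_pos = sum(1 for is_pos in labels if is_pos)
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if total_pos == 0:
        raise ValueError("at least one true homolog is required")

    tp_seen = 0   # true positives encountered so far
    fp_seen = 0   # false positives encountered so far
    area = 0      # sum of true-positive counts at each false positive

    for is_pos in labels:
        if is_pos:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen
            if fp_seen == n:
                break

    # Assumption: if the ranking holds fewer than n false positives,
    # the missing ones are treated as ranked below every listed hit.
    if fp_seen < n:
        area += (n - fp_seen) * tp_seen

    return area / (n * total_pos)


if __name__ == "__main__":
    # Two of three true homologs rank above the first false positive.
    print(roc_n([True, True, False, True], n=1))
```

Under this definition a method scores 1.0 on a query only if every true homolog outranks every non-homolog, which is why ROC1 is regarded as a stringent measure.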

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States). Environmental Molecular Sciences Lab. (EMSL)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
985035
Report Number(s):
PNNL-SA-56589; ISSN 1460-2059; 20905; KJ0101030; TRN: US201016%%1728
Journal Information:
Bioinformatics, 24(6):783-790, Vol. 24, Issue 6; ISSN 1367-4803
Country of Publication:
United States
Language:
English