SciTech Connect

Title: FASTERp: A Feature Array Search Tool for Estimating Resemblance of Protein Sequences

Metagenome sequencing efforts have provided a pool of billions of genes for identifying enzymes with desirable biochemical traits. However, homology search against billions of genes in a rapidly growing database has become computationally impractical. Here we present our pilot efforts to develop a novel alignment-free algorithm for homology search. Specifically, we represent each protein as a feature vector that denotes the presence or absence of short k-mers in the protein sequence. Similarity between feature vectors is then computed using the Tanimoto score, a similarity measure that can be rapidly computed on bit-string representations of feature vectors. Preliminary results indicate good correlation with optimal alignment algorithms (Spearman r of 0.87, ~1,000,000 proteins from Pfam), as well as with heuristic algorithms such as BLAST (Spearman r of 0.86, ~1,000,000 proteins). Furthermore, a prototype of FASTERp implemented in Python runs approximately four times faster than BLAST on a small-scale dataset (~1,000 proteins). We are optimizing and scaling FASTERp to enable rapid homology searches against billion-protein databases, thereby supporting more comprehensive gene annotation efforts.
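The k-mer presence/absence representation and the Tanimoto score described in the abstract can be illustrated with a minimal Python sketch. This is not the FASTERp implementation: the abstract does not state the k-mer length, the alphabet handling, or the bit-string layout, so the 20-letter amino acid alphabet, the choice of k = 2 in the usage example, the use of Python integers as bit strings, and the function names `kmer_index`, `feature_bits`, and `tanimoto` are all assumptions made for illustration.

```python
from itertools import product

# Assumption: the 20 standard amino acids; the abstract does not specify the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_index(k):
    """Map every possible k-mer over the alphabet to a bit position."""
    return {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}


def feature_bits(seq, k, index):
    """Encode a protein as a bit string (a Python int): bit i is set
    if the k-mer with index i occurs anywhere in the sequence."""
    bits = 0
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # skip k-mers containing non-standard residues
            bits |= 1 << pos
    return bits


def tanimoto(a, b):
    """Tanimoto score of two bit strings: |a AND b| / |a OR b|."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # neither sequence has any k-mer; define the score as 0
    return bin(a & b).count("1") / union


# Usage (k = 2 is a hypothetical choice, small enough to check by hand).
index = kmer_index(2)
query = feature_bits("ACDE", 2, index)    # k-mers: AC, CD, DE
subject = feature_bits("ACDF", 2, index)  # k-mers: AC, CD, DF
score = tanimoto(query, subject)          # 2 shared / 4 distinct k-mers
```

Because each comparison reduces to a bitwise AND, a bitwise OR, and two population counts, scoring a pair of proteins does not require computing an alignment, which is what makes this kind of measure attractive for large databases.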
Publication Date:
OSTI Identifier:
Report Number(s):
DOE Contract Number:
Resource Type:
Resource Relation:
Conference: 9th Annual JGI User Meeting
Research Org:
Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, CA (US)
Sponsoring Org:
USDOE Office of Science (SC)
Country of Publication:
United States