Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding
The vast expansion of sequence data generated from single organisms and microbiomes has precipitated the need for faster and more sensitive methods to assess evolutionary and functional relationships between proteins. Representing proteins as sets of short peptide sequences (kmers) has been used for rapid, accurate classification of proteins into functional categories; however, this approach employs an exact-match methodology and thus may be limited in terms of sensitivity and coverage. We have previously used similarity groupings, based on the chemical properties of amino acids, to form reduced character sets and recode proteins. This amino acid recoding (AAR) approach simplifies the construction of protein representations in the form of kmer vectors, which can link sequences with distant sequence similarity and provide accurate classification of problematic protein families.
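The recoding idea described above can be illustrated with a minimal sketch. The three-group alphabet, the group labels, and the function names `recode` and `kmer_vector` are assumptions made for this example only; they are not Snekmer's actual recoding schemes or API.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet for illustration only (not Snekmer's scheme):
# each of the 20 standard amino acids is assigned to one of three groups.
GROUPS = {
    "H": "AVLIMFWC",  # hydrophobic
    "P": "STNQYGP",   # polar or small
    "C": "DEKRH",     # charged
}
RECODE = {aa: label for label, members in GROUPS.items() for aa in members}


def recode(seq):
    """Map a protein sequence onto the reduced alphabet.

    Residues outside the 20 standard amino acids are dropped in this sketch.
    """
    return "".join(RECODE[aa] for aa in seq.upper() if aa in RECODE)


def kmer_vector(seq, k=3):
    """Count recoded kmers over a fixed, ordered basis of all possible kmers."""
    recoded = recode(seq)
    counts = Counter(recoded[i:i + k] for i in range(len(recoded) - k + 1))
    basis = ["".join(p) for p in product(sorted(GROUPS), repeat=k)]
    return [counts[kmer] for kmer in basis]


# "MKV" and "LRI" share no exact 3-mer, yet both recode to "HCH",
# so their recoded kmer vectors are identical.
same = kmer_vector("MKV") == kmer_vector("LRI")
```

With three groups and k = 3, every sequence maps to a vector of length 27, compared with 8000 possible exact 3-mers over the full 20-letter alphabet. This is one way to see how recoding can link sequences that an exact-match kmer comparison would treat as unrelated.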
Results: Here, we describe Snekmer, a software tool for recoding proteins into AAR kmer vectors and performing either (i) construction of supervised classification models trained on input protein families or (ii) clustering for de novo determination of protein families. We provide examples of the operation of the tool against a set of nitrogen cycling families originally collected using both standard hidden Markov models and a larger set of proteins from UniProt, and demonstrate that our method accurately differentiates these sequences in both operation modes.
Availability and implementation: Snekmer is written in Python using Snakemake. Code and data used in this article, along with tutorial notebooks, are available at http://github.com/PNNL-CompBio/Snekmer under an open-source BSD-3 license.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
- Research Organization:
- Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Organization:
- Defense Threat Reduction Agency (DTRA); National Science Foundation (NSF); USDOE; USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC), Biological and Environmental Research (BER)
- Grant/Contract Number:
- AC05-76RL01830
- OSTI ID:
- 1924146
- Alternate ID(s):
- OSTI ID: 1969036
- Report Number(s):
- PNNL-SA-169271; vbad005
- Journal Information:
- Journal Name: Bioinformatics Advances; Journal Volume: 3; Journal Issue: 1; ISSN 2635-0041
- Publisher:
- Oxford University Press
- Country of Publication:
- United Kingdom
- Language:
- English