U.S. Department of Energy
Office of Scientific and Technical Information

Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding

Journal Article · Bioinformatics Advances

Abstract

Motivation

The vast expansion of sequence data generated from single organisms and microbiomes has precipitated the need for faster and more sensitive methods to assess evolutionary and functional relationships between proteins. Representing proteins as sets of short peptide sequences (kmers) has been used for rapid, accurate classification of proteins into functional categories; however, this approach employs an exact-match methodology and thus may be limited in terms of sensitivity and coverage. We have previously used similarity groupings, based on the chemical properties of amino acids, to form reduced character sets and recode proteins. This amino acid recoding (AAR) approach simplifies the construction of protein representations in the form of kmer vectors, which can link sequences with distant sequence similarity and provide accurate classification of problematic protein families.
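As a rough illustration of the recoding idea described above, the following sketch maps a protein sequence onto a reduced alphabet and counts kmers over it. The three-group alphabet, the group letters, and the function names are assumptions made for this example only; they are not taken from Snekmer and do not reproduce its alphabets or API.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet (illustration only, not a Snekmer alphabet):
# each of the 20 standard amino acids is assigned to one of three groups.
GROUPS = {
    "h": "AVLIMFWC",  # broadly hydrophobic
    "p": "STNQYGP",   # polar or small
    "c": "DEKRH",     # charged
}
RECODE = {aa: code for code, members in GROUPS.items() for aa in members}


def recode(seq):
    """Rewrite a protein sequence in the reduced alphabet ('x' = unknown residue)."""
    return "".join(RECODE.get(aa, "x") for aa in seq.upper())


def kmer_vector(seq, k=3):
    """Return (basis, counts): every possible reduced kmer and its count in seq."""
    basis = ["".join(p) for p in product(sorted(GROUPS), repeat=k)]
    recoded = recode(seq)
    counts = Counter(recoded[i:i + k] for i in range(len(recoded) - k + 1))
    # Kmers containing 'x' are not in the basis, so they are silently dropped.
    return basis, [counts.get(kmer, 0) for kmer in basis]


basis, vec_a = kmer_vector("MKVLE")
_, vec_b = kmer_vector("IRALD")
```

In this toy case the two peptides share no exact amino acid 3-mer, yet both recode to `hchhc` and therefore yield identical kmer vectors, which is the sense in which recoding can link sequences that exact-match kmers would miss.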

Results

Here, we describe Snekmer, a software tool for recoding proteins into AAR kmer vectors and performing either (i) construction of supervised classification models trained on input protein families or (ii) clustering for de novo determination of protein families. We provide examples of the operation of the tool against a set of nitrogen cycling families originally collected using both standard hidden Markov models and a larger set of proteins from UniProt, and demonstrate that our method accurately differentiates these sequences in both operation modes.
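To make the supervised mode more concrete, here is a minimal sketch of family classification on kmer count vectors using a nearest-centroid rule with cosine similarity. This technique is swapped in for illustration and is not claimed to be the model Snekmer trains; the family names, the function names, and the four-element toy vectors are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit(families):
    """Build one centroid per family from its training kmer vectors."""
    return {name: centroid(vecs) for name, vecs in families.items()}


def predict(model, vector):
    """Assign a vector to the family whose centroid is most similar."""
    return max(model, key=lambda name: cosine(model[name], vector))


# Toy kmer count vectors (invented values, hypothetical family names).
training = {
    "family_A": [[4, 0, 1, 0], [3, 1, 0, 0]],
    "family_B": [[0, 3, 0, 2], [0, 4, 1, 3]],
}
model = fit(training)
label = predict(model, [5, 0, 1, 0])
```

The same similarity function could also drive the unsupervised mode, for example by grouping sequences whose pairwise cosine similarity exceeds a chosen threshold, although the clustering procedure used by the tool itself is not specified in this abstract.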

Availability and implementation

Snekmer is written in Python using Snakemake. Code and data used in this article, along with tutorial notebooks, are available at http://github.com/PNNL-CompBio/Snekmer under an open-source BSD-3 license.

Supplementary information

Supplementary data are available at Bioinformatics Advances online.

Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
Defense Threat Reduction Agency (DTRA); National Science Foundation (NSF); USDOE; USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC), Biological and Environmental Research (BER)
Grant/Contract Number:
AC05-76RL01830
OSTI ID:
1924146
Alternate ID(s):
OSTI ID: 1969036
Report Number(s):
PNNL-SA-169271; vbad005
Journal Information:
Bioinformatics Advances, Vol. 3, Issue 1; ISSN 2635-0041
Publisher:
Oxford University Press
Country of Publication:
United Kingdom
Language:
English
