Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding

Chang, Christine H.; Nelson, William C.; Jerger, Abby; Wright, Aaron T.; Egbert, Robert G.; McDermott, Jason E.; Arighi, Cecilia (ed.)

doi:10.1093/bioadv/vbad005

Journal Article · Thu Feb 02 00:00:00 EST 2023 · Bioinformatics Advances

Abstract

Motivation

The vast expansion of sequence data generated from single organisms and microbiomes has precipitated the need for faster and more sensitive methods to assess evolutionary and functional relationships between proteins. Representing proteins as sets of short peptide sequences (kmers) has been used for rapid, accurate classification of proteins into functional categories; however, this approach employs an exact-match methodology and thus may be limited in terms of sensitivity and coverage. We have previously used similarity groupings, based on the chemical properties of amino acids, to form reduced character sets and recode proteins. This amino acid recoding (AAR) approach simplifies the construction of protein representations in the form of kmer vectors, which can link sequences with distant sequence similarity and provide accurate classification of problematic protein families.
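The recoding idea described above can be illustrated with a minimal sketch. The three-group alphabet, the group letters and the function names `recode` and `kmer_vector` are assumptions made for this example only; they are not taken from Snekmer, which may define its reduced alphabets and vectors differently.

```python
from collections import Counter
from itertools import product

# Hypothetical 3-letter reduced alphabet, for illustration only
# (not one of Snekmer's alphabets).
GROUPS = {
    "H": "AVLIMFWC",  # treated here as hydrophobic
    "P": "STNQYGP",   # treated here as polar or small
    "C": "DEKRH",     # treated here as charged
}
RECODE = {aa: code for code, members in GROUPS.items() for aa in members}


def recode(sequence):
    """Map each residue to its group letter; unknown residues become 'X'."""
    return "".join(RECODE.get(aa, "X") for aa in sequence.upper())


def kmer_vector(sequence, k=3):
    """Count overlapping kmers of the recoded sequence over a fixed basis.

    Kmers containing 'X' are not part of the basis and are ignored.
    """
    recoded = recode(sequence)
    counts = Counter(recoded[i:i + k] for i in range(len(recoded) - k + 1))
    basis = ["".join(p) for p in product(sorted(GROUPS), repeat=k)]
    return basis, [counts.get(kmer, 0) for kmer in basis]
```

Under this toy alphabet, the peptides "MKVL" and "LRIV" share no exact 2-mer, yet both recode to "HCHH" and therefore produce identical kmer vectors. This is the sense in which recoding can link sequences that exact-match kmers would miss.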

Results

Here, we describe Snekmer, a software tool for recoding proteins into AAR kmer vectors and performing either (i) construction of supervised classification models trained on input protein families or (ii) clustering for de novo determination of protein families. We provide examples of the tool's operation against both a set of nitrogen cycling families originally collected using standard hidden Markov models and a larger set of proteins from UniProt, and demonstrate that our method accurately differentiates these sequences in both operation modes.
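The two operation modes can be pictured with a toy sketch on kmer count vectors. The nearest-centroid classifier and the greedy threshold clustering below are simple stand-ins chosen for brevity, and the function names and the `threshold` parameter are hypothetical; the abstract does not specify which models or clustering algorithms Snekmer uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def classify(vector, families):
    """Supervised mode (toy): return the family whose centroid is most similar."""
    centroids = {name: centroid(vecs) for name, vecs in families.items()}
    return max(centroids, key=lambda name: cosine(vector, centroids[name]))


def cluster(vectors, threshold=0.8):
    """Unsupervised mode (toy): greedy clustering.

    Each vector joins the first cluster whose founding member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

In the supervised case the family labels are given and a new vector is assigned to one of them; in the clustering case no labels are supplied and groups emerge from pairwise similarity alone.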

Availability and implementation

Snekmer is written in Python using Snakemake. Code and data used in this article, along with tutorial notebooks, are available at http://github.com/PNNL-CompBio/Snekmer under an open-source BSD-3 license.

Supplementary information

Supplementary data are available at Bioinformatics Advances online.

Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)

Sponsoring Organization: Defense Threat Reduction Agency (DTRA); National Science Foundation (NSF); USDOE; USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC), Biological and Environmental Research (BER)

Grant/Contract Number: AC05-76RL01830

OSTI ID: 1924146

Alternate ID(s): OSTI ID: 1969036

Report Number(s): PNNL-SA-169271; vbad005

Journal Information: Bioinformatics Advances, Vol. 3, Issue 1; ISSN 2635-0041

Publisher: Oxford University Press

Country of Publication: United Kingdom

Language: English

Related Subjects

59 BASIC BIOLOGICAL SCIENCES
functional prediction
machine learning
protein function
sequence analysis
