Title: Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)

Journal Article · DOI: https://doi.org/10.1101/345843 · OSTI ID: 1559145

Abstract: In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or a domain-specific dataset and then applied to a set of unseen sequences. This representation can serve as the input to downstream machine learning tasks in protein bioinformatics. Here, we demonstrate it on two applications: protein motif discovery and protein sequence embedding.

(i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate it in three different settings: (1) a comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based evaluation of the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In the discovery of experimentally verified motifs, DiMotif generally obtained high recall scores while achieving F1 scores comparable with those of the other methods. The high recall suggests that DiMotif can be used to create short-lists of candidate motifs for further experimental investigation. In the classification-based evaluation, the extracted motifs reliably detected integrins, integrin-binding proteins, and biofilm formation-related proteins on a reserved set of sequences, with high F1 scores.

(ii) ProtVecX: we extend the k-mer based protein vector (ProtVec) embedding to a variable-length protein embedding using PPE sub-sequences. We show that the new embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw k-mer features.

Availability: Implementations of our method will be available under the Apache 2 licence at http://llp.berkeley.edu/dimotif and http://llp.berkeley.edu/protvecx .
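The BPE-style idea behind PPE, as the abstract describes it, can be illustrated with a minimal Python sketch. This is not the authors' implementation: it is plain byte-pair encoding applied to amino-acid strings, and all function names (`learn_merges`, `apply_merge`, `segment`, `segment_at_granularities`) are hypothetical. The abstract does not specify the sampling framework, so the last function is only an assumed stand-in that produces several segmentations of one sequence by applying different numbers of learned merges.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` (left to right) with one symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn up to `num_merges` BPE-style merge steps from a list of protein strings."""
    # Start with every sequence split into single residues.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption,
        # chosen only to make the sketch deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        if pair_counts[best] < 2:
            break  # no pair repeats, so further merges would not be "commonly occurring"
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def segment_at_granularities(sequence, merges, merge_counts):
    """Assumed stand-in for multiple segmentations: apply only the first k merges."""
    return {k: segment(sequence, merges[:k]) for k in merge_counts}


if __name__ == "__main__":
    toy_training_set = ["RGDSRGD", "ARGDA", "RGDRGD"]  # toy data, not Swiss-Prot
    merges = learn_merges(toy_training_set, num_merges=5)
    print(merges)                     # [('G', 'D'), ('R', 'GD')]
    print(segment("KRGDSP", merges))  # ['K', 'RGD', 'S', 'P']
    print(segment_at_granularities("KRGDSP", merges, [0, 1, 2]))
```

In this toy run the frequent tripeptide RGD emerges as a single unit after two merges, which shows how merge steps learned on one set of sequences can be replayed on an unseen one. A realistic setup would learn many more merges over a large sequence collection.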

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization: USDOE Office of Science (SC)
DOE Contract Number: AC02-05CH11231
OSTI ID: 1559145
Country of Publication: United States
Language: English
