DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)

Abstract

In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning task in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. DiMotif, in general, obtained high recall scores, while having a comparable F1 score with other methods in the discovery of experimentally verified motifs. Having high recall suggests that DiMotif can be used for short-list creation for further experimental investigations on motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores.
(ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features.
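
The BPE-style segmentation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`learn_merges`, `apply_merge`, `segment`, `sample_segmentations`) and the toy sequences are hypothetical, and drawing the number of merge steps uniformly at random is only an assumed stand-in for the paper's sampling framework, whose details the abstract does not give.

```python
import random
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of BPE-style merge operations from training sequences."""
    corpus = [list(seq) for seq in sequences]  # start from single amino acids
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties are broken lexicographically
        # so that the result is deterministic.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def segment(sequence, merges, steps=None):
    """Segment an unseen sequence by replaying the first `steps` learned merges."""
    tokens = list(sequence)
    for pair in merges[:steps]:
        tokens = apply_merge(tokens, pair)
    return tokens


def sample_segmentations(sequence, merges, n_samples, rng):
    """Assumed sampling scheme: draw the number of merge steps uniformly at random."""
    return [
        segment(sequence, merges, rng.randint(0, len(merges)))
        for _ in range(n_samples)
    ]


# Toy training set (hypothetical); real use would train on e.g. Swiss-Prot.
toy_merges = learn_merges(["RGDRGD", "ARGDK", "KRGD"], num_merges=2)
print(toy_merges)                           # [('G', 'D'), ('R', 'GD')]
print(segment("MRGDS", toy_merges))         # ['M', 'RGD', 'S']
print(segment("MRGDS", toy_merges, steps=1))  # ['M', 'R', 'GD', 'S']
```

Replaying fewer merge steps yields a finer segmentation of the same sequence, which is one simple way to obtain multiple segmentations from a single learned merge list.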

Authors:
Asgari, Ehsaneddin [1]; McHardy, Alice C. [2]; Mofrad, Mohammad R. K. [3]
  1. Univ. of California, Berkeley, CA (United States). Molecular Cell Biomechanics Lab., Depts. of Bioengineering and Mechanical Engineering; Helmholtz Centre for Infection Research, Brunswick (Germany)
  2. Helmholtz Centre for Infection Research, Brunswick (Germany)
  3. Univ. of California, Berkeley, CA (United States). Molecular Cell Biomechanics Lab., Depts. of Bioengineering and Mechanical Engineering
Publication Date:
2019-03-05
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1559191
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
Scientific Reports
Additional Journal Information:
Journal Volume: 9; Journal Issue: 1; Journal ID: ISSN 2045-2322
Publisher:
Nature Publishing Group
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES

Citation Formats

Asgari, Ehsaneddin, McHardy, Alice C., and Mofrad, Mohammad R. K. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). United States: N. p., 2019. Web. doi:10.1038/s41598-019-38746-w.
Asgari, Ehsaneddin, McHardy, Alice C., & Mofrad, Mohammad R. K. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). United States. https://doi.org/10.1038/s41598-019-38746-w
Asgari, Ehsaneddin, McHardy, Alice C., and Mofrad, Mohammad R. K. 2019. "Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)". United States. https://doi.org/10.1038/s41598-019-38746-w. https://www.osti.gov/servlets/purl/1559191.
@article{osti_1559191,
title = {Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)},
author = {Asgari, Ehsaneddin and McHardy, Alice C. and Mofrad, Mohammad R. K.},
doi = {10.1038/s41598-019-38746-w},
journal = {Scientific Reports},
number = 1,
volume = 9,
place = {United States},
year = {2019},
month = {3}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 35 works
Citation information provided by
Web of Science
