Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)

Asgari, Ehsaneddin; McHardy, Alice; Mofrad, Mohammad RK

doi:10.1101/345843

Journal Article · Wed Jun 13 00:00:00 EDT 2018

Abstract: In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or over a domain-specific dataset and then applied to a set of unseen sequences. This representation can be used as the input to any downstream machine learning task in protein bioinformatics. In particular, we introduce this representation here through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate it for finding protein motifs in three different settings: (1) a comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based evaluation of the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall scores while having an F1 score comparable to other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short lists of candidate motifs for further experimental investigation. In the classification-based evaluation, the extracted motifs could reliably detect integrins, integrin-binding proteins, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores.
(ii) ProtVecX: we extend the k-mer based protein vector (ProtVec) embedding to a variable-length protein embedding using PPE sub-sequences. We show that the new embedding method can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw k-mer features.
Availability: Implementations of our method will be available under the Apache 2 licence at http://llp.berkeley.edu/dimotif and http://llp.berkeley.edu/protvecx .
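
The BPE-style merging that the abstract describes as the inspiration for PPE can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`learn_merges`, `apply_merge`, `segment`) and the toy sequences are hypothetical, the merge rule is assumed to be plain greedy most-frequent-pair merging, and the sampling framework of PPE is not reproduced. Truncating the merge list is shown only as an assumed way to see how different numbers of merge steps lead to different segmentations of the same sequence.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn up to `num_merges` greedy pair merges from a list of protein sequences."""
    # Start with every sequence split into single amino-acid symbols.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic (an assumption of this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Toy training set (hypothetical); real use would train on e.g. Swiss-Prot.
merges = learn_merges(["RGDRGD", "ARGDA"], num_merges=2)
print(merges)                         # [('R', 'G'), ('RG', 'D')]
print(segment("KRGDS", merges))       # ['K', 'RGD', 'S']
print(segment("KRGDS", merges[:1]))   # ['K', 'RG', 'D', 'S']
```

Replaying only a prefix of the learned merges gives a coarser or finer segmentation of the same sequence, which is one simple way to see why a single sequence can admit several valid segmentations.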

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: AC02-05CH11231

OSTI ID: 1559145

Country of Publication: United States

Language: English
