Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)
Abstract
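In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning tasks in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) in sequence pattern searching for nuclear localization signal. The DiMotif, in general, obtained high recall scores, while having a comparable F1 score with other methods in the discovery of experimentally verified motifs. Having high recall suggests that the DiMotif can be used for short-list creation for further experimental investigations on motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features.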
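To make the BPE-inspired idea in the abstract concrete, the following is a minimal Python sketch of learning merge operations over amino-acid sequences and applying them to an unseen sequence. It is an illustration only, not the authors' implementation: the function names (`learn_merges`, `apply_merge`, `segment`, `sample_segmentations`) are hypothetical, the toy sequences stand in for a corpus such as Swiss-Prot, and the sampling step (varying how many merges are applied) is an assumed stand-in for the sampling framework, whose details this record does not give.

```python
import random
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(sequences, num_merges):
    """Learn BPE-style merges: repeatedly join the most frequent adjacent pair."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Segment a (possibly unseen) sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def sample_segmentations(sequence, merges, n_samples, rng):
    """ASSUMPTION: obtain alternative segmentations by applying a random-length
    prefix of the merge list; the paper's actual sampling scheme may differ."""
    return [
        segment(sequence, merges[: rng.randint(0, len(merges))])
        for _ in range(n_samples)
    ]


# Toy usage: two short sequences sharing the RGD tripeptide.
merges = learn_merges(["RGDRGD", "ARGDA"], num_merges=2)
print(merges)                    # [('R', 'G'), ('RG', 'D')]
print(segment("KRGDK", merges))  # ['K', 'RGD', 'K']
print(sample_segmentations("KRGDK", merges, 3, random.Random(0)))
```

Replaying merges in the order they were learned is what lets steps trained on one corpus be applied to unseen sequences; every sampled segmentation still concatenates back to the original sequence.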
- Authors:
- Asgari, Ehsaneddin; McHardy, Alice C.; Mofrad, Mohammad R. K.
- Univ. of California, Berkeley, CA (United States). Molecular Cell Biomechanics Lab., Depts. of Bioengineering and Mechanical Engineering; Helmholtz Centre for Infection Research, Brunswick (Germany)
- Helmholtz Centre for Infection Research, Brunswick (Germany)
- Univ. of California, Berkeley, CA (United States). Molecular Cell Biomechanics Lab., Depts. of Bioengineering and Mechanical Engineering
- Publication Date:
- 2019-03-05
- Research Org.:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC)
- OSTI Identifier:
- 1559191
- Grant/Contract Number:
- AC02-05CH11231
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Scientific Reports
- Additional Journal Information:
- Journal Volume: 9; Journal Issue: 1; Journal ID: ISSN 2045-2322
- Publisher:
- Nature Publishing Group
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES
Citation Formats
Asgari, Ehsaneddin, McHardy, Alice C., and Mofrad, Mohammad R. K. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). United States: N. p., 2019. Web. doi:10.1038/s41598-019-38746-w.
Asgari, Ehsaneddin, McHardy, Alice C., & Mofrad, Mohammad R. K. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). United States. https://doi.org/10.1038/s41598-019-38746-w
Asgari, Ehsaneddin, McHardy, Alice C., and Mofrad, Mohammad R. K. 2019. "Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)". United States. https://doi.org/10.1038/s41598-019-38746-w. https://www.osti.gov/servlets/purl/1559191.
@article{osti_1559191,
title = {Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)},
author = {Asgari, Ehsaneddin and McHardy, Alice C. and Mofrad, Mohammad R. K.},
abstractNote = {In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning tasks in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) in sequence pattern searching for nuclear localization signal. The DiMotif, in general, obtained high recall scores, while having a comparable F1 score with other methods in the discovery of experimentally verified motifs. Having high recall suggests that the DiMotif can be used for short-list creation for further experimental investigations on motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. 
(ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features.},
doi = {10.1038/s41598-019-38746-w},
journal = {Scientific Reports},
number = 1,
volume = 9,
place = {United States},
year = {2019},
month = mar
}