Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)

Asgari, Ehsaneddin; McHardy, Alice C.; Mofrad, Mohammad R. K.

doi:10.1038/s41598-019-38746-w

Journal Article · March 5, 2019 · Scientific Reports

DOI: https://doi.org/10.1038/s41598-019-38746-w · OSTI ID: 1559191

Asgari, Ehsaneddin [1]; McHardy, Alice C. [2]; Mofrad, Mohammad R. K. [3]

[1] Univ. of California, Berkeley, CA (United States). Molecular Cell Biomechanics Lab., Depts. of Bioengineering and Mechanical Engineering; Helmholtz Centre for Infection Research, Brunswick (Germany)
[2] Helmholtz Centre for Infection Research, Brunswick (Germany)
[3] Univ. of California, Berkeley, CA (United States). Molecular Cell Biomechanics Lab., Depts. of Bioengineering and Mechanical Engineering

In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or a domain-specific dataset and then applied to a set of unseen sequences. This representation can be used as the input to any downstream machine learning task in protein bioinformatics. Here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate it for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall scores while having an F1 score comparable to other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short lists for further experimental investigation of motifs. In the classification-based evaluation, the extracted motifs could reliably detect integrins, integrin-binding proteins, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new embedding method can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features.
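The segmentation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it shows plain BPE-style merge learning over amino-acid strings, followed by one simple way to obtain several segmentations of the same sequence. All function names (`learn_merges`, `apply_merge`, `segment`, `sample_segmentations`) are hypothetical, and drawing the number of applied merges uniformly at random is an assumption made here for illustration; the abstract states only that a sampling framework allows multiple segmentations.

```python
import random
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of BPE-style merge operations from protein sequences."""
    corpus = [list(seq) for seq in sequences]  # start from single amino acids
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties are broken lexicographically
        # so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


def sample_segmentations(sequence, merges, n_samples, rng):
    """Assumption for illustration: vary how many merges are applied
    (drawn uniformly here) to get several segmentations of one sequence."""
    samples = []
    for _ in range(n_samples):
        k = rng.randint(0, len(merges))
        samples.append(segment(sequence, merges[:k]))
    return samples


if __name__ == "__main__":
    training = ["RGDS", "RGDA", "KRGD"]  # toy data, not Swiss-Prot
    merges = learn_merges(training, num_merges=2)
    print(segment("ARGDK", merges))
    print(sample_segmentations("ARGDK", merges, n_samples=3, rng=random.Random(0)))
```

In this toy setting the merges are learned once on a training set and then replayed on an unseen sequence, mirroring the learn-then-apply workflow the abstract describes. Each sampled segmentation concatenates back to the original sequence, so the sub-sequences can serve directly as variable-length tokens for downstream features.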

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC)

Grant/Contract Number: AC02-05CH11231

OSTI ID: 1559191

Journal Information: Scientific Reports, Vol. 9, Issue 1; ISSN 2045-2322

Publisher: Nature Publishing Group

Country of Publication: United States

Language: English

Related Subjects

59 BASIC BIOLOGICAL SCIENCES