Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)
Journal Article · Scientific Reports
- Univ. of California, Berkeley, CA (United States). Molecular Cell Biomechanics Lab., Depts. of Bioengineering and Mechanical Engineering; Helmholtz Centre for Infection Research, Brunswick (Germany)
- Helmholtz Centre for Infection Research, Brunswick (Germany)
- Univ. of California, Berkeley, CA (United States). Molecular Cell Biomechanics Lab., Depts. of Bioengineering and Mechanical Engineering
In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or a domain-specific dataset and then applied to a set of unseen sequences. This representation can be used as the input to any downstream machine learning task in protein bioinformatics. Here, we introduce this representation through protein motif discovery and protein sequence embedding.
(i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate it for finding protein motifs in three settings: (1) a comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based evaluation of the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall scores while achieving F1 scores comparable to those of other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short-lists for further experimental investigation of motifs. In the classification-based evaluation, the extracted motifs could reliably detect integrins, integrin-binding proteins, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores.
(ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new embedding method can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features.
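The BPE-inspired segmentation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`learn_merges`, `apply_merge`, `segment`) and the toy sequences are hypothetical, and the sampling framework is only loosely approximated here by applying a truncated prefix of the merge list to obtain alternative segmentations of the same sequence.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of BPE-style merges from amino-acid strings."""
    # Start from single residues, as plain BPE starts from single characters.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by comparing the pairs
        # themselves (an assumption made here for determinism).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


if __name__ == "__main__":
    # Toy training set (hypothetical), standing in for a corpus like Swiss-Prot.
    merges = learn_merges(["RGDRGD", "RGDS"], num_merges=2)
    print(segment("ARGDK", merges))      # full merge list
    print(segment("ARGDK", merges[:1]))  # coarser alternative segmentation
```

Replaying fewer merges yields shorter sub-sequences and replaying more yields longer ones, which is one simple way to see how a single sequence can admit several variable-length segmentations.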
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- Grant/Contract Number:
- AC02-05CH11231
- OSTI ID:
- 1559191
- Journal Information:
- Journal Name: Scientific Reports; Journal Volume: 9; Journal Issue: 1; ISSN 2045-2322
- Publisher:
- Nature Publishing Group
- Country of Publication:
- United States
- Language:
- English
Learning supervised embeddings for large scale sequence comparisons | journal | March 2020
Identifying SNAREs by Incorporating Deep Learning Architecture and Amino Acid Embedding Representation | journal | December 2019
Modeling aspects of the language of life through transfer-learning protein sequences | text | January 2019
iN6-methylat (5-step): identifying DNA N6-methyladenine sites in rice genome using continuous bag of nucleobases via Chou’s 5-step rule | journal | May 2019
Modeling aspects of the language of life through transfer-learning protein sequences | journal | December 2019
Similar Records
Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)
Journal Article · Wed Jun 13 00:00:00 EDT 2018 · OSTI ID:1559145
Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics
Journal Article · Mon Nov 09 19:00:00 EST 2015 · PLoS ONE · OSTI ID:1627767
Predicting variable gene content in Escherichia coli using conserved genes
Journal Article · Tue Jun 13 20:00:00 EDT 2023 · mSystems · OSTI ID:2324772