U.S. Department of Energy
Office of Scientific and Technical Information

Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics

Journal Article · PLoS ONE
 [1];  [2];
  1. Univ. of California, Berkeley, CA (United States). Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering; DOE/OSTI
  2. Univ. of California, Berkeley, CA (United States). Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering; Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Physical Biosciences Div.
We introduce a new representation and feature extraction method for biological sequences. We call the general representation bio-vectors (BioVec), with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences. This representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors, which can be used in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, obtaining an average family classification accuracy of 93% ± 0.06%, which outperforms existing family classification methods. In addition, we use the ProtVec representation to distinguish disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database and a database featuring the disordered regions of nucleoporins rich in phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that, given only sequence data for various proteins, this model can determine accurate information about protein structure. Importantly, the model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics.
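The idea of representing a protein sequence with a single dense n-dimensional vector can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not stated in the abstract: the sequence is split into overlapping 3-mers, each 3-mer is mapped to a vector, and the vectors are summed. The function names (`kmers`, `toy_kmer_vector`, `sequence_vector`) and the dimension `DIM = 8` are hypothetical, and the per-k-mer vectors are deterministic placeholders derived from a hash, standing in for embeddings that a trained neural network would supply.

```python
import hashlib
from typing import List

DIM = 8  # illustrative only; the abstract says just "n-dimensional"


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Return all overlapping k-mers of a sequence (k=3 is an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_kmer_vector(kmer: str, dim: int = DIM) -> List[float]:
    """Placeholder embedding: deterministic values in [-1, 1) from a hash.

    This is NOT a trained embedding; a real model would supply learned
    vectors here. dim must be at most 32 (the SHA-256 digest length).
    """
    digest = hashlib.sha256(kmer.encode("ascii")).digest()
    return [digest[i] / 128.0 - 1.0 for i in range(dim)]


def sequence_vector(sequence: str, k: int = 3, dim: int = DIM) -> List[float]:
    """Sum the k-mer vectors into one fixed-length vector for the sequence.

    Summation is one simple way to pool k-mer vectors (an assumption here).
    Sequences shorter than k yield the zero vector.
    """
    total = [0.0] * dim
    for kmer in kmers(sequence, k):
        for j, value in enumerate(toy_kmer_vector(kmer, dim)):
            total[j] += value
    return total


if __name__ == "__main__":
    # Sequences of different lengths map to vectors of the same length.
    short_vec = sequence_vector("MKTAYIAKQR")
    long_vec = sequence_vector("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
    print(len(short_vec), len(long_vec))
```

Because every sequence maps to a vector of the same length regardless of how long the sequence is, such vectors could be passed directly to a standard classifier such as a support vector machine.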
Research Organization:
Lawrence Berkeley National Lab (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
National Science Foundation (NSF); USDOE Office of Science (SC)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1627767
Journal Information:
PLoS ONE, Vol. 10, Issue 11; ISSN 1932-6203
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English
