U.S. Department of Energy
Office of Scientific and Technical Information

Large language models generate functional protein sequences across diverse families

Journal Article · Nature Biotechnology
 [1];  [2];  [3];  [4];  [5];  [6];  [3];  [2];  [5];  [2];  [3];  [2]
  1. Salesforce Research, Palo Alto, CA (United States); Profluent Bio, San Francisco, CA
  2. Salesforce Research, Palo Alto, CA (United States)
  3. Univ. of California, San Francisco, CA (United States)
  4. Univ. of California, Berkeley, CA (United States)
  5. Tierra Biosciences, San Leandro, CA (United States)
  6. Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); SLAC National Accelerator Laboratory (SLAC), Menlo Park, CA (United States). Stanford Synchrotron Radiation Lightsource (SSRL); Univ. of California, San Francisco, CA (United States)
Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned on curated sequences and tags to improve controllable generation of proteins from families with sufficient homologous samples. Artificial proteins generated after fine-tuning on five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
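The abstract reports sequence identity to natural proteins as low as 31.4%. As a rough illustration of what such a figure measures, the following is a minimal sketch that computes percent identity between two sequences that are assumed to be already aligned. The function name `percent_identity`, the gap character, the choice to count only columns where both sequences carry a residue, and the toy sequences are all assumptions made for this example; the record does not state which alignment method or identity definition the authors used.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where both aligned sequences have a residue.

    Assumption: the inputs come from a pairwise alignment, so they have the
    same length and use `gap` to mark insertions and deletions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns with a residue in both sequences
    matches = 0   # compared columns where the residues are identical
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if res_a == res_b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned sequences (hypothetical, not taken from the paper).
generated = "MKT-AYIAK"
natural = "MRTLAY-AR"
print(f"{percent_identity(generated, natural):.1f}% identity")  # prints "71.4% identity"
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, so identity values are comparable only when the same definition is used throughout.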
Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
National Institutes of Health (NIH); USDOE Office of Science (SC), Basic Energy Sciences (BES); USDOE Office of Science (SC), Biological and Environmental Research (BER)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
2282481
Journal Information:
Journal Name: Nature Biotechnology; Journal Volume: 41; Journal Issue: 8; ISSN 1087-0156
Publisher:
Springer Nature
Country of Publication:
United States
Language:
English
