Large language models generate functional protein sequences across diverse families
Journal Article · Nature Biotechnology
- Salesforce Research, Palo Alto, CA (United States); Profluent Bio, San Francisco, CA
- Salesforce Research, Palo Alto, CA (United States)
- Univ. of California, San Francisco, CA (United States)
- Univ. of California, Berkeley, CA (United States)
- Tierra Biosciences, San Leandro, CA (United States)
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); SLAC National Accelerator Laboratory (SLAC), Menlo Park, CA (United States). Stanford Synchrotron Radiation Lightsource (SSRL); Univ. of California, San Francisco, CA (United States)
Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned on curated sequences and tags to improve controllable generation of proteins from families with sufficient homologous samples. Artificial proteins generated after fine-tuning on five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- National Institutes of Health (NIH); USDOE Office of Science (SC), Basic Energy Sciences (BES); USDOE Office of Science (SC), Biological and Environmental Research (BER)
- Grant/Contract Number:
- AC02-05CH11231
- OSTI ID:
- 2282481
- Journal Information:
- Journal Name: Nature Biotechnology; Journal Volume: 41; Journal Issue: 8; ISSN 1087-0156
- Publisher:
- Springer Nature
- Country of Publication:
- United States
- Language:
- English
Similar Records
An Introduction to Word Embeddings and Language Models · Technical Report · Wed Mar 31 20:00:00 EDT 2021 · OSTI ID: 1773690
Computational models of natural language processing · Book · Sat Dec 31 23:00:00 EST 1983 · OSTI ID: 6679331
Understanding digital-system specifications written in natural language · Book · Tue Dec 31 23:00:00 EST 1985 · OSTI ID: 7043171