Large language models generate functional protein sequences across diverse families

Madani, Ali; Krause, Ben; Greene, Eric R.; Subramanian, Subu; Mohr, Benjamin P.; Holton, James M.; Olmos, Jr., Jose Luis; Xiong, Caiming; Sun, Zachary Z.; Socher, Richard; Fraser, James S.; Naik, Nikhil

doi:10.1038/s41587-022-01618-2

Journal Article · 26 January 2023 · Nature Biotechnology

DOI: https://doi.org/10.1038/s41587-022-01618-2 · OSTI ID: 2282481

Madani, Ali ^[1]; Krause, Ben ^[2]; Greene, Eric R. ^[3]; Subramanian, Subu ^[4]; Mohr, Benjamin P. ^[5]; Holton, James M. ^[6]; Olmos, Jr., Jose Luis ^[3]; Xiong, Caiming ^[2]; Sun, Zachary Z. ^[5]; Socher, Richard ^[2]; Fraser, James S. ^[3]; Naik, Nikhil ^[2]

^[1] Salesforce Research, Palo Alto, CA (United States); Profluent Bio, San Francisco, CA
^[2] Salesforce Research, Palo Alto, CA (United States)
^[3] Univ. of California, San Francisco, CA (United States)
^[4] Univ. of California, Berkeley, CA (United States)
^[5] Tierra Biosciences, San Leandro, CA (United States)
^[6] Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); SLAC National Accelerator Laboratory (SLAC), Menlo Park, CA (United States). Stanford Synchrotron Radiation Lightsource (SSRL); Univ. of California, San Francisco, CA (United States)

Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here, we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned on curated sequences and tags to improve controllable generation of proteins from families with sufficient homologous samples. Artificial proteins generated after fine-tuning on five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
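
The control-tag conditioning mentioned in the abstract can be pictured with a small sketch. The Python below is an illustration under stated assumptions, not ProGen's actual tokenizer or vocabulary: the tag strings, the function names `build_vocab` and `encode_example`, and the id assignment are all hypothetical. It shows only the general idea of prepending property tags to a residue sequence, so that an autoregressive model would see the tags as context before predicting amino acids.

```python
from typing import Dict, List

# The 20 standard amino acids, one letter each.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_vocab(control_tags: List[str]) -> Dict[str, int]:
    """Assign integer ids to control tags first, then to amino acids.

    Hypothetical scheme for illustration; the real model's vocabulary
    is not described in this record.
    """
    tokens = list(control_tags) + list(AMINO_ACIDS)
    return {token: index for index, token in enumerate(tokens)}


def encode_example(tags: List[str], sequence: str, vocab: Dict[str, int]) -> List[int]:
    """Return tag ids followed by residue ids as one token sequence."""
    tag_ids = [vocab[tag] for tag in tags]
    residue_ids = [vocab[residue] for residue in sequence]
    return tag_ids + residue_ids


# Hypothetical tag names; any property labels could stand in here.
tags = ["<family:lysozyme>", "<taxon:phage>"]
vocab = build_vocab(tags)
ids = encode_example(tags, "MNIF", vocab)
print(ids)
```

In a setup like this, generation for a chosen family would start from the tag ids alone and sample residues one token at a time; that sampling step is omitted from the sketch.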

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: National Institutes of Health (NIH); USDOE Office of Science (SC), Basic Energy Sciences (BES); USDOE Office of Science (SC), Biological and Environmental Research (BER)

Grant/Contract Number: AC02-05CH11231

OSTI ID: 2282481

Journal Information: Nature Biotechnology, Vol. 41, Issue 8; ISSN 1087-0156

Publisher: Springer Nature

Country of Publication: United States

Language: English

Related Subjects

59 BASIC BIOLOGICAL SCIENCES
enzymes
machine learning
proteomics
