DOE PAGES · U.S. Department of Energy, Office of Scientific and Technical Information

Title: Plug & play directed evolution of proteins with gradient-based discrete MCMC

Journal Article · Machine Learning: Science and Technology

Abstract:
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations that improve the function of a known protein. We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models, such as protein language models, and supervised models that predict protein function from sequence. By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins. Our framework achieves this without any model fine-tuning or re-training by constructing a product of experts distribution directly in discrete protein space. Instead of resorting to brute force search or random sampling, which is typical of classic directed evolution, we introduce a fast Markov chain Monte Carlo sampler that uses gradients to propose promising mutations. We conduct in silico directed evolution experiments on wide fitness landscapes and across a range of different pre-trained unsupervised models, including a 650 M parameter protein language model. Our results demonstrate an ability to efficiently discover variants with high evolutionary likelihood as well as estimated activity multiple mutations away from a wild type protein, suggesting our sampler provides a practical and effective new paradigm for machine-learning-based protein engineering.
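
The two ideas named in the abstract, a product of experts over discrete sequences and a Markov chain Monte Carlo sampler whose proposals are guided by gradients, can be illustrated with a minimal sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the 4-letter alphabet, the random per-site score tables standing in for an unsupervised and a supervised model, the expert weights, and the factor of 1/2 used to temper the proposal. Because the toy experts are linear in the one-hot encoding, the gradient is written by hand and the first-order estimate of each mutation's effect is exact; with neural network experts it would come from automatic differentiation and would only be approximate.

```python
import math
import random

ALPHABET = "ACDE"  # toy 4-letter alphabet (assumption; real proteins use 20 amino acids)

def make_expert(length, rng):
    """Random per-site score table: a hypothetical stand-in for a trained model."""
    return [[rng.gauss(0.0, 1.0) for _ in ALPHABET] for _ in range(length)]

def expert_score(table, seq):
    """Score that is linear in the one-hot encoding of seq."""
    return sum(table[i][ALPHABET.index(a)] for i, a in enumerate(seq))

def log_poe(seq, experts, weights):
    """Unnormalized log-probability of the product of experts."""
    return sum(w * expert_score(t, seq) for t, w in zip(experts, weights))

def grad_one_hot(seq, experts, weights):
    """Gradient of log_poe w.r.t. the one-hot input (analytic for these toy experts)."""
    return [[sum(w * t[i][k] for t, w in zip(experts, weights))
             for k in range(len(ALPHABET))] for i in range(len(seq))]

def proposal(seq, experts, weights):
    """Distribution over single-site mutations (site, letter index)."""
    g = grad_one_hot(seq, experts, weights)
    logits = {}
    for i, a in enumerate(seq):
        cur = ALPHABET.index(a)
        for k in range(len(ALPHABET)):
            if k != cur:
                # First-order estimate of the change in log_poe, tempered by 1/2.
                logits[(i, k)] = 0.5 * (g[i][k] - g[i][cur])
    m = max(logits.values())
    z = sum(math.exp(v - m) for v in logits.values())
    return {move: math.exp(v - m) / z for move, v in logits.items()}

def mh_step(seq, experts, weights, rng):
    """One Metropolis-Hastings step using the gradient-informed proposal."""
    q_fwd = proposal(seq, experts, weights)
    moves = list(q_fwd)
    i, k = rng.choices(moves, weights=[q_fwd[mv] for mv in moves])[0]
    new = seq[:i] + ALPHABET[k] + seq[i + 1:]
    q_rev = proposal(new, experts, weights)
    back = (i, ALPHABET.index(seq[i]))  # the move that undoes the mutation
    log_alpha = (log_poe(new, experts, weights) - log_poe(seq, experts, weights)
                 + math.log(q_rev[back]) - math.log(q_fwd[(i, k)]))
    accept = rng.random() < math.exp(min(0.0, log_alpha))
    return new if accept else seq

def run_chain(wild_type, experts, weights, steps, seed=0):
    """Run the sampler from the wild type; return the final and best-scoring sequences."""
    rng = random.Random(seed)
    seq, best = wild_type, wild_type
    for _ in range(steps):
        seq = mh_step(seq, experts, weights, rng)
        if log_poe(seq, experts, weights) > log_poe(best, experts, weights):
            best = seq
    return seq, best

rng = random.Random(1)
experts = [make_expert(8, rng), make_expert(8, rng)]  # "unsupervised", "supervised" stand-ins
weights = [1.0, 2.0]  # hypothetical expert weights
wild_type = "AAAAAAAA"
final, best = run_chain(wild_type, experts, weights, steps=200)
print(final, best)
```

Taking a weighted sum of the experts' log scores is what makes the composition work without re-training: each model contributes its score as is, and only the sampler sees the combined distribution. The acceptance step corrects for the asymmetric proposal, so the chain still targets the product of experts rather than the proposal's own preferences.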

Research Organization:
National Renewable Energy Laboratory (NREL), Golden, CO (United States)
Sponsoring Organization:
USDOE; USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Energy Efficiency and Renewable Energy (EERE); USDOE Office of Energy Efficiency and Renewable Energy (EERE), Office of Sustainable Transportation. Bioenergy Technologies Office (BETO); USDOE Office of Science (SC)
Grant/Contract Number:
AC05-00OR22725; AC36-08GO28308
OSTI ID:
1971401
Report Number(s):
NREL/JA-2C00-84201
Journal Information:
Machine Learning: Science and Technology, Vol. 4, Issue 2; ISSN 2632-2153
Publisher:
IOP Publishing
Country of Publication:
United Kingdom
Language:
English
