OSTI.GOV: U.S. Department of Energy, Office of Scientific and Technical Information

Title: Plug & play directed evolution of proteins with gradient-based discrete MCMC

Journal Article · Machine Learning: Science and Technology

Abstract:
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations that improve the function of a known protein. We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models, such as protein language models, and supervised models that predict protein function from sequence. By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins. Our framework achieves this without any model fine-tuning or re-training by constructing a product of experts distribution directly in discrete protein space. Instead of resorting to brute-force search or random sampling, as is typical of classic directed evolution, we introduce a fast Markov chain Monte Carlo sampler that uses gradients to propose promising mutations. We conduct in silico directed evolution experiments on wide fitness landscapes and across a range of different pre-trained unsupervised models, including a 650M-parameter protein language model. Our results demonstrate an ability to efficiently discover variants with high evolutionary likelihood as well as estimated activity multiple mutations away from a wild type protein, suggesting our sampler provides a practical and effective new paradigm for machine-learning-based protein engineering.
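The abstract names two ingredients: a product of experts defined directly over discrete sequences, and an MCMC sampler whose proposals are guided by gradients. The following is a minimal pure-Python sketch of how such pieces can fit together. It is not the authors' implementation. The two experts are hypothetical toy models with random parameters (`make_toy_experts`), the expert weights `lam` and every function name are assumptions made for illustration, and the proposal shown is one common first-order, locally informed scheme with a Metropolis-Hastings correction, which may differ from the sampler in the article.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def make_toy_experts(length, rng):
    """Hypothetical stand-ins for real models, with random parameters."""
    n = len(ALPHABET)
    # "Unsupervised" expert: site-independent log-scores (profile-like).
    h = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(length)]
    # "Supervised" expert: tanh of a linear score on the one-hot encoding.
    w = [[rng.gauss(0.0, 0.5) for _ in range(n)] for _ in range(length)]
    return h, w


def log_prob_and_grad(seq, h, w, lam=(1.0, 1.0)):
    """Unnormalized product-of-experts log-probability of seq (a list of
    residue indices) and its gradient w.r.t. the one-hot encoding."""
    s_unsup = sum(h[i][a] for i, a in enumerate(seq))
    s_sup = math.tanh(sum(w[i][a] for i, a in enumerate(seq)))
    logp = lam[0] * s_unsup + lam[1] * s_sup  # weighted sum of expert scores
    dtanh = 1.0 - s_sup ** 2
    grad = [[lam[0] * h[i][a] + lam[1] * dtanh * w[i][a]
             for a in range(len(ALPHABET))] for i in range(len(seq))]
    return logp, grad


def proposal_logits(seq, grad):
    """Half the first-order estimate of the log-prob change per mutation."""
    return {(i, a): 0.5 * (grad[i][a] - grad[i][cur])
            for i, cur in enumerate(seq)
            for a in range(len(ALPHABET)) if a != cur}


def pick(logits, rng, choice=None):
    """Sample a mutation from softmax(logits), or score a given one.
    Returns (mutation, log proposal probability)."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    if choice is None:
        keys = list(logits)
        weights = [math.exp(logits[k] - log_z) for k in keys]
        choice = rng.choices(keys, weights=weights, k=1)[0]
    return choice, logits[choice] - log_z


def mcmc_step(seq, h, w, rng):
    """One Metropolis-Hastings step with a gradient-informed proposal."""
    logp, grad = log_prob_and_grad(seq, h, w)
    (i, a), log_q_fwd = pick(proposal_logits(seq, grad), rng)
    new_seq = list(seq)
    new_seq[i] = a
    new_logp, new_grad = log_prob_and_grad(new_seq, h, w)
    # Probability of proposing the reverse mutation from the new sequence.
    _, log_q_rev = pick(proposal_logits(new_seq, new_grad), rng,
                        choice=(i, seq[i]))
    log_accept = new_logp - logp + log_q_rev - log_q_fwd
    if log_accept >= 0 or rng.random() < math.exp(log_accept):
        return new_seq, new_logp
    return list(seq), logp


def run_chain(wild_type, steps=200, seed=0):
    """Run the sampler from a wild-type string; track the best sequence seen."""
    rng = random.Random(seed)
    h, w = make_toy_experts(len(wild_type), rng)
    seq = [ALPHABET.index(c) for c in wild_type]
    start_logp, _ = log_prob_and_grad(seq, h, w)
    best_seq, best_logp = list(seq), start_logp
    for _ in range(steps):
        seq, logp = mcmc_step(seq, h, w, rng)
        if logp > best_logp:
            best_seq, best_logp = list(seq), logp
    return "".join(ALPHABET[a] for a in best_seq), best_logp, start_logp
```

For example, `run_chain("MKTAYIAKQR", steps=100, seed=1)` returns the highest-scoring sequence visited together with its score and the wild-type score under the toy experts. In a realistic setting the toy experts would be replaced by pre-trained models, and the gradient would come from automatic differentiation rather than the closed form used here.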

Research Organization:
National Renewable Energy Laboratory (NREL), Golden, CO (United States)
Sponsoring Organization:
USDOE Office of Energy Efficiency and Renewable Energy (EERE); USDOE Office of Energy Efficiency and Renewable Energy (EERE), Office of Sustainable Transportation. Bioenergy Technologies Office (BETO); USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC)
Grant/Contract Number:
AC36-08GO28308; AC05-00OR22725
OSTI ID:
1971401
Alternate ID(s):
OSTI ID: 1968632; OSTI ID: 1969637
Report Number(s):
NREL/JA-2C00-84201
Journal Information:
Journal Name: Machine Learning: Science and Technology; Journal Volume: 4; Journal Issue: 2; ISSN 2632-2153
Publisher:
IOP Publishing
Country of Publication:
United Kingdom
Language:
English
