U.S. Department of Energy
Office of Scientific and Technical Information

DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering

Journal Article · ACS Synthetic Biology

With advances in machine learning (ML)-assisted protein engineering, models based on data, biophysics, and natural evolution are being used to propose informed libraries of protein variants to explore. Synthesizing these libraries for experimental screens is a major bottleneck, as the cost of obtaining large numbers of exact gene sequences is often prohibitive. Degenerate codon (DC) libraries are a cost-effective alternative for generating combinatorial mutagenesis libraries where mutations are targeted to a handful of amino acid sites. However, existing computational methods to optimize DC libraries to include desired protein variants are not well suited to designing libraries for ML-assisted protein engineering. To address these drawbacks, we present DEgenerate Codon Optimization for Informed Libraries (DeCOIL), a generalized method that directly optimizes DC libraries to be useful for protein engineering: to sample protein variants that are likely to have both high fitness and high diversity in the sequence search space. Using computational simulations and wet-lab experiments, we demonstrate that DeCOIL is effective across two specific case studies, with potential to be applied to many other use cases. DeCOIL offers several advantages over existing methods, as it is direct, easy-to-use, generalizable, and scalable. With accompanying software, DeCOIL can be readily implemented to generate desired informed libraries.
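For readers unfamiliar with degenerate codons, the following is a minimal illustrative sketch of what a DC library encodes. It is not the DeCOIL software or its optimization method; it only expands a degenerate codon written in standard IUPAC ambiguity codes (for example NNK) into its concrete codons and tallies the amino acids they encode under the standard genetic code. The function names `expand_codon`, `amino_acid_counts`, and `library_dna_size` are hypothetical and chosen for this example.

```python
from collections import Counter
from itertools import product

# IUPAC nucleotide ambiguity codes (standard definitions).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, with bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def expand_codon(degenerate_codon):
    """Return every concrete codon covered by a 3-letter degenerate codon."""
    if len(degenerate_codon) != 3:
        raise ValueError("a codon must have exactly three positions")
    choices = [IUPAC[base] for base in degenerate_codon.upper()]
    return ["".join(bases) for bases in product(*choices)]


def amino_acid_counts(degenerate_codon):
    """Count how many codons encode each amino acid ('*' marks a stop)."""
    return Counter(CODON_TABLE[c] for c in expand_codon(degenerate_codon))


def library_dna_size(degenerate_codons):
    """Number of distinct DNA sequences when one DC is used per site."""
    size = 1
    for dc in degenerate_codons:
        size *= len(expand_codon(dc))
    return size


if __name__ == "__main__":
    counts = amino_acid_counts("NNK")
    print(len(expand_codon("NNK")), "codons")
    print(sorted(counts.items()))
    print(library_dna_size(["NNK"] * 4), "DNA sequences across four sites")
```

Because the amino acids are not encoded by equal numbers of codons, a screen that samples such a library draws protein variants unevenly, and some draws contain stop codons. Choosing which degenerate codon to place at each site therefore changes which variants are likely to be sampled, which is the kind of choice that DC library optimization is concerned with.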

Research Organization:
California Institute of Technology (CalTech), Pasadena, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Basic Energy Sciences (BES)
Grant/Contract Number:
SC0022218
OSTI ID:
1992599
Journal Information:
Journal Name: ACS Synthetic Biology; Journal Volume: 12; Journal Issue: 8; ISSN 2161-5063
Publisher:
American Chemical Society (ACS)
Country of Publication:
United States
Language:
English
