U.S. Department of Energy
Office of Scientific and Technical Information

Adaptive language model training for molecular design

Journal Article · Journal of Cheminformatics
Abstract

The vast size of chemical space necessitates computational approaches that automate and accelerate the design of molecular sequences to guide experimental efforts in drug discovery. Genetic algorithms provide a useful framework for generating molecules incrementally by applying mutations to known chemical structures. Recently, masked language models have been applied to automate the mutation process: they leverage large compound libraries to learn commonly occurring chemical sequences (i.e., using tokenization) and to predict rearrangements (i.e., using mask prediction). Here, we consider how language models can be adapted to improve molecule generation for different optimization tasks. We compare two generation strategies, fixed and adaptive. The fixed strategy uses a pre-trained model to generate mutations; the adaptive strategy trains the language model on each new generation of molecules selected for target properties during optimization. Our results show that the adaptive strategy allows the language model to fit the distribution of molecules in the population more closely. For enhanced fitness optimization, we therefore suggest using the fixed strategy during an initial phase, followed by the adaptive strategy. We demonstrate the impact of adaptive training by searching for molecules that optimize two heuristic metrics, drug-likeness and synthesizability, as well as predicted protein binding affinity from a surrogate model. The adaptive strategy provides a significant improvement in fitness optimization compared to the fixed pre-trained model, empowering the application of language models to molecular design tasks.
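
The difference between the fixed and adaptive strategies can be illustrated with a minimal sketch. It rests on heavy assumptions and is not the authors' implementation: a smoothed token-frequency table stands in for the masked language model, a four-letter alphabet stands in for tokenized chemical sequences, and `toy_fitness` is a hypothetical placeholder for metrics such as drug-likeness. All function and variable names are invented for illustration.

```python
import random
from collections import Counter

ALPHABET = "CNOF"  # toy vocabulary (assumption); real work tokenizes chemical sequences


def fit_token_model(population):
    """Stand-in for (re)training a language model: add-one smoothed token counts."""
    counts = Counter(ch for seq in population for ch in seq)
    return {tok: counts[tok] + 1 for tok in ALPHABET}


def mutate(seq, model, rng):
    """Mask one random position and fill it with a token sampled from the model."""
    pos = rng.randrange(len(seq))
    tokens = list(model)
    weights = [model[t] for t in tokens]
    new_tok = rng.choices(tokens, weights=weights, k=1)[0]
    return seq[:pos] + new_tok + seq[pos + 1:]


def toy_fitness(seq):
    """Hypothetical score: number of 'N' tokens (placeholder for a real metric)."""
    return seq.count("N")


def run(strategy, generations=20, pop_size=30, seq_len=12, seed=0):
    """Evolve a population with either the 'fixed' or the 'adaptive' strategy."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in range(seq_len))
        for _ in range(pop_size)
    ]
    model = fit_token_model(population)  # plays the role of the pre-trained model
    for _ in range(generations):
        children = [mutate(rng.choice(population), model, rng) for _ in range(pop_size)]
        # Selection: keep the fittest pop_size sequences among parents and children.
        ranked = sorted(population + children, key=toy_fitness, reverse=True)
        population = ranked[:pop_size]
        if strategy == "adaptive":
            # Adaptive strategy: refit the model on the newly selected generation.
            model = fit_token_model(population)
    return population, model


if __name__ == "__main__":
    for strategy in ("fixed", "adaptive"):
        final_population, _ = run(strategy)
        mean_fitness = sum(map(toy_fitness, final_population)) / len(final_population)
        print(strategy, round(mean_fitness, 2))
```

In this toy setting, the only difference between the two runs is the single refit step inside the loop. Under the fixed strategy the mutation distribution never changes, while under the adaptive strategy it drifts toward the tokens that survive selection.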

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE; USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1984313
Alternate ID(s):
OSTI ID: 1987804
Journal Information:
Journal Name: Journal of Cheminformatics; Journal Volume: 15; Journal Issue: 1; ISSN 1758-2946
Publisher:
Springer Science + Business Media
Country of Publication:
United Kingdom
Language:
English
