DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules

Abstract

We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.
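
The latent-space operations named in the abstract (perturbing a known structure and interpolating between two molecules) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `interpolate` and `perturb`, the three-dimensional toy vectors, and the use of straight-line interpolation are not taken from the article, and the trained encoder and decoder are not reproduced. In practice, each resulting vector would be passed to the decoder to obtain a candidate molecule.

```python
import random


def interpolate(z_a, z_b, steps):
    """Return `steps` evenly spaced points on the line from z_a to z_b, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 (z_a) to 1.0 (z_b)
        path.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path


def perturb(z, scale, rng):
    """Add independent Gaussian noise with standard deviation `scale` to each coordinate."""
    return [x + rng.gauss(0.0, scale) for x in z]


# Hypothetical latent codes standing in for two encoded molecules.
z_a = [0.0, 1.0, -1.0]
z_b = [1.0, 0.0, 1.0]

path = interpolate(z_a, z_b, steps=5)                       # points between the two codes
neighbor = perturb(z_a, scale=0.1, rng=random.Random(0))    # a point near z_a
```

Linear interpolation is chosen here only because it is the simplest option; other schemes, such as spherical interpolation, fit the same interface.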

Authors:
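Gómez-Bombarelli, Rafael [1]; Wei, Jennifer N. [2]; Duvenaud, David [3]; Hernández-Lobato, José Miguel [4]; Sánchez-Lengeling, Benjamín [2]; Sheberla, Dennis [2]; Aguilera-Iparraguirre, Jorge [1]; Hirzel, Timothy D. [1]; Adams, Ryan P. [5]; Aspuru-Guzik, Alán [6]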
  1. Kyulux North America Inc., 10 Post Office Square, Suite 800, Boston, Massachusetts 02109, United States
  2. Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, United States
  3. Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ontario M5S 3H5, Canada
  4. Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, U.K.
  5. Google Brain, Mountain View, California, United States; Princeton University, Princeton, New Jersey, United States
  6. Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, United States; Biologically-Inspired Solar Energy Program, Canadian Institute for Advanced Research (CIFAR), Toronto, Ontario M5S 1M1, Canada
Publication Date:
Research Org.:
Harvard Univ., Cambridge, MA (United States); Univ. of Toronto, ON (Canada)
Sponsoring Org.:
USDOE Office of Science (SC), Basic Energy Sciences (BES) (SC-22)
OSTI Identifier:
1416858
Alternate Identifier(s):
OSTI ID: 1498675
Grant/Contract Number:  
SC0015959
Resource Type:
Published Article
Journal Name:
ACS Central Science
Additional Journal Information:
Journal Name: ACS Central Science; Journal Volume: 4; Journal Issue: 2; Journal ID: ISSN 2374-7943
Publisher:
American Chemical Society
Country of Publication:
United States
Language:
English
Subject:
37 INORGANIC, ORGANIC, PHYSICAL, AND ANALYTICAL CHEMISTRY

Citation Formats

Gómez-Bombarelli, Rafael, Wei, Jennifer N., Duvenaud, David, Hernández-Lobato, José Miguel, Sánchez-Lengeling, Benjamín, Sheberla, Dennis, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., and Aspuru-Guzik, Alán. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. United States: N. p., 2018. Web. doi:10.1021/acscentsci.7b00572.
Gómez-Bombarelli, Rafael, Wei, Jennifer N., Duvenaud, David, Hernández-Lobato, José Miguel, Sánchez-Lengeling, Benjamín, Sheberla, Dennis, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., & Aspuru-Guzik, Alán. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. United States. doi:10.1021/acscentsci.7b00572.
Gómez-Bombarelli, Rafael, Wei, Jennifer N., Duvenaud, David, Hernández-Lobato, José Miguel, Sánchez-Lengeling, Benjamín, Sheberla, Dennis, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., and Aspuru-Guzik, Alán. Fri . "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules". United States. doi:10.1021/acscentsci.7b00572.
@article{osti_1416858,
title = {Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules},
author = {Gómez-Bombarelli, Rafael and Wei, Jennifer N. and Duvenaud, David and Hernández-Lobato, José Miguel and Sánchez-Lengeling, Benjamín and Sheberla, Dennis and Aguilera-Iparraguirre, Jorge and Hirzel, Timothy D. and Adams, Ryan P. and Aspuru-Guzik, Alán},
abstractNote = {We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.},
doi = {10.1021/acscentsci.7b00572},
journal = {ACS Central Science},
number = 2,
volume = 4,
place = {United States},
year = {2018},
month = {1}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
DOI: 10.1021/acscentsci.7b00572

Citation Metrics:
Cited by: 85 works
Citation information provided by
Web of Science

Figures / Tables:

Figure 1: (a) A diagram of the autoencoder used for molecular design, including the joint property prediction model. Starting from a discrete molecular representation, such as a SMILES string, the encoder network converts each molecule into a vector in the latent space, which is effectively a continuous molecular representation. Given a point in the latent space, the decoder network produces a corresponding SMILES string. A multilayer perceptron network estimates the value of target properties associated with each molecule. (b) Gradient-based optimization in continuous latent space. After training a surrogate model f(z) to predict the properties of molecules based on their latent representation z, we can optimize f(z) with respect to z to find new latent representations expected to have high values of desired properties. These new latent representations can then be decoded into SMILES strings, at which point their properties can be tested empirically.
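
The optimization step in panel (b) can be sketched as plain gradient ascent on f(z). This is only an illustration under stated assumptions: the quadratic `surrogate` below is a toy stand-in for a trained property predictor, and its analytic gradient, the `optimum` vector, the learning rate, and all function names are hypothetical rather than taken from the article. A real workflow would differentiate the trained surrogate and then decode the optimized z into a SMILES string.

```python
def surrogate(z, optimum):
    """Toy stand-in for a trained predictor f(z); its maximum is at `optimum`."""
    return -sum((zi - oi) ** 2 for zi, oi in zip(z, optimum))


def surrogate_grad(z, optimum):
    """Analytic gradient of the toy surrogate with respect to z."""
    return [-2.0 * (zi - oi) for zi, oi in zip(z, optimum)]


def gradient_ascent(z0, optimum, learning_rate=0.1, steps=100):
    """Repeatedly step z along the gradient so that f(z) increases."""
    z = list(z0)
    for _ in range(steps):
        grad = surrogate_grad(z, optimum)
        z = [zi + learning_rate * gi for zi, gi in zip(z, grad)]
    return z


# Hypothetical 2-dimensional latent space and starting point.
optimum = [0.5, -0.25]
z_start = [2.0, 2.0]
z_best = gradient_ascent(z_start, optimum)
```

With this toy surrogate each step shrinks the distance to the maximum by a constant factor, so `z_best` ends up very close to `optimum`. A learned f(z) offers no such guarantee and may have many local maxima.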

Works referencing / citing this record:

Conditional deep surrogate models for stochastic, high-dimensional, and multi-fidelity systems
journal, May 2019


Machine-learning-assisted discovery of polymers with high thermal conductivity using a molecular design algorithm
journal, June 2019


Accelerating the discovery of materials for clean energy in the era of smart automation
journal, April 2018

  • Tabor, Daniel P.; Roch, Loïc M.; Saikin, Semion K.
  • Nature Reviews Materials, Vol. 3, Issue 5
  • DOI: 10.1038/s41578-018-0005-z

Controlling an organic synthesis robot with machine learning to search for new reactivity
journal, July 2018


Extensive deep neural networks for transferring small scale learning to large scale systems
journal, January 2019

  • Mills, Kyle; Ryczko, Kevin; Luchak, Iryna
  • Chemical Science, Vol. 10, Issue 15
  • DOI: 10.1039/c8sc04578j

Multi-channel PINN: investigating scalable and transferable neural networks for drug discovery
journal, July 2019


Exploring differential evolution for inverse QSAR analysis
journal, January 2017


Challenges and opportunities of polymer design with machine learning and high throughput experimentation
journal, May 2019

  • Kumar, Jatin N.; Li, Qianxiao; Jun, Ye
  • MRS Communications, Vol. 9, Issue 02
  • DOI: 10.1557/mrc.2019.54

Efficient multi-objective molecular optimization in a continuous latent space
journal, January 2019

  • Winter, Robin; Montanari, Floriane; Steffen, Andreas
  • Chemical Science, Vol. 10, Issue 34
  • DOI: 10.1039/c9sc01928f

KekuleScope: prediction of cancer cell line sensitivity and compound potency using convolutional neural networks trained on compound images
journal, June 2019


A de novo molecular generation method using latent vector based generative adversarial network
journal, December 2019

  • Prykhodko, Oleksii; Johansson, Simon Viet; Kotsias, Panagiotis-Christos
  • Journal of Cheminformatics, Vol. 11, Issue 1
  • DOI: 10.1186/s13321-019-0397-9

Transformer-CNN: Swiss knife for QSAR modeling and interpretation
journal, March 2020


In silico Strategies to Support Fragment-to-Lead Optimization in Drug Discovery
journal, February 2020

  • de Souza Neto, Lauro Ribeiro; Moreira-Filho, José Teófilo; Neves, Bruno Junior
  • Frontiers in Chemistry, Vol. 8
  • DOI: 10.3389/fchem.2020.00093

    Figures/Tables have been extracted from DOE-funded journal article accepted manuscripts.