OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules

Abstract

We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.
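
The three latent-space operations named in the abstract (decoding random vectors, perturbing a known structure, and interpolating between molecules) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the 3-dimensional vectors, the standard-normal sampling, and the noise scale are hypothetical, and the trained encoder and decoder are not reproduced here, so the code only produces latent points that a decoder would then turn into SMILES strings.

```python
import random


def interpolate(z_a, z_b, steps):
    """Return `steps` evenly spaced latent points from z_a to z_b (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        [a + (b - a) * t / (steps - 1) for a, b in zip(z_a, z_b)]
        for t in range(steps)
    ]


def perturb(z, scale, rng):
    """Add independent Gaussian noise with standard deviation `scale` to z."""
    return [x + rng.gauss(0.0, scale) for x in z]


def sample_latent(dim, rng):
    """Draw a random latent vector (standard normal is assumed here)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# Hypothetical latent vectors standing in for two encoded molecules.
z_a = [0.0, 1.0, -1.0]
z_b = [1.0, 0.0, 1.0]
rng = random.Random(0)

path = interpolate(z_a, z_b, steps=5)        # points between the two molecules
neighbor = perturb(z_a, scale=0.1, rng=rng)  # a point near a known molecule
random_point = sample_latent(3, rng)         # a point to decode from scratch
```

In a full pipeline, each point in `path`, `neighbor`, and `random_point` would be passed to the decoder, and the resulting strings checked for chemical validity.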

Authors:
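Gómez-Bombarelli, Rafael [1]; Wei, Jennifer N. [2]; Duvenaud, David [3]; Hernández-Lobato, José Miguel [4]; Sánchez-Lengeling, Benjamín [2]; Sheberla, Dennis [2]; Aguilera-Iparraguirre, Jorge [1]; Hirzel, Timothy D. [1]; Adams, Ryan P. [5]; Aspuru-Guzik, Alán [6]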
  1. Kyulux North America Inc., Boston, MA (United States)
  2. Harvard Univ., Cambridge, MA (United States)
  3. Univ. of Toronto, ON (Canada)
  4. Univ. of Cambridge (United Kingdom)
  5. Google Brain, Mountain View, CA (United States); Princeton Univ., NJ (United States)
  6. Harvard Univ., Cambridge, MA (United States); Canadian Institute for Advanced Research (CIFAR), Toronto, ON (Canada)
Publication Date:
Research Org.:
Harvard Univ., Cambridge, MA (United States); Univ. of Toronto, ON (Canada)
Sponsoring Org.:
USDOE Office of Science (SC), Basic Energy Sciences (BES) (SC-22)
OSTI Identifier:
1416858
Alternate Identifier(s):
OSTI ID: 1498675
Grant/Contract Number:  
SC0015959
Resource Type:
Journal Article: Published Article
Journal Name:
ACS Central Science
Additional Journal Information:
Journal Volume: 4; Journal Issue: 2; Journal ID: ISSN 2374-7943
Publisher:
American Chemical Society (ACS)
Country of Publication:
United States
Language:
English
Subject:
37 INORGANIC, ORGANIC, PHYSICAL, AND ANALYTICAL CHEMISTRY

Citation Formats

Gómez-Bombarelli, Rafael, Wei, Jennifer N., Duvenaud, David, Hernández-Lobato, José Miguel, Sánchez-Lengeling, Benjamín, Sheberla, Dennis, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., and Aspuru-Guzik, Alán. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. United States: N. p., 2018. Web. doi:10.1021/acscentsci.7b00572.
Gómez-Bombarelli, Rafael, Wei, Jennifer N., Duvenaud, David, Hernández-Lobato, José Miguel, Sánchez-Lengeling, Benjamín, Sheberla, Dennis, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., & Aspuru-Guzik, Alán. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. United States. doi:10.1021/acscentsci.7b00572.
Gómez-Bombarelli, Rafael, Wei, Jennifer N., Duvenaud, David, Hernández-Lobato, José Miguel, Sánchez-Lengeling, Benjamín, Sheberla, Dennis, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., and Aspuru-Guzik, Alán. Fri . "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules". United States. doi:10.1021/acscentsci.7b00572.
@article{osti_1416858,
title = {Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules},
author = {Gómez-Bombarelli, Rafael and Wei, Jennifer N. and Duvenaud, David and Hernández-Lobato, José Miguel and Sánchez-Lengeling, Benjamín and Sheberla, Dennis and Aguilera-Iparraguirre, Jorge and Hirzel, Timothy D. and Adams, Ryan P. and Aspuru-Guzik, Alán},
abstractNote = {We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.},
doi = {10.1021/acscentsci.7b00572},
journal = {ACS Central Science},
issn = {2374-7943},
number = 2,
volume = 4,
place = {United States},
year = {2018},
month = {1}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record at 10.1021/acscentsci.7b00572

Citation Metrics:
Cited by: 85 works
Citation information provided by
Web of Science

Figures / Tables:

Figure 1: (a) A diagram of the autoencoder used for molecular design, including the joint property prediction model. Starting from a discrete molecular representation, such as a SMILES string, the encoder network converts each molecule into a vector in the latent space, which is effectively a continuous molecular representation. Given a point in the latent space, the decoder network produces a corresponding SMILES string. A multilayer perceptron network estimates the value of target properties associated with each molecule. (b) Gradient-based optimization in continuous latent space. After training a surrogate model f(z) to predict the properties of molecules based on their latent representation z, we can optimize f(z) with respect to z to find new latent representations expected to have high values of desired properties. These new latent representations can then be decoded into SMILES strings, at which point their properties can be tested empirically.
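
The optimization described in panel (b) can be sketched as plain gradient ascent on a surrogate f(z). The sketch below is an illustration under stated assumptions, not the authors' implementation: the quadratic surrogate, its `target` optimum, the learning rate, and the step count are all hypothetical stand-ins for a trained property model.

```python
def surrogate(z, target):
    """Toy property model f(z) = -||z - target||^2 (assumed for illustration)."""
    return -sum((x - t) ** 2 for x, t in zip(z, target))


def surrogate_grad(z, target):
    """Analytic gradient of the toy surrogate with respect to z."""
    return [-2.0 * (x - t) for x, t in zip(z, target)]


def gradient_ascent(z0, target, lr=0.1, steps=100):
    """Move z uphill on f(z) by repeated steps along the gradient."""
    z = list(z0)
    for _ in range(steps):
        g = surrogate_grad(z, target)
        z = [x + lr * gi for x, gi in zip(z, g)]
    return z


# Hypothetical 2-dimensional latent space with a known optimum.
target = [1.0, -2.0]
z_start = [0.0, 0.0]
z_opt = gradient_ascent(z_start, target)
```

With a learned surrogate, the gradient would typically come from automatic differentiation rather than a closed form, and `z_opt` would be decoded into a SMILES string whose properties are then tested empirically, as the caption describes.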

Works referencing / citing this record:

Conditional deep surrogate models for stochastic, high-dimensional, and multi-fidelity systems
journal, May 2019


Machine-learning-assisted discovery of polymers with high thermal conductivity using a molecular design algorithm
journal, June 2019


Accelerating the discovery of materials for clean energy in the era of smart automation
journal, April 2018

  • Tabor, Daniel P.; Roch, Loïc M.; Saikin, Semion K.
  • Nature Reviews Materials, Vol. 3, Issue 5
  • DOI: 10.1038/s41578-018-0005-z

Controlling an organic synthesis robot with machine learning to search for new reactivity
journal, July 2018


Extensive deep neural networks for transferring small scale learning to large scale systems
journal, January 2019

  • Mills, Kyle; Ryczko, Kevin; Luchak, Iryna
  • Chemical Science, Vol. 10, Issue 15
  • DOI: 10.1039/c8sc04578j

Multi-channel PINN: investigating scalable and transferable neural networks for drug discovery
journal, July 2019


Exploring differential evolution for inverse QSAR analysis
journal, January 2017


Challenges and opportunities of polymer design with machine learning and high throughput experimentation
journal, May 2019

  • Kumar, Jatin N.; Li, Qianxiao; Jun, Ye
  • MRS Communications, Vol. 9, Issue 02
  • DOI: 10.1557/mrc.2019.54

    Figures/Tables have been extracted from DOE-funded journal article accepted manuscripts.