DOE PAGES: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules

Abstract

We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.
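
The latent-space operations named in the abstract (perturbing a known structure, interpolating between two molecules) can be illustrated with a minimal sketch. It assumes latent vectors are plain lists of floats and uses straight-line interpolation with Gaussian noise; the function names `interpolate` and `perturb` are hypothetical, the record does not say which interpolation scheme the authors used, and the encoder and decoder networks themselves are not shown.

```python
import random


def interpolate(z_a, z_b, steps):
    """Return `steps` evenly spaced points on the line from z_a to z_b, inclusive.

    Linear interpolation is an assumption made for illustration only.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        [a + (b - a) * i / (steps - 1) for a, b in zip(z_a, z_b)]
        for i in range(steps)
    ]


def perturb(z, scale, rng):
    """Add independent Gaussian noise with standard deviation `scale` to each coordinate."""
    return [x + rng.gauss(0.0, scale) for x in z]


# Hypothetical 2-D latent vectors standing in for two encoded molecules.
z_start = [0.0, 0.0]
z_end = [1.0, 2.0]

path = interpolate(z_start, z_end, steps=3)
neighbor = perturb(z_start, scale=0.1, rng=random.Random(0))
```

In a full pipeline each point in `path` and `neighbor` would be passed to the decoder to obtain a candidate molecule; that step depends on the trained model and is omitted here.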

Authors:
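Gómez-Bombarelli, Rafael [1]; Wei, Jennifer N. [2]; Duvenaud, David [3]; Hernández-Lobato, José Miguel [4]; Sánchez-Lengeling, Benjamín [2]; Sheberla, Dennis [2]; Aguilera-Iparraguirre, Jorge [1]; Hirzel, Timothy D. [1]; Adams, Ryan P. [5]; Aspuru-Guzik, Alán [6]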
  1. Kyulux North America Inc., 10 Post Office Square, Suite 800, Boston, Massachusetts 02109, United States
  2. Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, United States
  3. Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ontario M5S 3H5, Canada
  4. Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, U.K.
  5. Google Brain, Mountain View, California, United States, Princeton University, Princeton, New Jersey, United States
  6. Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, United States, Biologically-Inspired Solar Energy Program, Canadian Institute for Advanced Research (CIFAR), Toronto, Ontario M5S 1M1, Canada
Publication Date:
2018-01-12
Research Org.:
Harvard Univ., Cambridge, MA (United States); Univ. of Toronto, ON (Canada)
Sponsoring Org.:
USDOE Office of Science (SC), Basic Energy Sciences (BES)
OSTI Identifier:
1416858
Alternate Identifier(s):
OSTI ID: 1498675
Grant/Contract Number:  
SC0015959
Resource Type:
Published Article
Journal Name:
ACS Central Science
Additional Journal Information:
Journal Name: ACS Central Science; Journal Volume: 4; Journal Issue: 2; Journal ID: ISSN 2374-7943
Publisher:
American Chemical Society
Country of Publication:
United States
Language:
English
Subject:
37 INORGANIC, ORGANIC, PHYSICAL, AND ANALYTICAL CHEMISTRY

Citation Formats

Gómez-Bombarelli, Rafael, Wei, Jennifer N., Duvenaud, David, Hernández-Lobato, José Miguel, Sánchez-Lengeling, Benjamín, Sheberla, Dennis, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., and Aspuru-Guzik, Alán. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. United States: N. p., 2018. Web. doi:10.1021/acscentsci.7b00572.
Gómez-Bombarelli, Rafael, Wei, Jennifer N., Duvenaud, David, Hernández-Lobato, José Miguel, Sánchez-Lengeling, Benjamín, Sheberla, Dennis, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., & Aspuru-Guzik, Alán. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. United States. https://doi.org/10.1021/acscentsci.7b00572
Gómez-Bombarelli, Rafael, Wei, Jennifer N., Duvenaud, David, Hernández-Lobato, José Miguel, Sánchez-Lengeling, Benjamín, Sheberla, Dennis, Aguilera-Iparraguirre, Jorge, Hirzel, Timothy D., Adams, Ryan P., and Aspuru-Guzik, Alán. 2018. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules". United States. https://doi.org/10.1021/acscentsci.7b00572.
@article{osti_1416858,
title = {Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules},
author = {Gómez-Bombarelli, Rafael and Wei, Jennifer N. and Duvenaud, David and Hernández-Lobato, José Miguel and Sánchez-Lengeling, Benjamín and Sheberla, Dennis and Aguilera-Iparraguirre, Jorge and Hirzel, Timothy D. and Adams, Ryan P. and Aspuru-Guzik, Alán},
abstractNote = {We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.},
doi = {10.1021/acscentsci.7b00572},
journal = {ACS Central Science},
number = 2,
volume = 4,
place = {United States},
year = {2018},
month = {1}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
https://doi.org/10.1021/acscentsci.7b00572

Citation Metrics:
Cited by: 1726 works
Citation information provided by
Web of Science

Figures / Tables:

Figure 1: (a) A diagram of the autoencoder used for molecular design, including the joint property prediction model. Starting from a discrete molecular representation, such as a SMILES string, the encoder network converts each molecule into a vector in the latent space, which is effectively a continuous molecular representation. Given a point in the latent space, the decoder network produces a corresponding SMILES string. A multilayer perceptron network estimates the value of target properties associated with each molecule. (b) Gradient-based optimization in continuous latent space. After training a surrogate model f(z) to predict the properties of molecules based on their latent representation z, we can optimize f(z) with respect to z to find new latent representations expected to have high values of desired properties. These new latent representations can then be decoded into SMILES strings, at which point their properties can be tested empirically.
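
The optimization loop described in panel (b) can be sketched as plain gradient ascent on a surrogate f(z). The example below is only an illustration under stated assumptions: a toy quadratic surrogate with a known maximum stands in for the trained property model, its gradient is written by hand, and the names `surrogate`, `surrogate_grad`, and `gradient_ascent` are hypothetical. Decoding the optimized vector back to a SMILES string requires the trained decoder and is not shown.

```python
def surrogate(z, target):
    """Toy surrogate f(z) = -||z - target||^2, maximized at z == target (an assumption)."""
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, target))


def surrogate_grad(z, target):
    """Analytic gradient of the toy surrogate with respect to z."""
    return [-2.0 * (zi - ti) for zi, ti in zip(z, target)]


def gradient_ascent(z0, grad_fn, step_size=0.1, n_steps=100):
    """Take `n_steps` fixed-size steps along the gradient, starting from z0."""
    z = list(z0)
    for _ in range(n_steps):
        g = grad_fn(z)
        z = [zi + step_size * gi for zi, gi in zip(z, g)]
    return z


# Hypothetical 2-D latent space; `best` marks where the toy surrogate peaks.
best = [1.0, -2.0]
z_init = [0.0, 0.0]
z_opt = gradient_ascent(z_init, lambda z: surrogate_grad(z, best))
```

With this step size each update shrinks the distance to the maximum by a constant factor, so `z_opt` ends up very close to `best`. A learned surrogate would generally be non-convex, and its gradient would typically come from automatic differentiation rather than a hand-written formula.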

Figures/Tables have been extracted from DOE-funded journal article accepted manuscripts.