DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Less is more: Sampling chemical space with active learning

Abstract

The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of data generation to train such ML potentials is a task neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble’s prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach, we develop the COmprehensive Machine-learning Potential (COMP6) benchmark (publicly available on GitHub), which contains a diverse set of organic molecules. Active learning-based ANI potentials outperform the original random sampled ANI-1 potential with only 10% of the data, while the final active learning-based model vastly outperforms ANI-1 on the COMP6 benchmark after training to only 25% of the data. Finally, we show that our proposed AL technique develops a universal ANI potential (ANI-1x) that provides accurate energy and force predictions on the entire COMP6 benchmark. This universal ML potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials, while remaining applicable to the general class of organic molecules composed of the elements CHNO.
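The Query by Committee step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy models, the function names, and the choice of disagreement measure (the standard deviation of the committee's energy predictions, scaled by the square root of the atom count) are assumptions made for this example only.

```python
import math
import statistics
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in: a "potential" maps a structure (here just a list of
# floats) to a predicted energy in kcal/mol.
Potential = Callable[[Sequence[float]], float]


def make_toy_model(k: float) -> Potential:
    """Toy committee member; members differ only in the cubic coefficient k."""
    return lambda s: sum(x * x for x in s) + k * sum(x ** 3 for x in s)


def disagreement(ensemble: List[Potential], structure: Sequence[float],
                 n_atoms: int) -> float:
    """Committee disagreement (assumed form): std. dev. / sqrt(n_atoms)."""
    energies = [model(structure) for model in ensemble]
    return statistics.pstdev(energies) / math.sqrt(n_atoms)


def select_for_labeling(ensemble: List[Potential],
                        candidates: List[Tuple[Sequence[float], int]],
                        rho_hat: float) -> List[int]:
    """Indices of candidates whose disagreement exceeds the threshold rho_hat."""
    return [i for i, (structure, n_atoms) in enumerate(candidates)
            if disagreement(ensemble, structure, n_atoms) > rho_hat]


if __name__ == "__main__":
    committee = [make_toy_model(k) for k in (-0.1, 0.0, 0.1)]
    # Each candidate is (structure, atom count); the second lies where the
    # toy models diverge, so only it is flagged for new reference data.
    pool = [([0.1, 0.1], 2), ([2.0, 2.0], 2)]
    print(select_for_labeling(committee, pool, rho_hat=0.23))
```

In an active learning loop, the selected structures would be labeled with reference calculations, added to the training set, and the committee retrained before the next round of selection.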

Authors:
Smith, Justin Steven [1]; Nebgen, Benjamin Tyler [2]; Lubbers, Nicholas Edward [2]; Isayev, Olexandr [3]; Roitberg, Adrian E. [1]
  1. Univ. of Florida, Gainesville, FL (United States)
  2. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
  3. Univ. of North Carolina, Chapel Hill, NC (United States)
Publication Date:
May 2018
Research Org.:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1479911
Alternate Identifier(s):
OSTI ID: 1438295
Report Number(s):
LA-UR-18-30171
Journal ID: ISSN 0021-9606
Grant/Contract Number:  
AC52-06NA25396
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Chemical Physics
Additional Journal Information:
Journal Volume: 148; Journal Issue: 24; Journal ID: ISSN 0021-9606
Publisher:
American Institute of Physics (AIP)
Country of Publication:
United States
Language:
English
Subject:
37 INORGANIC, ORGANIC, PHYSICAL, AND ANALYTICAL CHEMISTRY; Material Science

Citation Formats

Smith, Justin Steven, Nebgen, Benjamin Tyler, Lubbers, Nicholas Edward, Isayev, Olexandr, and Roitberg, Adrian E. Less is more: Sampling chemical space with active learning. United States: N. p., 2018. Web. https://doi.org/10.1063/1.5023802.
Smith, Justin Steven, Nebgen, Benjamin Tyler, Lubbers, Nicholas Edward, Isayev, Olexandr, & Roitberg, Adrian E. Less is more: Sampling chemical space with active learning. United States. https://doi.org/10.1063/1.5023802
Smith, Justin Steven, Nebgen, Benjamin Tyler, Lubbers, Nicholas Edward, Isayev, Olexandr, and Roitberg, Adrian E. 2018. "Less is more: Sampling chemical space with active learning". United States. https://doi.org/10.1063/1.5023802. https://www.osti.gov/servlets/purl/1479911.
@article{osti_1479911,
title = {Less is more: Sampling chemical space with active learning},
author = {Smith, Justin Steven and Nebgen, Benjamin Tyler and Lubbers, Nicholas Edward and Isayev, Olexandr and Roitberg, Adrian E},
abstractNote = {The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of data generation to train such ML potentials is a task neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble’s prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach, we develop the COmprehensive Machine-learning Potential (COMP6) benchmark (publicly available on GitHub), which contains a diverse set of organic molecules. Active learning-based ANI potentials outperform the original random sampled ANI-1 potential with only 10% of the data, while the final active learning-based model vastly outperforms ANI-1 on the COMP6 benchmark after training to only 25% of the data. Finally, we show that our proposed AL technique develops a universal ANI potential (ANI-1x) that provides accurate energy and force predictions on the entire COMP6 benchmark. This universal ML potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials, while remaining applicable to the general class of organic molecules composed of the elements CHNO.},
doi = {10.1063/1.5023802},
journal = {Journal of Chemical Physics},
number = 24,
volume = 148,
place = {United States},
year = {2018},
month = {5}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 19 works
Citation information provided by
Web of Science

Figures / Tables:

Figure 1: Example of choosing a value $\hat{ρ}$ which captures 98% of all errors ($ε$) over 1.5 kcal/mol on the GDB07to09 benchmark set using the initial (before using active learning) ANI model ensemble. The value which accomplishes this is found to be $\hat{ρ}$ = 0.23. Using this value of $\hat{ρ}$ in query by committee results in the selection of 58% of all test data. Initially, 26% of all $ε$ are greater than 1.5 kcal/mol; among the data with $ρ$ > $\hat{ρ}$, 44% of $ε$ are greater than 1.5 kcal/mol. Splitting the dataset along $ρ$ = $\hat{ρ}$ results in a total energy RMSE of the ANI ensemble prediction vs. reference DFT of 7.4 kcal/mol for all values $ρ$ > $\hat{ρ}$ and 1.5 kcal/mol for all values $ρ$ ≤ $\hat{ρ}$.
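The threshold choice described in the caption can be sketched as a simple scan. This is a hedged illustration rather than the procedure used in the paper: the function names, the grid step of 0.01, and the toy data are assumptions. The scan returns the largest grid value of the threshold for which at least 98% of the samples with error above 1.5 kcal/mol still satisfy ρ > ρ̂.

```python
from typing import Sequence


def choose_rho_hat(rhos: Sequence[float], errors: Sequence[float],
                   error_cutoff: float = 1.5, capture: float = 0.98,
                   step: float = 0.01) -> float:
    """Largest grid threshold that still flags `capture` of the large errors.

    rhos   : committee disagreement per test sample
    errors : absolute error vs. reference per test sample (kcal/mol)
    """
    large = [r for r, e in zip(rhos, errors) if e > error_cutoff]
    if not large:
        raise ValueError("no errors above the cutoff")
    best = 0.0
    n_steps = int(max(large) / step) + 1
    for k in range(n_steps + 1):
        rho_hat = round(k * step, 10)  # avoid float drift on the grid
        captured = sum(r > rho_hat for r in large) / len(large)
        if captured < capture:
            break  # captured fraction only decreases as rho_hat grows
        best = rho_hat
    return best


def fraction_selected(rhos: Sequence[float], rho_hat: float) -> float:
    """Share of all samples that query by committee would select."""
    return sum(r > rho_hat for r in rhos) / len(rhos)


if __name__ == "__main__":
    # Toy data (not from the paper): disagreement and error per sample.
    toy_rhos = [0.05, 0.10, 0.30, 0.40, 0.50]
    toy_errors = [0.2, 0.5, 2.0, 3.0, 4.0]
    rho_hat = choose_rho_hat(toy_rhos, toy_errors)
    print(rho_hat, fraction_selected(toy_rhos, rho_hat))
```

A lower threshold captures more of the large errors but also selects more data for relabeling, which is the trade-off the caption quantifies (98% capture at the cost of selecting 58% of the test data).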




      Figures/Tables have been extracted from DOE-funded journal article accepted manuscripts.