OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Less is more: Sampling chemical space with active learning

Abstract

The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of generating data to train such ML potentials is neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach, we develop the COmprehensive Machine-learning Potential (COMP6) benchmark (publicly available on GitHub), which contains a diverse set of organic molecules. Active learning-based ANI potentials outperform the original randomly sampled ANI-1 potential with only 10% of the data, while the final active learning-based model vastly outperforms ANI-1 on the COMP6 benchmark after training to only 25% of the data. Finally, we show that our proposed AL technique develops a universal ANI potential (ANI-1x) that provides accurate energy and force predictions on the entire COMP6 benchmark. This universal ML potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials, while remaining applicable to the general class of organic molecules composed of the elements CHNO.
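The query-by-committee step described in the abstract can be summarized in a few lines: an ensemble of ANI models predicts the energy of each candidate conformation, and the conformations where the ensemble disagrees most strongly are flagged for new reference calculations. The sketch below is a minimal illustration of that selection step, assuming the disagreement measure ρ is the ensemble standard deviation normalized by the square root of the number of atoms and that ρ̂ = 0.23 is used as the threshold (see Figure 1); the function names and data are hypothetical, not the authors' implementation.

import numpy as np

def qbc_disagreement(ensemble_energies, n_atoms):
    # Disagreement rho for each conformation: standard deviation of the
    # ensemble's energy predictions, normalized by sqrt(number of atoms)
    # so molecules of different sizes are comparable (assumed form).
    sigma = np.std(ensemble_energies, axis=0)
    return sigma / np.sqrt(n_atoms)

def select_for_labeling(ensemble_energies, n_atoms, rho_hat=0.23):
    # Indices of conformations whose disagreement exceeds rho_hat; these
    # would be sent for new reference (DFT) calculations and added to the
    # training set in the next active learning iteration.
    rho = qbc_disagreement(ensemble_energies, n_atoms)
    return np.where(rho > rho_hat)[0]

# Hypothetical usage: an 8-member ensemble scoring 1000 candidate
# conformations of a 20-atom molecule (energies in kcal/mol, fake data).
energies = np.random.normal(0.0, 1.0, size=(8, 1000))
selected = select_for_labeling(energies, n_atoms=20)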

Authors:
Smith, Justin Steven [1]; Nebgen, Benjamin Tyler [2]; Lubbers, Nicholas Edward [2]; Isayev, Olexandr [3]; Roitberg, Adrian E. [1]
  1. Univ. of Florida, Gainesville, FL (United States)
  2. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
  3. Univ. of North Carolina, Chapel Hill, NC (United States)
Publication Date:
2018-05-22
Research Org.:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1479911
Alternate Identifier(s):
OSTI ID: 1438295
Report Number(s):
LA-UR-18-30171
Journal ID: ISSN 0021-9606
Grant/Contract Number:  
AC52-06NA25396
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Journal of Chemical Physics
Additional Journal Information:
Journal Volume: 148; Journal Issue: 24; Journal ID: ISSN 0021-9606
Publisher:
American Institute of Physics (AIP)
Country of Publication:
United States
Language:
English
Subject:
37 INORGANIC, ORGANIC, PHYSICAL, AND ANALYTICAL CHEMISTRY; Materials Science

Citation Formats

Smith, Justin Steven, Nebgen, Benjamin Tyler, Lubbers, Nicholas Edward, Isayev, Olexandr, and Roitberg, Adrian E. Less is more: Sampling chemical space with active learning. United States: N. p., 2018. Web. doi:10.1063/1.5023802.
Smith, Justin Steven, Nebgen, Benjamin Tyler, Lubbers, Nicholas Edward, Isayev, Olexandr, & Roitberg, Adrian E. Less is more: Sampling chemical space with active learning. United States. https://doi.org/10.1063/1.5023802
Smith, Justin Steven, Nebgen, Benjamin Tyler, Lubbers, Nicholas Edward, Isayev, Olexandr, and Roitberg, Adrian E. 2018. "Less is more: Sampling chemical space with active learning". United States. https://doi.org/10.1063/1.5023802. https://www.osti.gov/servlets/purl/1479911.
@article{osti_1479911,
title = {Less is more: Sampling chemical space with active learning},
author = {Smith, Justin Steven and Nebgen, Benjamin Tyler and Lubbers, Nicholas Edward and Isayev, Olexandr and Roitberg, Adrian E},
abstractNote = {The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of generating data to train such ML potentials is neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach, we develop the COmprehensive Machine-learning Potential (COMP6) benchmark (publicly available on GitHub), which contains a diverse set of organic molecules. Active learning-based ANI potentials outperform the original randomly sampled ANI-1 potential with only 10% of the data, while the final active learning-based model vastly outperforms ANI-1 on the COMP6 benchmark after training to only 25% of the data. Finally, we show that our proposed AL technique develops a universal ANI potential (ANI-1x) that provides accurate energy and force predictions on the entire COMP6 benchmark. This universal ML potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials, while remaining applicable to the general class of organic molecules composed of the elements CHNO.},
doi = {10.1063/1.5023802},
url = {https://www.osti.gov/biblio/1479911}, journal = {Journal of Chemical Physics},
issn = {0021-9606},
number = 24,
volume = 148,
place = {United States},
year = {2018},
month = {5}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 286 works
Citation information provided by Web of Science

Figures / Tables:

Figure 1: Example of choosing a value $$\hat{\rho}$$ which captures 98% of all errors ($$\varepsilon$$) over 1.5 kcal/mol on the GDB07to09 benchmark set using the initial (before active learning) ANI model ensemble. The value which accomplishes this is found to be $$\hat{\rho} = 0.23$$. Using this value of $$\hat{\rho}$$ in query by committee results in the selection of 58% of all test data. Initially, 26% of all $$\varepsilon$$ are greater than 1.5 kcal/mol, while 44% of the $$\varepsilon$$ corresponding to $$\rho > \hat{\rho}$$ are greater than 1.5 kcal/mol. Splitting the dataset along $$\rho = \hat{\rho}$$ results in a total energy RMSE of the ANI ensemble prediction vs. reference DFT of 7.4 kcal/mol for all values $$\rho > \hat{\rho}$$ and 1.5 kcal/mol for all values $$\rho \le \hat{\rho}$$.
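As a rough illustration of how such a threshold could be chosen in practice, the sketch below sweeps candidate values of $$\hat{\rho}$$ and keeps the largest one for which at least 98% of the conformations with absolute error above 1.5 kcal/mol still satisfy $$\rho > \hat{\rho}$$. This is a minimal reconstruction from the caption, not the authors' exact procedure; the function names, search strategy, and synthetic data are assumptions.

import numpy as np

def choose_rho_hat(rho, eps, err_cut=1.5, capture=0.98):
    # Largest threshold rho_hat such that at least `capture` of the
    # conformations with error eps > err_cut still have rho > rho_hat.
    bad = eps > err_cut
    if not bad.any():
        return rho.max()          # no large errors: any threshold works
    best = None
    for t in np.sort(np.unique(rho)):
        if np.mean(rho[bad] > t) >= capture:
            best = t              # raising the threshold still captures enough
        else:
            break
    return best

# Hypothetical usage with synthetic data: disagreement values and absolute
# errors (kcal/mol) for a benchmark set such as GDB07to09.
rng = np.random.default_rng(0)
rho = rng.uniform(0.0, 1.0, size=5000)
eps = np.abs(rho * 6.0 + rng.normal(0.0, 0.5, size=5000))  # errors loosely tracking rho
rho_hat = choose_rho_hat(rho, eps)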


Works referenced in this record:

Neural Networks for the Prediction of Organic Chemistry Reactions
journal, October 2016


Material informatics driven design and experimental validation of lead titanate as an aqueous solar photocathode
journal, October 2016


The S66x8 benchmark for noncovalent interactions revisited: explicitly correlated ab initio methods and density functional theory
journal, January 2016


Protein–Ligand Scoring with Convolutional Neural Networks
journal, April 2017


Systematic optimization of long-range corrected hybrid density functionals
journal, February 2008


970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13
journal, July 2009


Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models
journal, September 2017


UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations
journal, December 1992


The EBI RDF platform: linked open data for the life sciences
journal, January 2014


CHARMM36 all-atom additive protein force field: Validation based on comparison to NMR data
journal, July 2013


A full coupled‐cluster singles and doubles model: The inclusion of disconnected triples
journal, February 1982


The ChEMBL bioactivity database: an update
journal, November 2013


Virtual Exploration of the Small-Molecule Chemical Universe below 160 Daltons
journal, February 2005


ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic molecules
journal, December 2017


Structure-based sampling and self-correcting machine learning for accurate calculations of potential energy surfaces and vibrational levels
journal, June 2017


Self‐Consistent Molecular‐Orbital Methods. IX. An Extended Gaussian‐Type Basis for Molecular‐Orbital Studies of Organic Molecules
journal, January 1971


Addressing uncertainty in atomistic machine learning
journal, January 2017


Comparison of multiple Amber force fields and development of improved protein backbone parameters
journal, November 2006


Assessment of the Performance of DFT and DFT-D Methods for Describing Distance Dependence of Hydrogen-Bonded Interactions
journal, December 2010


The open science grid
journal, July 2007


Intrinsic Bond Energies from a Bonds-in-Molecules Neural Network
journal, June 2017


Permutation invariant potential energy surfaces for polyatomic reactions using atomistic neural networks
journal, June 2016


Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning
journal, October 2017


Møller-Plesset perturbation theory: from small molecule methods to methods for thousands of atoms: Møller-Plesset perturbation theory
journal, May 2011


First Principles Neural Network Potentials for Reactive Simulations of Large Molecular and Condensed Systems
journal, August 2017


A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H-Pu
journal, April 2010


Hierarchical modeling of molecular energies using a deep neural network
journal, June 2018


GLYCAM06: A generalizable biomolecular force field. Carbohydrates: GLYCAM06
journal, September 2007


Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning
journal, January 2012


A structured approach
journal, February 2003


Active-learning strategies in computer-assisted drug discovery
journal, April 2015


Big Data Meets Quantum Chemistry Approximations: The Δ-Machine Learning Approach
journal, April 2015


ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from ff99SB
journal, July 2015


The atomic simulation environment—a Python library for working with atoms
journal, June 2017


Energy-free machine learning force field for aluminum
journal, August 2017


The TensorMol-0.1 model chemistry: a neural network augmented with long-range physics
journal, January 2018


Ab Initio Investigation of O–H Dissociation from the Al–OH 2 Complex Using Molecular Dynamics and Neural Network Fitting
journal, January 2016


Metadynamics for training neural network model chemistries: A competitive assessment
journal, June 2018


Calculation of properties with the coupled-cluster method
journal, January 1977


Digitization of multistep organic synthesis in reactionware for on-demand pharmaceuticals
journal, January 2018


ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost
journal, January 2017


The Automation of Science
journal, April 2009


Structure of aqueous NaOH solutions: insights from neural-network-based molecular dynamics simulations
journal, January 2017


Quantum-chemical insights from deep tensor neural networks
journal, January 2017


MyChEMBL: A Virtual Platform for Distributing Cheminformatics Tools and Open Data
journal, September 2014


Genetic Optimization of Training Sets for Improved Machine Learning Models of Molecular Properties
journal, March 2017


Pressure-induced phase transitions in silicon studied by neural network-based metadynamics simulations
journal, December 2008


Machine Learning Force Fields: Construction, Validation, and Outlook
journal, December 2016


DrugBank 4.0: shedding new light on drug metabolism
journal, November 2013


Machine-learning approaches in drug discovery: methods and applications
journal, March 2015


Machine-learning-assisted materials discovery using failed experiments
journal, May 2016


Universal fragment descriptors for predicting properties of inorganic crystals
journal, June 2017


Quantum chemistry structures and properties of 134 kilo molecules
journal, August 2014


Active learning of linearly parametrized interatomic potentials
journal, December 2017


Works referencing / citing this record:

Making machine learning a useful tool in the accelerated discovery of transition metal complexes
journal, July 2019


Machine learning and artificial neural network accelerated computational discoveries in materials science
journal, November 2019


Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design
journal, February 2019


Can machine learning identify the next high-temperature superconductor? Examining extrapolation performance for materials discovery
journal, January 2018


Machine learning enables long time scale molecular photodynamics simulations
journal, January 2019


A quantitative uncertainty metric controls error in neural network-driven chemical discovery
journal, January 2019


Guest Editorial: Special Topic on Data-Enabled Theoretical Chemistry
journal, June 2018


Ring polymer molecular dynamics and active learning of moment tensor potential for gas-phase barrierless reactions: Application to S + H 2
journal, December 2019


From DFT to machine learning: recent approaches to materials science–a review
journal, May 2019


Accessing thermal conductivity of complex compounds by machine learning interatomic potentials
journal, October 2019


Constructing convex energy landscapes for atomistic structure optimization
journal, December 2019


Active learning of uniformly accurate interatomic potentials for materials simulation
journal, February 2019


Machine learning and the physical sciences
journal, December 2019


Accurate and transferable multitask prediction of chemical properties with an atoms-in-molecules neural network
journal, August 2019


Machine learning enables long time scale molecular photodynamics simulations
text, January 2018


Gaussian Process-Based Refinement of Dispersion Corrections
journal, October 2019


Deep Learning for Deep Chemistry: Optimizing the Prediction of Chemical Patterns
journal, November 2019


Machine Learning of coarse-grained Molecular Dynamics Force Fields
preprint, January 2018


Molecular Dynamics with Neural-Network Potentials
preprint, January 2018


Dropout Strikes Back: Improved Uncertainty Estimation via Diversity Sampling
preprint, January 2020


Deep Learning in Protein Structural Modeling and Design
preprint, January 2020


The MLIP package: Moment Tensor Potentials with MPI and Active Learning
preprint, January 2020


Figures/Tables have been extracted from DOE-funded journal article accepted manuscripts.