DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Less is more: Sampling chemical space with active learning

Journal Article · Journal of Chemical Physics
DOI: https://doi.org/10.1063/1.5023802 · OSTI ID:1479911

The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of data generation to train such ML potentials is a task neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble’s prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach, we develop the COmprehensive Machine-learning Potential (COMP6) benchmark (publicly available on GitHub), which contains a diverse set of organic molecules. Active learning-based ANI potentials outperform the original randomly sampled ANI-1 potential with only 10% of the data, while the final active learning-based model vastly outperforms ANI-1 on the COMP6 benchmark after training on only 25% of the data. Finally, we show that our proposed AL technique develops a universal ANI potential (ANI-1x) that provides accurate energy and force predictions on the entire COMP6 benchmark. This universal ML potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials, while remaining applicable to the general class of organic molecules composed of the elements CHNO.
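The Query by Committee selection step described in the abstract can be illustrated with a minimal sketch. This is a toy example under stated assumptions, not the authors' ANI implementation: the committee members are simple linear functions standing in for trained ML potentials, the disagreement measure is the plain standard deviation of the committee's predictions, and the names `committee_disagreement`, `select_queries`, and the threshold value are hypothetical choices made for illustration only.

```python
import statistics
from typing import Callable, List, Sequence

# A "model" here is any callable mapping an input to a predicted energy.
# In practice each member would be a separately trained ML potential.
Model = Callable[[float], float]


def committee_disagreement(models: Sequence[Model], x: float) -> float:
    """Population standard deviation of the committee's predictions at x."""
    predictions = [model(x) for model in models]
    return statistics.pstdev(predictions)


def select_queries(
    models: Sequence[Model], candidates: Sequence[float], threshold: float
) -> List[float]:
    """Keep the candidates on which the committee disagrees by more than threshold."""
    return [x for x in candidates if committee_disagreement(models, x) > threshold]


# Toy committee (assumption): linear models with slightly different slopes,
# so they agree near x = 0 and diverge as |x| grows.
slopes = [0.9, 0.95, 1.0, 1.05, 1.1]
committee: List[Model] = [lambda x, s=s: s * x for s in slopes]

candidates = [0.0, 0.5, 1.0, 5.0, 10.0]
selected = select_queries(committee, candidates, threshold=0.2)
print(selected)  # candidates flagged as unreliable for the current committee
```

In an active learning loop, the selected candidates would be labeled with reference calculations, added to the training set, and the committee retrained before the next round of selection. Candidates on which the committee already agrees are skipped, which is how this kind of sampling can keep the training set small.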

Research Organization:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC52-06NA25396
OSTI ID:
1479911
Alternate ID(s):
OSTI ID: 1438295
Report Number(s):
LA-UR-18-30171
Journal Information:
Journal of Chemical Physics, Vol. 148, Issue 24; ISSN 0021-9606
Publisher:
American Institute of Physics (AIP)
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 349 works
Citation information provided by
Web of Science
