DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Evaluating uncertainty-based active learning for accelerating the generalization of molecular property prediction

Journal Article · Journal of Cheminformatics
 [1];  [1];  [1];  [1]
  1. Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)

Deep learning models have proven to be a powerful tool for predicting molecular properties in applications such as drug design and the development of energy storage materials. However, to learn accurate and robust structure–property mappings, these models require large amounts of data, which can be challenging to collect given the time- and resource-intensive nature of experimental material characterization. Additionally, such models fail to generalize to new types of molecular structures that were not included in the training data. Accelerating material development through uncertainty-guided experimental design promises to significantly reduce data requirements and enable faster generalization to new types of materials. To evaluate the potential of such approaches for electrolyte design applications, we perform a comprehensive evaluation of existing uncertainty quantification methods on the prediction of two relevant molecular properties: aqueous solubility and redox potential. We develop novel evaluation methods to probe the utility of the uncertainty estimates for both in-domain and out-of-domain data sets. Finally, we leverage selected uncertainty estimation methods for active learning to evaluate their capacity to support experimental design.
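To make the idea of uncertainty-guided data selection concrete, the following is a minimal sketch of one possible active learning loop. It is not the method evaluated in the article: it assumes a bootstrap ensemble of closed-form ridge regressors as a stand-in for a deep learning model, uses the spread of ensemble predictions as the uncertainty estimate, and treats each molecule as a generic numeric feature vector. All function names and parameters (`fit_ridge`, `ensemble_predict`, `select_most_uncertain`, `active_learning_loop`, `n_init`, `batch_size`, `n_rounds`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression; returns the weight vector."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def ensemble_predict(X_train, y_train, X_pool, n_models=10, rng=None):
    """Fit a bootstrap ensemble (assumed stand-in for a deep model) and
    return the per-point mean and standard deviation of its predictions."""
    rng = np.random.default_rng(rng)
    preds = []
    for _ in range(n_models):
        # Resample the labeled set with replacement for each ensemble member.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        w = fit_ridge(X_train[idx], y_train[idx])
        preds.append(X_pool @ w)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)


def select_most_uncertain(std, batch_size):
    """Positions of the batch_size largest uncertainties, most uncertain first."""
    return np.argsort(std)[-batch_size:][::-1]


def active_learning_loop(X, y, n_init=10, batch_size=5, n_rounds=4, seed=0):
    """Grow the labeled set by repeatedly 'measuring' the most uncertain points.

    Returns the indices of labeled points and of the remaining pool.
    """
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(X), size=n_init, replace=False)]
    labeled_set = set(labeled)
    pool = [i for i in range(len(X)) if i not in labeled_set]
    for _ in range(n_rounds):
        _, std = ensemble_predict(X[labeled], y[labeled], X[pool], rng=rng)
        chosen = [pool[i] for i in select_most_uncertain(std, batch_size)]
        labeled.extend(chosen)  # in practice: run the experiment for these
        chosen_set = set(chosen)
        pool = [i for i in pool if i not in chosen_set]
    return labeled, pool
```

In this sketch the labels `y` already exist for every pool point, so "acquiring" a label simply reveals a known value; in an experimental setting that step would correspond to measuring the property for the selected molecules. The ensemble spread is only one of several possible uncertainty estimates and is used here purely because it is easy to write down.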

Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE Laboratory Directed Research and Development (LDRD) Program
Grant/Contract Number:
AC05-76RL01830
OSTI ID:
2228269
Report Number(s):
PNNL-SA--179045
Journal Information:
Journal of Cheminformatics, Vol. 15, Issue 1; ISSN 1758-2946
Publisher:
Chemistry Central Ltd.
Country of Publication:
United States
Language:
English
