OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Audacity of huge: overcoming challenges of data scarcity and data quality for machine learning in computational materials discovery

Journal Article · Current Opinion in Chemical Engineering

Machine learning (ML)-accelerated discovery requires large amounts of high-fidelity data to reveal predictive structure–property relationships. For many properties of interest in materials discovery, the challenging nature and high cost of data generation have resulted in a data landscape that is both scarcely populated and of dubious quality. Data-driven techniques starting to overcome these limitations include the use of consensus across functionals in density functional theory, the development of new functionals or accelerated electronic structure theories, and the detection of where computationally demanding methods are most necessary. When properties cannot be reliably simulated, large experimental data sets can be used to train ML models. In the absence of manual curation, increasingly sophisticated natural language processing and automated image analysis are making it possible to learn structure–property relationships from the literature. Finally, models trained on these data sets will improve as they incorporate community feedback.

Research Organization:
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States); Univ. of Minnesota, Minneapolis, MN (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); Defense Advanced Research Projects Agency (DARPA); National Science Foundation (NSF); US Department of the Navy, Office of Naval Research (ONR)
Grant/Contract Number:
NA0003965; SC0012702; SC0018096; SC0019112; CBET-1704266; CBET-1846426; D18AP00039; N00014-17-1-2956; N00014-18-1-2434; N00014-20-1-2150
OSTI ID:
1976980
Alternate ID(s):
OSTI ID: 1836609
Journal Information:
Current Opinion in Chemical Engineering, Vol. 36, Issue C; ISSN 2211-3398
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
