DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: MatFold: systematic insights into materials discovery models' performance through standardized cross-validation protocols

Journal Article · Digital Discovery
DOI: https://doi.org/10.1039/d4dd00250d · OSTI ID:2483691

Machine learning (ML) models in the materials sciences that are validated by overly simplistic cross-validation (CV) protocols can yield biased performance estimates for downstream modeling or materials screening tasks. This can be particularly counterproductive for applications where the time and cost of failed validation efforts (experimental synthesis, characterization, and testing) are consequential. We propose a set of standardized and increasingly difficult splitting protocols for chemically and structurally motivated CV that can be followed to validate any ML model for materials discovery. Among several benefits, this enables systematic insights into model generalizability, improvability, and uncertainty, provides benchmarks for fair comparison between competing models with access to differing quantities of data, and systematically reduces possible data leakage through increasingly strict splitting protocols. Performing thorough CV investigations across increasingly strict chemical/structural splitting criteria, local vs. global property prediction tasks, small vs. large datasets, and structure vs. compositional model architectures, we observe some common threads; however, several marked differences exist across these exemplars, indicating the need for comprehensive analysis to fully understand each model's generalization accuracy and potential for materials discovery. To this end, we provide a general-purpose, featurization-agnostic toolkit, MatFold, to automate reproducible construction of these CV splits and encourage further community use in model benchmarking.
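
As a rough illustration of what a chemically motivated split means, the sketch below holds out whole chemical systems so that no system appears in both the training and test sets of a fold. It is not MatFold's API: the function names `chemical_system` and `grouped_kfold`, the greedy fold balancing, and the example formulas are all assumptions made for this illustration, and the element parsing is a simple regular expression rather than a full formula parser.

```python
import re
from collections import defaultdict


def chemical_system(formula):
    """Return the sorted, dash-joined element symbols found in a formula string.

    Hypothetical helper: matches an uppercase letter optionally followed by
    one lowercase letter, which is enough for plain formulas like "SrTiO3".
    """
    elements = set(re.findall(r"[A-Z][a-z]?", formula))
    return "-".join(sorted(elements))


def grouped_kfold(formulas, k):
    """Build k (train, test) index splits that keep each chemical system intact.

    Hypothetical helper: every entry sharing a chemical system lands in the
    same fold, so a test fold never contains a system seen during training.
    """
    groups = defaultdict(list)
    for index, formula in enumerate(formulas):
        groups[chemical_system(formula)].append(index)

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest entries.
    folds = [[] for _ in range(k)]
    for key in sorted(groups, key=lambda g: (-len(groups[g]), g)):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[key])

    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for m in range(k) if m != i for j in folds[m])
        splits.append((train, test))
    return splits


# Illustrative formulas only; not data from the article.
formulas = ["Fe2O3", "FeO", "Fe3O4", "TiO2", "Ti2O3", "NaCl", "MgO", "SrTiO3"]
splits = grouped_kfold(formulas, k=3)

for fold_number, (train, test) in enumerate(splits):
    held_out = sorted({chemical_system(formulas[j]) for j in test})
    print(f"fold {fold_number}: held-out systems {held_out}")
```

A random split of the same entries could place FeO in training and Fe2O3 in testing, which is the kind of leakage that stricter grouping criteria are meant to reduce. Swapping `chemical_system` for a coarser or finer grouping key (for example, a single element or a structure label) would give a looser or stricter protocol under the same scheme.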

Research Organization:
Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE; USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Energy Efficiency and Renewable Energy (EERE), Office of Sustainable Transportation and Fuels. Hydrogen and Fuel Cell Technologies Office (HFTO)
Grant/Contract Number:
NA0003525
OSTI ID:
2483691
Report Number(s):
SAND--2025-03037J
Journal Information:
Journal Name: Digital Discovery; Journal Volume: 4; Journal Issue: 3; ISSN 2635-098X
Publisher:
Royal Society of Chemistry
Country of Publication:
United States
Language:
English
