U.S. Department of Energy
Office of Scientific and Technical Information

Evaluating causal‐based feature selection for fuel property prediction models

Journal Article · Statistical Analysis and Data Mining
DOI: https://doi.org/10.1002/sam.11511 · OSTI ID: 1786607
 [1];  [2];  [1];  [1]
  1. Sandia National Laboratories, Livermore, California, USA
  2. Sandia National Laboratories, Livermore, California, USA; Department of Immunology, University of Washington, Seattle, Washington, USA

Abstract

In‐silico screening of novel biofuel molecules based on chemical and fuel properties is a critical first step in the biofuel evaluation process due to the significant volumes of samples required for experimental testing, the destructive nature of engine tests, and the costs associated with bench‐scale synthesis of novel fuels. Predictive models are limited by training sets of few existing measurements, often containing similar classes of molecules that represent just a subset of the potential molecular fuel space. Software tools can be used to generate every possible molecular descriptor for use as input features, but most of these features are irrelevant, and training models on datasets with more features than samples tends to yield poor predictive performance. Feature selection has been shown to improve machine learning models, but correlation‐based feature selection fails to provide scientific insight into the underlying mechanisms that determine structure–property relationships. The implementation of causal discovery in feature selection could potentially inform the biofuel design process while also improving model prediction accuracy and robustness to new data. In this study, we investigate the benefits that causal‐based feature selection might have for both model performance and the identification of key molecular substructures. We found that causal‐based feature selection performed on par with alternative filtration methods, and that a structural causal model provides valuable scientific insights into the relationships between molecular substructures and fuel properties.
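
For readers unfamiliar with filter-style feature selection, the following is a minimal sketch of the correlation-based baseline that the abstract contrasts with causal-based selection. It is not the method used in the paper. The synthetic data, the choice of Pearson correlation as the score, and the names `pearson` and `select_top_k` are all assumptions made for illustration. The setup mimics the situation described above, with far more candidate descriptors than samples.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (var_x * var_y) ** 0.5


def select_top_k(features, target, k):
    """Rank feature columns by |Pearson r| with the target; keep the top k.

    `features` is a list of columns (one list of values per descriptor).
    Returns the indices of the selected columns in ascending order.
    """
    scores = [(abs(pearson(col, target)), idx) for idx, col in enumerate(features)]
    scores.sort(reverse=True)
    return sorted(idx for _, idx in scores[:k])


# Synthetic stand-in for a descriptor table (assumption, not real fuel data):
# 20 samples and 200 candidate descriptors, i.e. more features than samples.
rng = random.Random(0)
n_samples, n_features = 20, 200
features = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)]

# Hypothetical property that truly depends only on descriptors 3 and 17.
target = [
    2.0 * features[3][i] - 1.5 * features[17][i] + rng.gauss(0, 0.1)
    for i in range(n_samples)
]

# With so few samples, some irrelevant descriptors can correlate with the
# target by chance, so the selection may or may not contain exactly 3 and 17.
selected = select_top_k(features, target, k=5)
print(selected)
```

A filter of this kind scores each descriptor only by its association with the target, so it can keep chance correlates and says nothing about mechanism. That limitation is what motivates the causal-based selection the abstract investigates.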

Sponsoring Organization:
USDOE
Grant/Contract Number:
NA0003525
OSTI ID:
1786607
Journal Information:
Statistical Analysis and Data Mining, Vol. 14, Issue 6; ISSN 1932-1864
Publisher:
Wiley Blackwell (John Wiley & Sons)
Country of Publication:
United States
Language:
English
