Title: BioCompoundML: A General Biofuel Property Screening Tool for Biological Molecules Using Random Forest Classifiers

Journal Article · Energy and Fuels
 [1];  [2];  [3];  [1];  [4];  [1];  [1]
  1. Sandia National Lab. (SNL-CA), Livermore, CA (United States); Joint BioEnergy Inst. (JBEI), Emeryville, CA (United States)
  2. Sandia National Laboratories, Livermore, California 94551, United States
  3. National Renewable Energy Lab. (NREL), Golden, CO (United States)
  4. Sandia National Lab. (SNL-CA), Livermore, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Screening a large number of biologically derived molecules for potential fuel compounds without recourse to experimental testing is important in identifying understudied yet valuable molecules. Experimental testing, although a valuable standard for measuring fuel properties, has several major limitations, including the need for quantities large enough to test, considerable expense, and a large amount of time. This paper discusses the development of a general-purpose fuel property tool that uses machine learning to screen molecules for desirable fuel properties. BioCompoundML adopts a general methodology, requiring as input only a list of training compounds (with identifiers and measured values) and a list of testing compounds (with identifiers). For the training data, BioCompoundML collects open data from the National Center for Biotechnology Information, incorporates user-provided features, imputes missing values, performs feature reduction, builds a classifier, and clusters compounds. BioCompoundML then collects data for the testing compounds, predicts class membership, and determines whether compounds are found in the range of variability of the training data set. This tool is demonstrated using three different fuel properties: research octane number (RON), threshold soot index (TSI), and melting point (MP). We provide measures of its success with these properties using randomized train/test measurements: average accuracy is 88% in RON, 85% in TSI, and 94% in MP; average precision is 88% in RON, 88% in TSI, and 95% in MP; and average recall is 88% in RON, 82% in TSI, and 97% in MP. The area under the receiver operating characteristic curve was estimated at 0.88 in RON, 0.86 in TSI, and 0.87 in MP. We also measured the success of BioCompoundML by sending 16 compounds for direct RON determination. Finally, we provide a screen of 1977 hydrocarbons/oxygenates within the 8696 compounds in MetaCyc, identifying compounds with high predictive strength for high or low RON.
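The evaluation measures quoted in the abstract (accuracy, precision, recall, and area under the receiver operating characteristic curve) can be illustrated with a minimal sketch for a single train/test split of a two-class problem. This is not BioCompoundML's own code: the function names, the 0.5 decision threshold, and the toy labels and scores below are all assumptions made for illustration, and the reported figures are averages over many randomized splits rather than a single computation like this one.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all compounds assigned to the correct class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predicted positives that are truly positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true positives that the classifier recovered."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = "high RON" class, 0 = "low RON" class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
# Hypothetical classifier scores (e.g. predicted probability of the positive class).
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
# Assumed decision threshold of 0.5 for turning scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy :", accuracy(y_true, y_pred))
print("precision:", precision(y_true, y_pred))
print("recall   :", recall(y_true, y_pred))
print("ROC AUC  :", roc_auc(y_true, scores))
```

Accuracy, precision, and recall depend on the chosen threshold, whereas the AUC is computed from the raw scores and summarizes ranking quality across all thresholds, which is why the two kinds of figures are reported separately.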

Research Organization:
National Renewable Energy Laboratory (NREL), Golden, CO (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Vehicle Technologies Office (EE-3V); USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Biological and Environmental Research (BER)
Grant/Contract Number:
AC36-08GO28308; AC04-94AL85000; 347AC36-99GO10337; AC02-05CH11231
OSTI ID:
1327434
Alternate ID(s):
OSTI ID: 1332664; OSTI ID: 1440943
Report Number(s):
NREL/JA-5400-67434
Journal Information:
Energy and Fuels, Vol. 30, Issue 10; ISSN 0887-0624
Publisher:
American Chemical Society (ACS)
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 20 works (citation information provided by Web of Science)
