OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: BioCompoundML: A General Biofuel Property Screening Tool for Biological Molecules Using Random Forest Classifiers

Abstract

Screening a large number of biologically derived molecules for potential fuel compounds without recourse to experimental testing is important in identifying understudied yet valuable molecules. Experimental testing, although a valuable standard for measuring fuel properties, has several major limitations, including the requirement of testably high quantities, considerable expense, and a large amount of time. This paper discusses the development of a general-purpose fuel property tool, using machine learning, whose outcome is to screen molecules for desirable fuel properties. BioCompoundML adopts a general methodology, requiring as input only a list of training compounds (with identifiers and measured values) and a list of testing compounds (with identifiers). For the training data, BioCompoundML collects open data from the National Center for Biotechnology Information, incorporates user-provided features, imputes missing values, performs feature reduction, builds a classifier, and clusters compounds. BioCompoundML then collects data for the testing compounds, predicts class membership, and determines whether compounds are found in the range of variability of the training data set. We demonstrate this tool using three different fuel properties: research octane number (RON), threshold soot index (TSI), and melting point (MP). Here we provide measures of its success with these properties using randomized train/test measurements: average accuracy is 88% in RON, 85% in TSI, and 94% in MP; average precision is 88% in RON, 88% in TSI, and 95% in MP; and average recall is 88% in RON, 82% in TSI, and 97% in MP. The receiver operator characteristics (area under the curve) were estimated at 0.88 in RON, 0.86 in TSI, and 0.87 in MP. We also measured the success of BioCompoundML by sending 16 compounds for direct RON determination. Finally, we provide a screen of 1977 hydrocarbons/oxygenates within the 8696 compounds in MetaCyc, identifying compounds with high predictive strength for high or low RON.
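
The abstract reports accuracy, precision, recall, and area under the ROC curve for a binary classifier. As a rough illustration of how those four quantities are defined, the following minimal Python sketch computes them for a toy set of labels and scores. The function names, the 0.5 decision threshold, and the toy data are all assumptions made for this example; none of them come from the paper or from the BioCompoundML code.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of true positives that the classifier recovers."""
    return tp / (tp + fn)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive receives a
    higher score than a randomly chosen negative (ties count as 0.5).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = "high" class, 0 = "low" class (not from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("accuracy :", accuracy(tp, fp, tn, fn))
print("precision:", precision(tp, fp))
print("recall   :", recall(tp, fn))
print("ROC AUC  :", roc_auc(y_true, scores))
```

Accuracy, precision, and recall depend on the chosen threshold, whereas the AUC is computed from the raw scores and summarizes ranking quality across all thresholds. This may be one reason the abstract reports the AUC separately from the other three measures.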

Authors:
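 Whitmore, Leanne S. [1]; Davis, Ryan W. [2]; McCormick, Robert L. [3]; Gladden, John M. [1]; Simmons, Blake A. [4]; George, Anthe [1]; Hudson, Corey M. [1]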
  1. Sandia National Lab. (SNL-CA), Livermore, CA (United States); Joint BioEnergy Inst. (JBEI), Emeryville, CA (United States)
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
  3. National Renewable Energy Lab. (NREL), Golden, CO (United States)
  4. Sandia National Lab. (SNL-CA), Livermore, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
Research Org.:
National Renewable Energy Lab. (NREL), Golden, CO (United States)
Sponsoring Org.:
USDOE Office of Energy Efficiency and Renewable Energy (EERE)
OSTI Identifier:
1327434
Report Number(s):
NREL/JA-5400-67434
Journal ID: ISSN 0887-0624
Grant/Contract Number:
AC36-08GO28308; AC04-94AL85000; 347AC36-99GO10337; AC02-05CH11231; BM0102060; DE347AC36-99GO10337
Resource Type:
Journal Article: Published Article
Journal Name:
Energy and Fuels
Additional Journal Information:
Journal Volume: 30; Journal Issue: 10; Journal ID: ISSN 0887-0624
Publisher:
American Chemical Society (ACS)
Country of Publication:
United States
Language:
English
Subject:
09 BIOMASS FUELS; screening tool; fuel properties; BioCompoundML

Citation Formats

MLA: Whitmore, Leanne S., Davis, Ryan W., McCormick, Robert L., Gladden, John M., Simmons, Blake A., George, Anthe, and Hudson, Corey M. BioCompoundML: A General Biofuel Property Screening Tool for Biological Molecules Using Random Forest Classifiers. United States: N. p., 2016. Web. doi:10.1021/acs.energyfuels.6b01952.
APA: Whitmore, Leanne S., Davis, Ryan W., McCormick, Robert L., Gladden, John M., Simmons, Blake A., George, Anthe, & Hudson, Corey M. BioCompoundML: A General Biofuel Property Screening Tool for Biological Molecules Using Random Forest Classifiers. United States. doi:10.1021/acs.energyfuels.6b01952.
Chicago: Whitmore, Leanne S., Davis, Ryan W., McCormick, Robert L., Gladden, John M., Simmons, Blake A., George, Anthe, and Hudson, Corey M. 2016. "BioCompoundML: A General Biofuel Property Screening Tool for Biological Molecules Using Random Forest Classifiers". United States. doi:10.1021/acs.energyfuels.6b01952.
BibTeX:
@article{osti_1327434,
title = {BioCompoundML: A General Biofuel Property Screening Tool for Biological Molecules Using Random Forest Classifiers},
author = {Whitmore, Leanne S. and Davis, Ryan W. and McCormick, Robert L. and Gladden, John M. and Simmons, Blake A. and George, Anthe and Hudson, Corey M.},
abstractNote = {Screening a large number of biologically derived molecules for potential fuel compounds without recourse to experimental testing is important in identifying understudied yet valuable molecules. Experimental testing, although a valuable standard for measuring fuel properties, has several major limitations, including the requirement of testably high quantities, considerable expense, and a large amount of time. This paper discusses the development of a general-purpose fuel property tool, using machine learning, whose outcome is to screen molecules for desirable fuel properties. BioCompoundML adopts a general methodology, requiring as input only a list of training compounds (with identifiers and measured values) and a list of testing compounds (with identifiers). For the training data, BioCompoundML collects open data from the National Center for Biotechnology Information, incorporates user-provided features, imputes missing values, performs feature reduction, builds a classifier, and clusters compounds. BioCompoundML then collects data for the testing compounds, predicts class membership, and determines whether compounds are found in the range of variability of the training data set. We demonstrate this tool using three different fuel properties: research octane number (RON), threshold soot index (TSI), and melting point (MP). Here we provide measures of its success with these properties using randomized train/test measurements: average accuracy is 88% in RON, 85% in TSI, and 94% in MP; average precision is 88% in RON, 88% in TSI, and 95% in MP; and average recall is 88% in RON, 82% in TSI, and 97% in MP. The receiver operator characteristics (area under the curve) were estimated at 0.88 in RON, 0.86 in TSI, and 0.87 in MP. We also measured the success of BioCompoundML by sending 16 compounds for direct RON determination. 
Finally, we provide a screen of 1977 hydrocarbons/oxygenates within the 8696 compounds in MetaCyc, identifying compounds with high predictive strength for high or low RON.},
doi = {10.1021/acs.energyfuels.6b01952},
journal = {Energy and Fuels},
number = 10,
volume = 30,
place = {United States},
year = 2016,
month = 9
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record at 10.1021/acs.energyfuels.6b01952

  • The utility of a fish DNA damage assay as a rapid monitoring tool was investigated. Metal plating wastewater was chosen as a sample because it contains various genotoxic metal species. Fish DNA damage assay results were compared to data generated from the conventional whole effluent toxicity (WET) test procedure. The Microtox® assay (Azur Environmental, Carlsbad, CA, USA) using Vibrio fischeri was also employed. Eleven samples from two metal plating companies were collected for this evaluation. For the fish DNA damage assay, 7-d-old fathead minnow larvae, Pimephales promelas, were utilized. They were exposed to a series of dilutions at 20 °C for 2 h. Whole effluent toxicity tests conducted in this study included two acute toxicity tests with Daphnia magna and fathead minnows and two chronic toxicity tests with Ceriodaphnia dubia and fathead minnows. The fish DNA damage assay showed good correlations with both the acute and chronic WET test results, especially with those obtained with fathead minnows. The kappa values, an index of agreement, between the fish DNA damage assay and WET tests were shown to be acceptable. These findings imply that this novel fish DNA damage assay has use as an expedient toxicity screening procedure since it produces comparable results to those of the acute and chronic fathead minnow toxicity tests.
  • The goal of this testing was to evaluate the ability of currently available commercial off-the-shelf (COTS) biological indicator tests and immunoassays to detect Bacillus anthracis (Ba) spores and ricin. In general, immunoassays provide more specific identification of biological threats as compared to indicator tests [3]. Many of these detection products are widely used by first responders and other end users. In most cases, performance data for these instruments are supplied directly from the manufacturer, but have not been verified by an external, independent assessment [1]. Our test plan modules included assessments of inclusivity (ability to generate true positive results), commonly encountered hoax powders (which can cause potential interferences or false positives), and estimation of limit of detection (LOD) (sensitivity) testing.
  • Purpose: Develop an automated Random Forest algorithm for tissue segmentation of CT examinations. Methods: Seven materials were classified for segmentation: background, lung/internal gas, fat, muscle, solid organ parenchyma, blood/contrast, and bone using Matlab and the Trainable Weka Segmentation (TWS) plugin of FIJI. The following classifier feature filters of TWS were investigated: minimum, maximum, mean, and variance, each evaluated over a pixel radius of 2^n (n = 0–4). Also noise reduction and edge preserving filters, Gaussian, bilateral, Kuwahara, and anisotropic diffusion, were evaluated. The algorithm used 200 trees with 2 features per node. A training data set was established using an anonymized patient’s (male, 20 yr, 72 kg) chest-abdomen-pelvis CT examination. To establish segmentation ground truth, the training data were manually segmented using Eclipse planning software, and an intra-observer reproducibility test was conducted. Six additional patient data sets were segmented based on classifier data generated from the training data. Accuracy of segmentation was determined by calculating the Dice similarity coefficient (DSC) between manual and auto segmented images. Results: The optimized autosegmentation algorithm resulted in 16 features calculated using maximum, mean, variance, and Gaussian blur filters with kernel radii of 1, 2, and 4 pixels, in addition to the original CT number, and Kuwahara filter (linear kernel of 19 pixels). Ground truth had a DSC of 0.94 (range: 0.90–0.99) for adult and 0.92 (range: 0.85–0.99) for pediatric data sets across all seven segmentation classes. The automated algorithm produced segmentation with an average DSC of 0.85 ± 0.04 (range: 0.81–1.00) for the adult patients, and 0.86 ± 0.03 (range: 0.80–0.99) for the pediatric patients. Conclusion: The TWS Random Forest auto-segmentation algorithm was optimized for CT environment, and able to segment seven material classes over a range of body habitus and CT protocol parameters with an average DSC of 0.86 ± 0.04 (range: 0.80–0.99).