Expansion of bond dissociation prediction with machine learning to medicinally and environmentally relevant chemical space
- Colorado State Univ., Fort Collins, CO (United States)
- National Renewable Energy Laboratory (NREL), Golden, CO (United States)
Bond dissociation energetics underpin the thermodynamics of chemical transformations where bonds are broken or formed and can also be used to predict reaction rates and selectivities. Current machine learning (ML) models to predict bond dissociation energy (BDE) are largely limited in their elemental coverage to hydrogen and the second-row elements. This has restricted the applicability of ML-derived BDE predictions, particularly for molecules of medicinal relevance, since the heteroatoms S, Cl, F, P, Br, and I are commonly found in approved pharmaceuticals. Atmospherically and environmentally relevant molecules containing multiple halogen atoms have been similarly inaccessible. In this study, we considerably expand the size, elemental composition, and bond types of an extensive BDE database and train a new ML BDE model that includes C, H, N, O, S, Cl, F, P, Br, and I. We curate a new quantum chemical dataset of 531 244 unique zero-point energy inclusive homolytic dissociations of organic compounds. We investigate accuracy for out-of-sample molecules and implement iterative training and testing cycles during model development to improve the model accuracy. Improvements in predictive accuracy were achieved for datasets of pharmaceutically relevant molecules containing multiple C(sp²)–halogen bonds from 5.7 to 0.8 kcal mol⁻¹ and polyhaloalkyl compounds with multiple C(sp³)–halogen bonds from 2.7 to 1.2 kcal mol⁻¹ through the targeted augmentation of training data by as few as eight additional molecules. Our updated and expanded model (ALFABET) achieves a mean absolute error of 0.6 kcal mol⁻¹ for both enthalpies and free energies compared to the quantum chemical ground truth. The graph-based representations utilized here outperform traditional cheminformatics features such as radial fingerprints, and there is no discernible improvement in accuracy by including more expensive QM-derived parameters, such as optimized bond lengths.
Finally, we illustrate high accuracy in external prediction tasks for large halogenated natural products, pharmaceutically relevant halogenated molecules, atmospherically important halocarbons, and polyfluoroalkyl substances related to environmental toxicity.
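The two quantities the abstract relies on, a homolytic bond dissociation enthalpy and a mean absolute error against reference values, can be illustrated with a minimal Python sketch. This is not the ALFABET code or its API: the function names are hypothetical, and every number is an invented placeholder in kcal/mol rather than data from the study.

```python
from statistics import fmean


def bond_dissociation_enthalpy(h_parent, h_radical_a, h_radical_b):
    """Homolytic BDE for A-B -> A. + B.

    Computed as the summed enthalpies of the two radical fragments
    minus the enthalpy of the parent molecule (all in kcal/mol).
    """
    return h_radical_a + h_radical_b - h_parent


def mean_absolute_error(predicted, reference):
    """Average absolute deviation between predicted and reference values."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return fmean([abs(p - r) for p, r in zip(predicted, reference)])


# Hypothetical enthalpies (kcal/mol), chosen only to make the arithmetic easy.
reference_bdes = [
    bond_dissociation_enthalpy(-100.0, -30.0, 25.0),
    bond_dissociation_enthalpy(-50.0, 10.0, 21.0),
]

# Hypothetical model predictions for the same two bonds.
predicted_bdes = [95.5, 80.0]

mae = mean_absolute_error(predicted_bdes, reference_bdes)
print(f"reference BDEs: {reference_bdes}")
print(f"MAE: {mae:.2f} kcal/mol")
```

In the study itself, the reference values come from quantum chemical calculations and the predictions from the trained model; the sketch only shows how the reported error metric relates the two.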
- Research Organization:
- National Renewable Energy Laboratory (NREL), Golden, CO (United States)
- Sponsoring Organization:
- USDOE; National Science Foundation (NSF)
- Grant/Contract Number:
- AC36-08GO28308; CHE–2202693; 2201538
- OSTI ID:
- 2203761
- Alternate ID(s):
- OSTI ID: 2279169
- Report Number(s):
- NREL/JA-2700-88470; MainId:89249; UUID:b4f8df09-0031-4732-9e03-debc76630278; MainAdminID:71473
- Journal Information:
- Digital Discovery, Vol. 2, Issue 6; ISSN 2635-098X
- Publisher:
- Royal Society of Chemistry
- Country of Publication:
- United States
- Language:
- English