DOE Data Explorer, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Datasets for Custom-trained Machine-learning Interatomic Potentials: Nitric Acid Aqueous Solution

Abstract

This dataset was generated using an iterative active-learning strategy with the ArcaNN software package (https://github.com/arcann-chem/arcann_training) to train machine-learning interatomic potentials (MLIPs) for aqueous nitric acid. Each active-learning cycle consisted of three stages: (1) training, (2) exploration, and (3) labeling. The initial training set comprised approximately 800 randomly selected configurations from a previous study by Lewis et al. (https://doi.org/10.1021/jp205510q), which investigated nitric acid solutions at 2, 3, 4, and 5 mol/L. For all configurations, single-point calculations of atomic forces and total energies were performed with density functional theory at the BLYP-D2 and PBE-D3 levels using the CP2K Quickstep module. Valence electrons were treated explicitly, while core electrons on all atoms were represented by norm-conserving Goedecker–Teter–Hutter (GTH) pseudopotentials. Long-range dispersion interactions were accounted for using Grimme dispersion corrections. Wave functions were expanded in a mixed Gaussian-and-plane-wave scheme using TZV2P-MOLOPT basis sets for all elements and an 800 Ry auxiliary plane-wave cutoff for the electron density. Self-consistent field convergence was accelerated using orbital transformation and Direct Inversion in the Iterative Subspace (DIIS), with a convergence threshold of 10^{-6}. All single-point calculations were carried out in periodic orthorhombic cells whose dimensions match those of the molecular configurations sampled from earlier trajectories. The CELL_REF keyword in CP2K was used to define a fixed reference cell, ensuring consistency in the reference data used for MLIP training, particularly when cell fluctuations are present in NpT simulations. The resulting high-fidelity energies and forces constitute the ground-truth labels used to train the MLIPs contained in this dataset.
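
The three-stage cycle described above (training, exploration, labeling) can be illustrated with a minimal toy sketch in Python. This is not ArcaNN's implementation and uses none of its API: every function name here is hypothetical, the cubic `reference_label` merely stands in for an expensive single-point calculation, the "models" are nearest-neighbour lookup tables, and selecting candidates by committee disagreement is an assumption chosen for illustration, since the abstract does not state the selection criterion.

```python
import random
import statistics


def reference_label(x):
    """Stand-in for an expensive reference calculation (hypothetical toy target)."""
    return x ** 3 - x


def train_committee(dataset, n_models=4, seed=0):
    """Training stage: build several cheap models from bootstrap resamples."""
    rng = random.Random(seed)
    # Each "model" is simply a resampled lookup table of (x, y) pairs.
    return [[rng.choice(dataset) for _ in dataset] for _ in range(n_models)]


def predict(model, x):
    """Nearest-neighbour prediction from one committee member."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


def explore(committee, candidates, threshold):
    """Exploration stage: keep candidates on which the committee disagrees."""
    selected = []
    for x in candidates:
        predictions = [predict(model, x) for model in committee]
        if statistics.pstdev(predictions) > threshold:
            selected.append(x)
    return selected


def active_learning(initial_points, candidates, n_cycles=3, threshold=0.05):
    """Run train -> explore -> label cycles and return the labelled dataset."""
    dataset = [(x, reference_label(x)) for x in initial_points]
    for cycle in range(n_cycles):
        committee = train_committee(dataset, seed=cycle)      # (1) training
        new_points = explore(committee, candidates, threshold)  # (2) exploration
        if not new_points:
            break  # the committee agrees everywhere; nothing left to label
        already_labelled = {x for x, _ in dataset}
        # (3) labeling: compute reference values only for unseen points
        dataset.extend(
            (x, reference_label(x)) for x in new_points if x not in already_labelled
        )
    return dataset


if __name__ == "__main__":
    grid = [i / 10 for i in range(-20, 21)]
    data = active_learning([-2.0, 0.0, 2.0], grid)
    print(f"labelled configurations: {len(data)}")
```

In this sketch the labelled set only grows where the committee is uncertain, which is the general motivation for active learning: expensive reference calculations are spent on configurations the current models describe poorly.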

Authors:
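Dinpajooh, Mohammadhasan; LaCount, Michael D; Muller, Scott E; Henson, Neil J; Mejia Rodriguez, Daniel; Gomez, Axel; Mundy, Christopher J; Ritzmann, Andrew M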
  1. Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Publication Date:
DOE Contract Number:  
AC05-76RL01830
Research Org.:
PNNL (PNNL2)
Sponsoring Org.:
USDOE Office of Science (SC), Basic Energy Sciences (BES)
OSTI Identifier:
3004762
DOI:
https://doi.org/10.25584/3004762

Citation Formats

Dinpajooh, Mohammadhasan, LaCount, Michael D, Muller, Scott E, Henson, Neil J, Mejia Rodriguez, Daniel, Gomez, Axel, Mundy, Christopher J, and Ritzmann, Andrew M. Datasets for Custom-trained Machine-learning Interatomic Potentials: Nitric Acid Aqueous Solution. United States: N. p., 2025. Web. doi:10.25584/3004762.
Dinpajooh, Mohammadhasan, LaCount, Michael D, Muller, Scott E, Henson, Neil J, Mejia Rodriguez, Daniel, Gomez, Axel, Mundy, Christopher J, & Ritzmann, Andrew M. Datasets for Custom-trained Machine-learning Interatomic Potentials: Nitric Acid Aqueous Solution. United States. doi:https://doi.org/10.25584/3004762
Dinpajooh, Mohammadhasan, LaCount, Michael D, Muller, Scott E, Henson, Neil J, Mejia Rodriguez, Daniel, Gomez, Axel, Mundy, Christopher J, and Ritzmann, Andrew M. 2025. "Datasets for Custom-trained Machine-learning Interatomic Potentials: Nitric Acid Aqueous Solution". United States. doi:https://doi.org/10.25584/3004762. https://www.osti.gov/servlets/purl/3004762. Pub date:Thu Nov 20 23:00:00 EST 2025
@article{osti_3004762,
title = {Datasets for Custom-trained Machine-learning Interatomic Potentials: Nitric Acid Aqueous Solution},
author = {Dinpajooh, Mohammadhasan and LaCount, Michael D and Muller, Scott E and Henson, Neil J and Mejia Rodriguez, Daniel and Gomez, Axel and Mundy, Christopher J and Ritzmann, Andrew M},
abstractNote = {This dataset was generated using an iterative active-learning strategy with the ArcaNN software package (https://github.com/arcann-chem/arcann_training) to train machine-learning interatomic potentials (MLIPs) for aqueous nitric acid. Each active-learning cycle consisted of three stages: (1) training, (2) exploration, and (3) labeling. The initial training set comprised approximately 800 randomly selected configurations from a previous study by Lewis et al. (https://doi.org/10.1021/jp205510q), which investigated nitric acid solutions at 2, 3, 4, and 5 mol/L. For all configurations, single-point calculations of atomic forces and total energies were performed with density functional theory at the BLYP-D2 and PBE-D3 levels using the CP2K Quickstep module. Valence electrons were treated explicitly, while core electrons on all atoms were represented by norm-conserving Goedecker–Teter–Hutter (GTH) pseudopotentials. Long-range dispersion interactions were accounted for using Grimme dispersion corrections. Wave functions were expanded in a mixed Gaussian-and-plane-wave scheme using TZV2P-MOLOPT basis sets for all elements and an 800 Ry auxiliary plane-wave cutoff for the electron density. Self-consistent field convergence was accelerated using orbital transformation and Direct Inversion in the Iterative Subspace (DIIS), with a convergence threshold of 10^{-6}. All single-point calculations were carried out in periodic orthorhombic cells whose dimensions match those of the molecular configurations sampled from earlier trajectories. The CELL_REF keyword in CP2K was used to define a fixed reference cell, ensuring consistency in the reference data used for MLIP training, particularly when cell fluctuations are present in NpT simulations. The resulting high-fidelity energies and forces constitute the ground-truth labels used to train the MLIPs contained in this dataset.},
doi = {10.25584/3004762},
place = {United States},
year = {2025},
month = {nov}
}