DOE Data Explorer
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”

Abstract

This data package is associated with the publication “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning” submitted to the Journal of Geophysical Research: Machine Learning and Computation (Scheibe et al. 2024). River sediment respiration observations are expensive and labor-intensive to obtain, and there is no physical model for predicting this quantity. The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) observational data set (Goldman et al., 2020) is used to train machine learning (ML) models to predict respiration rates at unsampled sites. This repository archives training data, ML models, predictions, and model evaluation results to support reproducibility of the results in the associated manuscript and community reuse of the ML models trained in this project.

One of the key challenges in this work was to find an optimum configuration for machine learning models to work with this feature-rich (i.e., 100+ possible input variables) data set. Here, we used a two-tiered approach to managing the analysis of this complex data set: 1) a stacked ensemble of ML models that can automatically optimize hyperparameters to accelerate the process of model selection and tuning and 2) feature permutation importance (FPI) to iteratively select the most important features (i.e., inputs) to the ML models. The major elements of this ML workflow are modular, portable, open, and cloud-based, thus making this implementation a potential template for other applications.

This data package is associated with the GitHub repository found at https://github.com/parallelworks/sl-archive-whondrs. A static copy of the GitHub repository is included in this data package as an archived version at the time of publishing this data package (March 2023). However, we recommend accessing these files via GitHub for full functionality. Please see the file level metadata (flmd; “sl-archive-whondrs_flmd.csv”) for a list of all files contained in this data package and descriptions for each. Please see the data dictionary (dd; “sl-archive-whondrs_dd.csv”) for a list of all column headers contained within comma separated value (csv) files in this data package and descriptions for each.

The GitHub repository is organized into five top-level directories: (1) “input_data” holds the training data for the ML models; (2) “ml_models” holds machine learning models trained on the data in “input_data”; (3) “scripts” contains data preprocessing and postprocessing scripts and intermediate results specific to this data set that bookend the ML workflow; (4) “examples” contains the visualization of the results in this repository, including plotting scripts for the manuscript (e.g., model evaluation, FPI results) and scripts for running predictions with the ML models (i.e., reusing the trained ML models); (5) “output_data” holds the overall results of the ML model on that branch. Each trained ML model resides on its own branch in the repository, so inputs and outputs can differ from branch to branch. Furthermore, depending on the number of features used to train the ML models, the preprocessing and postprocessing scripts, and their intermediate results, can also differ from branch to branch. The “main-*” branches are meant to be starting points (i.e., trunks) for each model branch (i.e., sprouts). Please see the Branch Navigation section in the top-level README.md in the GitHub repository for more details. There is also one hidden directory, “.github/workflows”, which contains information on how to run the ML workflow as an end-to-end automated GitHub Action; it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.
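To make the feature permutation importance step named in the abstract more concrete, the following is a minimal sketch of the general technique, not the code archived in the repository. It uses only the Python standard library. The data are synthetic (not WHONDRS data), and `predict` is a hypothetical stand-in for an already-trained model; the archived workflow uses a stacked ensemble instead. The function names, the mean-squared-error metric, and the number of shuffles are all assumptions made for illustration.

```python
import random
from statistics import mean


def mean_squared_error(predict, rows, targets):
    """Average squared difference between predictions and targets."""
    return mean((predict(row) - t) ** 2 for row, t in zip(rows, targets))


def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Return, per feature, the mean rise in MSE after shuffling that feature."""
    rng = random.Random(seed)
    baseline = mean_squared_error(predict, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        rises = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            rises.append(mean_squared_error(predict, shuffled, targets) - baseline)
        importances.append(mean(rises))
    return importances


# Synthetic data (an assumption for this sketch): the target depends strongly
# on feature 0, weakly on feature 1, and not at all on feature 2.
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
targets = [3.0 * r[0] + 0.5 * r[1] + data_rng.gauss(0.0, 0.1) for r in rows]


def predict(row):
    """Hypothetical stand-in for an already-trained model."""
    return 3.0 * row[0] + 0.5 * row[1]


importances = permutation_importance(predict, rows, targets)
# Rank feature indices from most to least important.
ranked = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
print(ranked, importances)
```

In an iterative selection loop of the kind the abstract describes, one would presumably keep the top-ranked features, retrain the model on that reduced set, and repeat; the sketch shows a single ranking round only.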

Authors:
Gary, Stefan; Scheibe, Timothy D.; Rexer, Em; Wilde, Michael; Vidal Torreira, Alvaro; Garayburu-Caruso, Vanessa A.; Goldman, Amy E.; Stegen, James C.
  1. Parallel Works Inc.
  2. Pacific Northwest National Laboratory
Publication Date:
2024-02-23
DOE Contract Number:  
DOE Award #54737
Research Org.:
Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) (United States)
Sponsoring Org.:
U.S. DOE > Office of Science > Biological and Environmental Research (BER)
Subject:
54 ENVIRONMENTAL SCIENCES
Keywords:
WHONDRS; Hyporheic zone; Respiration; Machine learning; ML; River corridor; River; Stream; Watershed; Catchment; CONUS; Contiguous United States; Flow cytometry; ESS-DIVE CSV File Formatting Guidelines Reporting Format; ESS-DIVE File Level Metadata Reporting Format; ESS-DIVE Model Data Archiving Guidelines; Sediment respiration rate; 15N; Bacterial abundance; FTICR-MS; Non-purgeable organic carbon; NPOC; Dissolved organic carbon; DOC; Grain size; Land use; Climate; 13C; EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > WATER QUALITY/WATER CHEMISTRY > ISOTOPES > STABLE ISOTOPES; EARTH SCIENCE > LAND SURFACE > LAND USE/LAND COVER > LAND USE CLASSES
OSTI Identifier:
2318723
DOI:
https://doi.org/10.15485/2318723

Citation Formats

Gary, Stefan, Scheibe, Timothy D., Rexer, Em, Wilde, Michael, Vidal Torreira, Alvaro, Garayburu-Caruso, Vanessa A., Goldman, Amy E., and Stegen, James C. Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”. United States: N. p., 2024. Web. doi:10.15485/2318723.
Gary, Stefan, Scheibe, Timothy D., Rexer, Em, Wilde, Michael, Vidal Torreira, Alvaro, Garayburu-Caruso, Vanessa A., Goldman, Amy E., & Stegen, James C. Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”. United States. doi:https://doi.org/10.15485/2318723
Gary, Stefan, Scheibe, Timothy D., Rexer, Em, Wilde, Michael, Vidal Torreira, Alvaro, Garayburu-Caruso, Vanessa A., Goldman, Amy E., and Stegen, James C. 2024. "Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”". United States. doi:https://doi.org/10.15485/2318723. https://www.osti.gov/servlets/purl/2318723. Pub date: February 23, 2024.
@article{osti_2318723,
title = {Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”},
author = {Gary, Stefan and Scheibe, Timothy D. and Rexer, Em and Wilde, Michael and Vidal Torreira, Alvaro and Garayburu-Caruso, Vanessa A. and Goldman, Amy E. and Stegen, James C.},
doi = {10.15485/2318723},
place = {United States},
year = {2024},
month = feb
}