Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”

Gary, Stefan; Scheibe, Timothy D.; Rexer, Em; Wilde, Michael; Vidal Torreira, Alvaro; Garayburu-Caruso, Vanessa A.; Goldman, Amy E.; Stegen, James C.

doi:10.15485/2318723

Dataset

Abstract

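This data package is associated with the publication “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning” submitted to the Journal of Geophysical Research: Machine Learning and Computation (Scheibe et al. 2024). River sediment respiration observations are expensive and labor intensive to obtain, and there is no physical model for predicting this quantity. The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) observational data set (Goldman et al., 2020) is used to train machine learning (ML) models to predict respiration rates at unsampled sites. This repository archives training data, ML models, predictions, and model evaluation results to support reproducibility of the results in the associated manuscript and community reuse of the ML models trained in this project.

One of the key challenges in this work was to find an optimum configuration for ML models to work with this feature-rich (i.e., 100+ possible input variables) data set. We used a two-tiered approach to manage the analysis of this complex data set: 1) a stacked ensemble of ML models that can automatically optimize hyperparameters to accelerate model selection and tuning, and 2) feature permutation importance (FPI) to iteratively select the most important features (i.e., inputs) to the ML models. The major elements of this ML workflow are modular, portable, open, and cloud-based, making this implementation a potential template for other applications.

This data package is associated with the GitHub repository found at https://github.com/parallelworks/sl-archive-whondrs. A static copy of the GitHub repository is included in this data package as an archived version at the time of publishing this data package (March 2023). However, we recommend accessing these files via GitHub for full functionality. Please see the file-level metadata (flmd; “sl-archive-whondrs_flmd.csv”) for a list of all files contained in this data package and descriptions for each. Please see the data dictionary (dd; “sl-archive-whondrs_dd.csv”) for a list of all column headers contained within comma-separated value (csv) files in this data package and descriptions for each.

The GitHub repository is organized into five top-level directories: (1) “input_data” holds the training data for the ML models; (2) “ml_models” holds machine learning models trained on the data in “input_data”; (3) “scripts” contains data preprocessing and postprocessing scripts and intermediate results specific to this data set that bookend the ML workflow; (4) “examples” contains the visualization of the results in this repository, including plotting scripts for the manuscript (e.g., model evaluation, FPI results) and scripts for running predictions with the ML models (i.e., reusing the trained ML models); (5) “output_data” holds the overall results of the ML model on that branch.

Each trained ML model resides on its own branch in the repository; this means that inputs and outputs can differ from branch to branch. Furthermore, depending on the number of features used to train the ML models, the preprocessing and postprocessing scripts, and their intermediate results, can also differ from branch to branch. The “main-*” branches are meant to be starting points (i.e., trunks) for each model branch (i.e., sprouts). Please see the Branch Navigation section in the top-level README.md in the GitHub repository for more details. There is also one hidden directory, “.github/workflows”. This hidden directory contains information on how to run the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.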
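For readers unfamiliar with feature permutation importance, the following is a minimal sketch of the general technique, not the code archived in this data package. It assumes a toy data set with three features and a hypothetical fixed linear function (`toy_model`) standing in for a trained model; the stacked ensemble and hyperparameter optimization described in the abstract are not reproduced. The names `r2_score`, `permutation_importance`, and `toy_model` are illustrative only.

```python
import random
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination for two equal-length sequences."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R^2 when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other column intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = r2_score(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        importances.append(mean(drops))
    return importances


# Toy data (assumption): the target depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [3.0 * row[0] + 0.5 * row[1] for row in X]


def toy_model(row):
    """Hypothetical stand-in for a trained model's predict function."""
    return 3.0 * row[0] + 0.5 * row[1]


scores = permutation_importance(toy_model, X, y)
# Feature indices ordered from most to least important.
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(ranked)
```

In an iterative selection scheme, a ranking like this could be used to drop the least important features, retrain, and repeat; how the archived workflow does this is documented in the GitHub repository rather than here.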

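Authors:

Gary, Stefan; Scheibe, Timothy D.; Rexer, Em; Wilde, Michael; Vidal Torreira, Alvaro; Garayburu-Caruso, Vanessa A.; Goldman, Amy E.; Stegen, James C.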

Parallel Works Inc.
Pacific Northwest National Laboratory

Publication Date: February 23, 2024

DOE Contract Number: DOE Award #54737

Research Org.: Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) (United States)

Sponsoring Org.: U.S. DOE > Office of Science > Biological and Environmental Research (BER)

Subject: 54 ENVIRONMENTAL SCIENCES

Keywords: WHONDRS; Hyporheic zone; Respiration; Machine learning; ML; River corridor; River; Stream; Watershed; Catchment; CONUS; Contiguous United States; Flow cytometry; ESS-DIVE CSV File Formatting Guidelines Reporting Format; ESS-DIVE File Level Metadata Reporting Format; ESS-DIVE Model Data Archiving Guidelines; Sediment respiration rate; 15N; Bacterial abundance; FTICR-MS; Non-purgeable organic carbon; NPOC; Dissolved organic carbon; DOC; Grain size; Land use; Climate; 13C; EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > WATER QUALITY/WATER CHEMISTRY > ISOTOPES > STABLE ISOTOPES; EARTH SCIENCE > LAND SURFACE > LAND USE/LAND COVER > LAND USE CLASSES

OSTI Identifier: 2318723

DOI: https://doi.org/10.15485/2318723

Citation Formats


Gary, Stefan, Scheibe, Timothy D., Rexer, Em, Wilde, Michael, Vidal Torreira, Alvaro, Garayburu-Caruso, Vanessa A., Goldman, Amy E., and Stegen, James C. Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”. United States: N. p., 2024. Web. doi:10.15485/2318723.


Gary, Stefan, Scheibe, Timothy D., Rexer, Em, Wilde, Michael, Vidal Torreira, Alvaro, Garayburu-Caruso, Vanessa A., Goldman, Amy E., & Stegen, James C. Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”. United States. https://doi.org/10.15485/2318723


Gary, Stefan, Scheibe, Timothy D., Rexer, Em, Wilde, Michael, Vidal Torreira, Alvaro, Garayburu-Caruso, Vanessa A., Goldman, Amy E., and Stegen, James C. 2024. "Models, data, and scripts associated with ‘Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning’". United States. https://doi.org/10.15485/2318723. https://www.osti.gov/servlets/purl/2318723. Pub date: February 23, 2024.


                    
@article{osti_2318723,

  title        = {Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”},

  author       = {Gary, Stefan and Scheibe, Timothy D. and Rexer, Em and Wilde, Michael and Vidal Torreira, Alvaro and Garayburu-Caruso, Vanessa A. and Goldman, Amy E. and Stegen, James C.},

  abstractNote = {This data package is associated with the publication “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning” submitted to the Journal of Geophysical Research: Machine Learning and Computation (Scheibe et al. 2024). River sediment respiration observations are expensive and labor intensive to obtain, and there is no physical model for predicting this quantity. The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) observational data set (Goldman et al., 2020) is used to train machine learning (ML) models to predict respiration rates at unsampled sites. This repository archives training data, ML models, predictions, and model evaluation results to support reproducibility of the results in the associated manuscript and community reuse of the ML models trained in this project. One of the key challenges in this work was to find an optimum configuration for ML models to work with this feature-rich (i.e., 100+ possible input variables) data set. We used a two-tiered approach to manage the analysis of this complex data set: 1) a stacked ensemble of ML models that can automatically optimize hyperparameters to accelerate model selection and tuning, and 2) feature permutation importance (FPI) to iteratively select the most important features (i.e., inputs) to the ML models. The major elements of this ML workflow are modular, portable, open, and cloud-based, making this implementation a potential template for other applications. This data package is associated with the GitHub repository found at https://github.com/parallelworks/sl-archive-whondrs. A static copy of the GitHub repository is included in this data package as an archived version at the time of publishing this data package (March 2023). However, we recommend accessing these files via GitHub for full functionality. Please see the file-level metadata (flmd; “sl-archive-whondrs_flmd.csv”) for a list of all files contained in this data package and descriptions for each. Please see the data dictionary (dd; “sl-archive-whondrs_dd.csv”) for a list of all column headers contained within comma-separated value (csv) files in this data package and descriptions for each. The GitHub repository is organized into five top-level directories: (1) “input_data” holds the training data for the ML models; (2) “ml_models” holds machine learning models trained on the data in “input_data”; (3) “scripts” contains data preprocessing and postprocessing scripts and intermediate results specific to this data set that bookend the ML workflow; (4) “examples” contains the visualization of the results in this repository, including plotting scripts for the manuscript (e.g., model evaluation, FPI results) and scripts for running predictions with the ML models (i.e., reusing the trained ML models); (5) “output_data” holds the overall results of the ML model on that branch. Each trained ML model resides on its own branch in the repository; this means that inputs and outputs can differ from branch to branch. Furthermore, depending on the number of features used to train the ML models, the preprocessing and postprocessing scripts, and their intermediate results, can also differ from branch to branch. The “main-*” branches are meant to be starting points (i.e., trunks) for each model branch (i.e., sprouts). Please see the Branch Navigation section in the top-level README.md in the GitHub repository for more details. There is also one hidden directory, “.github/workflows”. This hidden directory contains information on how to run the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.},

  doi          = {10.15485/2318723},

  place        = {United States},

  year         = {2024},

  month        = feb

}



Similar records in DOE Data Explorer and OSTI.GOV collections:

Data and scripts associated with the manuscript "Encoding Diel Hysteresis and the Birch Effect in Dryland Soil Respiration Models through Knowledge-Guided Deep Learning"

Dataset: Jiang, Peishi; Chen, Xingyuan; Missik, Justine; ...

This package contains the data and scripts used in "Encoding Diel Hysteresis and the Birch Effect in Dryland Soil Respiration Models through Knowledge-Guided Deep Learning" (Jiang et al., 2022). The data.zip file contains the flux tower and automated chamber observations used for developing the deep learning model for modeling soil respiration. The scripts.zip file contains the Jupyter notebooks and Python scripts for preprocessing the data, training the deep learning models, and postprocessing the results. The src.zip contains the source code for training the deep learning model, performing mutual information analysis, and plotting functions. The trained_models.zip contains multiple folders used for ...

Data and scripts associated with a manuscript investigating dissolved organic matter and microbial community linkages across seven globally distributed rivers

Dataset: Danczak, Robert E.; Goldman, Amy E.; Borton, Mikayla A.; ...

This data package is associated with the publication “Meta-metabolome ecology reveals that geochemistry and microbial functional potential are linked to organic matter development across seven rivers” submitted to Science of the Total Environment. This data package includes the data necessary to replicate the analyses presented within the manuscript to investigate dissolved organic matter (DOM) development across broad spatial distances and within divergent biomes. Specifically, we included the Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS) data, geochemistry data, annotated metagenomic data, and results from ecological null modeling analyses in this data package. Additionally, we included the scripts necessary to generate the ...

Data and Scripts Associated with the Manuscript “Yakima River Basin Water Column Respiration is a Minor Component of River Ecosystem Respiration”

Dataset: Fulton, Stephanie G.; Barnes, Morgan; Borton, Mikayla A.; ...

This data package is associated with the publication “Yakima River Basin Water Column Respiration is a Minor Component of River Ecosystem Respiration” submitted to EGU Biogeochemistry (Fulton et al. 2023). In this research, water column respiration (ERwc) data, surface water chemistry data, organic matter (OM) chemistry data, and publicly available geospatial data were used in a multiple linear regression model to evaluate the drivers of spatial variability in ERwc at 47 sites across the Yakima River basin in Washington, USA. The data package includes the data inputs and outputs, and R scripts to calculate descriptive statistics, run the multiple linear regression ...

WHONDRS River Corridor Dissolved Oxygen, Temperature, Sediment Aerobic Respiration, Grain Size, and Water Chemistry from Machine-Learning-Informed Sites across the Contiguous United States (v3)

Dataset: Forbes, Brieanne; Barnes, Morgan; Boehnke, Brandon T.; ...

This dataset supports a broader study examining hyporheic zone respiration rates to improve predictive models at a contiguous United States (CONUS) scale. The CONUS-Scale Model-Sample Study (CM) was designed following ICON (integrated, coordinated, open, and networked) principles to facilitate a model-experiment (ModEx) iteration approach, leveraging crowdsourced sampling across the CONUS. New machine learning models are created every month to guide sampling locations. Data from the resulting samples are used to test and rebuild the machine learning models for the next round of sampling guidance. Sampling began in April 2022 and ended in October 2023. In addition to widely distributed CONUS ...

Model Inputs, Outputs, and Scripts associated with: “Spatial microbial respiration variations in the hyporheic zones within the Columbia River Basin”

Dataset: Son, Kyongho; Fang, Yilin; Gomez-Velez, Jesus D.; ...

This data package is associated with the publication “Spatial microbial respiration variations in the hyporheic zones within the Columbia River Basin” published in the Journal of Geophysical Research: Biogeosciences (Son et al. 2022), available at doi: 10.1029/2021JG006654. This data package includes the key model inputs/outputs of the river corridor model for the Columbia River Basin (CRB) and the model source codes, which were used in the manuscript. The model is a carbon-nitrogen-coupled river corridor model (RCM), and the model is used to quantify hyporheic zone (HZ) aerobic and anaerobic respiration at the NHDPLUS stream reach scales. The RCM used in ...
