DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: ExaWorks software development kit: a robust and scalable collection of interoperable workflows technologies

Journal Article · Frontiers in High Performance Computing
 [1];  [2];  [3];  [4];  [5];  [6];  [7];  [7];  [3];  [8];  [2];  [3];  [2];  [4];  [1];  [7]
  1. Brookhaven National Laboratory (BNL), Upton, NY (United States); Rutgers Univ., New Brunswick, NJ (United States)
  2. Univ. of Chicago, IL (United States); Argonne National Laboratory (ANL), Argonne, IL (United States)
  3. Brookhaven National Laboratory (BNL), Upton, NY (United States)
  4. Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
  5. Rutgers Univ., New Brunswick, NJ (United States)
  6. Incomputable LLC, Highland Park, NJ (United States)
  7. Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
  8. Argonne National Laboratory (ANL), Argonne, IL (United States)

Scientific discovery increasingly requires executing heterogeneous scientific workflows on high-performance computing (HPC) platforms. Heterogeneous workflows contain different types of tasks (e.g., simulation, analysis, and learning) that need to be mapped, scheduled, and launched on different computing resources. That requires a software stack that enables users to code their workflows and automate resource management and workflow execution. Currently, there are many workflow technologies with diverse levels of robustness and capabilities, and users face difficult choices of software that can effectively and efficiently support their use cases on HPC machines, especially when considering the latest exascale platforms. We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK). The SDK is a curated collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms. We present our experience with (1) curating those technologies, (2) integrating them to provide users with new capabilities, (3) developing a continuous integration platform to test the SDK on DOE HPC platforms, (4) designing a dashboard to publish the results of those tests, and (5) devising an innovative documentation platform to help users use those technologies. Our experience details the requirements and the best practices needed to curate workflow technologies, and it also serves as a blueprint for the capabilities and services that DOE will have to offer to support a variety of heterogeneous scientific workflows on the newly available exascale HPC platforms.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF); National Institutes of Health (NIH)
Grant/Contract Number:
AC05-00OR22725; AC52-07NA27344; AC02-06CH11357; SC0012704
OSTI ID:
2476608
Journal Information:
Frontiers in High Performance Computing, Vol. 2; ISSN 2813-7337
Publisher:
Frontiers Media S.A.
Country of Publication:
United States
Language:
English
