U.S. Department of Energy
Office of Scientific and Technical Information

Running Ensemble Workflows at Extreme Scale: Lessons Learned and Path Forward

Conference

The ever-increasing volume of scientific data, combined with sophisticated techniques for extracting information from it, has led to the growing popularity of ensemble workflows, which are collections of runs of individual workflows. Scientists traditionally run ensembles by relying on simple scripts to execute the different runs and manage resources. This approach is error-prone and does not scale, which has motivated the development of workflow management systems that specialize in executing ensembles on HPC clusters. However, when both the ensemble and the target system reach extreme scales, existing workflow management systems face new challenges that hamper efficient execution. In this paper, we describe our experience scaling an ensemble workflow from the computational biology domain, from the early design stages to execution at extreme scale on Summit, a leadership-class supercomputer at the Oak Ridge National Laboratory. We discuss challenges that arise when scaling ensembles to several million runs on thousands of HPC nodes. We identify challenges with the composition of the ensemble itself, its execution at large scale, post-processing of the generated data, and scalability of the file system. Based on this experience, we develop a generic vision of the capabilities and abstractions that should be added to existing workflow management systems to enable the execution of ensemble workflows at extreme scales. We believe that understanding these fundamental challenges will help application teams and workflow system developers design the next generation of infrastructure for composing and executing extreme-scale ensemble workflows.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1968703
Resource Relation:
Conference: 18th International Conference on e-Science (e-Science 2022), Salt Lake City, Utah, United States of America, 10/10/2022 to 10/14/2022
Country of Publication:
United States
Language:
English
