Asynchronous Execution of Heterogeneous Tasks in ML-Driven HPC Workflows
Heterogeneous scientific workflows consist of numerous types of tasks that require execution on heterogeneous resources. Asynchronous execution of those tasks is crucial to improving resource utilization and task throughput and to reducing workflow makespan. Therefore, middleware capable of scheduling and executing different task types across heterogeneous resources must enable asynchronous execution of tasks. In this paper, we investigate the requirements and properties of asynchronous task execution in machine learning (ML)-driven high-performance computing (HPC) workflows. We model the degree of asynchronicity permitted for arbitrary workflows and propose key metrics that can be used to determine qualitative benefits when employing asynchronous execution. Our experiments, which represent relevant scientific drivers and which we perform at scale on Summit, show that the performance enhancements due to asynchronous execution are consistent with our model.
- Research Organization:
- Brookhaven National Laboratory (BNL), Upton, NY (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- SC0012704
- OSTI ID:
- 2333668
- Report Number(s):
- BNL-225411-2024-COPA
- Resource Relation:
- Conference: 26th Workshop on Job Scheduling Strategies for Parallel Processing, St. Petersburg, FL, 5/19/2023 - 5/22/2023
- Country of Publication:
- United States
- Language:
- English
Similar Records
RADICAL-Pilot and PMIx/PRRTE: Executing Heterogeneous Workloads at Large Scale on Partitioned HPC Resources
Highly interactive, steered scientific workflows on HPC systems: optimizing design solutions