U.S. Department of Energy
Office of Scientific and Technical Information

Asynchronous Execution of Heterogeneous Tasks in ML-Driven HPC Workflows

Conference

Heterogeneous scientific workflows consist of numerous types of tasks that require execution on heterogeneous resources. Asynchronous execution of those tasks is crucial to improve resource utilization and task throughput and to reduce workflows' makespan. Therefore, middleware capable of scheduling and executing different task types across heterogeneous resources must enable asynchronous execution of tasks. In this paper, we investigate the requirements and properties of the asynchronous task execution of machine learning (ML)-driven high-performance computing (HPC) workflows. We model the degree of asynchronicity permitted for arbitrary workflows and propose key metrics that can be used to determine qualitative benefits when employing asynchronous execution. Our experiments represent relevant scientific drivers and are performed at scale on Summit; they show that the performance enhancements due to asynchronous execution are consistent with our model.

Research Organization:
Brookhaven National Laboratory (BNL), Upton, NY (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
SC0012704
OSTI ID:
2333668
Report Number(s):
BNL-225411-2024-COPA
Resource Relation:
Conference: 26th Workshop on Job Scheduling Strategies for Parallel Processing, St. Petersburg, FL, 5/19/2023 - 5/22/2023
Country of Publication:
United States
Language:
English
