OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: RADICAL-Pilot and PMIx/PRRTE: Executing Heterogeneous Workloads at Large Scale on Partitioned HPC Resources

Conference

Execution of heterogeneous workflows on high-performance computing (HPC) platforms presents unprecedented resource management and execution coordination challenges for runtime systems. Task heterogeneity increases the complexity of resource and execution management, limiting the scalability and efficiency of workflow execution. Partitioning resources and distributing task execution over the partitioned resources promises to address those problems, but we lack an experimental evaluation of its performance at scale. This paper provides a performance evaluation of the Process Management Interface for Exascale (PMIx) and its reference implementation PRRTE on the leadership-class HPC platform Summit, when integrated into a pilot-based runtime system called RADICAL-Pilot. We partition resources across multiple PRRTE Distributed Virtual Machine (DVM) environments, which are responsible for launching tasks via the PMIx interface. We experimentally measure workload execution performance in terms of task scheduling/launching rate, the distribution of DVM task placement times, and DVM startup and termination overheads on Summit. The integrated solution with PMIx/PRRTE provides an abstracted, standardized set of interfaces for orchestrating the launch process, together with dynamic process management and monitoring capabilities. It extends scaling capabilities, overcoming a limitation of other launching mechanisms (e.g., JSM/LSF). The different DVM setup configurations we explored provide insights into DVM performance and a layout that leverages it. Our experimental results show that a heterogeneous workload of 65,500 tasks on 2048 nodes, partitioned across 32 DVMs, runs steadily with resource utilization not lower than 52%. With fewer concurrently executed tasks, resource utilization reaches up to 85%, based on results for a heterogeneous workload of 8200 tasks on 256 nodes and 2 DVMs.
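
To make the partitioning arithmetic in the abstract concrete, the following is a minimal sketch of how a node allocation could be divided among DVMs. It assumes an even, contiguous split of node indices, which the abstract does not state; the function name `partition_nodes` is hypothetical and is not part of RADICAL-Pilot, PMIx, or PRRTE.

```python
from typing import List


def partition_nodes(node_count: int, dvm_count: int) -> List[range]:
    """Split node indices 0..node_count-1 into dvm_count contiguous partitions.

    Hypothetical helper for illustration only. When the split is uneven,
    the first partitions receive one extra node each.
    """
    if dvm_count <= 0 or node_count < dvm_count:
        raise ValueError("need at least one node per DVM")
    base, extra = divmod(node_count, dvm_count)
    partitions = []
    start = 0
    for i in range(dvm_count):
        size = base + (1 if i < extra else 0)
        partitions.append(range(start, start + size))
        start += size
    return partitions


# The two configurations reported in the abstract.
large = partition_nodes(2048, 32)  # 32 DVMs over 2048 nodes
small = partition_nodes(256, 2)    # 2 DVMs over 256 nodes
print(len(large[0]), len(small[0]))  # prints "64 128"
```

Under this even-split assumption, each DVM would manage 64 nodes in the 2048-node run and 128 nodes in the 256-node run. The actual layouts used in the experiments may differ.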

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1999093
Resource Relation:
Journal Volume: 13592; Conference: 25th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2022) in conjunction with IPDPS 2022 - Lyon, France - 6/3/2022
Country of Publication:
United States
Language:
English

Similar Records

RADICAL-Pilot and PMIx/PRRTE: Executing Heterogeneous Workloads at Large Scale on Partitioned HPC Resources
Journal Article · January 12, 2023 · Lecture Notes in Computer Science · OSTI ID:1999093

Characterizing the Performance of Executing Many-tasks on Summit
Conference · January 1, 2020 · OSTI ID:1999093

Design and Performance Characterization of RADICAL-Pilot on Leadership-Class Platforms
Journal Article · April 1, 2022 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1999093
