RADICAL-Pilot and PMIx/PRRTE: Executing Heterogeneous Workloads at Large Scale on Partitioned HPC Resources

Titov, Mikhail; Turilli, Matteo; Merzky, Andre; Naughton, Thomas; Elwasif, Wael; Jha, Shantenu

doi:10.1007/978-3-031-22698-4_5

Journal Article · January 12, 2023 · Lecture Notes in Computer Science

DOI: https://doi.org/10.1007/978-3-031-22698-4_5 · OSTI ID: 1963184

Brookhaven National Laboratory (BNL), Upton, NY (United States)
Brookhaven National Laboratory (BNL), Upton, NY (United States); Rutgers University, Piscataway, NJ (United States)
Rutgers University, Piscataway, NJ (United States)
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Execution of heterogeneous workflows on high-performance computing (HPC) platforms presents unprecedented resource management and execution coordination challenges for runtime systems. Task heterogeneity increases the complexity of resource and execution management, limiting the scalability and efficiency of workflow execution. Resource partitioning and distribution of task execution over partitioned resources promise to address those problems, but we lack an experimental evaluation of their performance at scale. This paper provides a performance evaluation of the Process Management Interface for Exascale (PMIx) and its reference implementation PRRTE on the leadership-class HPC platform Summit, when integrated into a pilot-based runtime system called RADICAL-Pilot. We partition resources across multiple PRRTE Distributed Virtual Machine (DVM) environments, which are responsible for launching tasks via the PMIx interface. We experimentally measure workload execution performance in terms of task scheduling/launching rate, the distribution of DVM task placement times, and DVM startup and termination overheads. The integrated solution with PMIx/PRRTE enables the use of an abstracted, standardized set of interfaces for orchestrating the launch process, dynamic process management, and monitoring capabilities. It extends scaling capabilities, overcoming a limitation of other launching mechanisms (e.g., JSM/LSF). The different DVM setup configurations we explored provide insights into DVM performance and a layout to leverage it. Our experimental results show that a heterogeneous workload of 65,500 tasks on 2048 nodes, partitioned across 32 DVMs, runs steadily with resource utilization not lower than 52%. With fewer concurrently executed tasks, resource utilization reaches up to 85%, based on results for a heterogeneous workload of 8200 tasks on 256 nodes and 2 DVMs.
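To make the scale of the two reported configurations concrete, the following is a minimal sketch of how nodes and tasks might be divided evenly across DVMs. It is only an illustration under stated assumptions: the function names `partition_nodes` and `tasks_per_dvm` are hypothetical, the even split is assumed rather than taken from the paper, and none of this is the RADICAL-Pilot or PMIx/PRRTE API.

```python
# Illustrative sketch only: hypothetical helpers, not RADICAL-Pilot or PRRTE code.
# Assumes nodes and tasks are split as evenly as possible across DVMs.


def partition_nodes(num_nodes: int, num_dvms: int) -> list[range]:
    """Split node indices 0..num_nodes-1 into contiguous, near-equal partitions."""
    if num_dvms <= 0 or num_nodes < num_dvms:
        raise ValueError("need at least one node per DVM")
    base, extra = divmod(num_nodes, num_dvms)
    partitions = []
    start = 0
    for i in range(num_dvms):
        # The first `extra` partitions take one additional node.
        size = base + (1 if i < extra else 0)
        partitions.append(range(start, start + size))
        start += size
    return partitions


def tasks_per_dvm(num_tasks: int, num_dvms: int) -> list[int]:
    """Return how many tasks each DVM receives under an even split."""
    base, extra = divmod(num_tasks, num_dvms)
    return [base + (1 if i < extra else 0) for i in range(num_dvms)]


if __name__ == "__main__":
    # The two configurations reported in the abstract: (nodes, DVMs, tasks).
    for nodes, dvms, tasks in [(2048, 32, 65_500), (256, 2, 8_200)]:
        parts = partition_nodes(nodes, dvms)
        counts = tasks_per_dvm(tasks, dvms)
        print(f"{nodes} nodes / {dvms} DVMs: {len(parts[0])} nodes per DVM, "
              f"{min(counts)}-{max(counts)} tasks per DVM")
```

Under this even-split assumption, the larger run corresponds to 64 nodes per DVM and the smaller one to 128 nodes per DVM; the actual placement policy used in the experiments may differ.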

Research Organization: Brookhaven National Laboratory (BNL), Upton, NY (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE Office of Science (SC), High Energy Physics (HEP)

Grant/Contract Number: SC0012704; AC05-00OR22725

OSTI ID: 1963184

Report Number(s): BNL-224123-2023-JAAM

Journal Information: Lecture Notes in Computer Science, Vol. 13592; Conference: 25th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2022), Held Virtually, 3 Jun 2022; ISSN 0302-9743

Publisher: Springer

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
high performance computing
resource management
middleware
runtime system
runtime environment
