OSTI.GOV: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Design and Performance Characterization of RADICAL-Pilot on Leadership-Class Platforms

Journal Article · IEEE Transactions on Parallel and Distributed Systems

Many extreme-scale scientific applications have workloads composed of a large number of individual high-performance tasks. The Pilot abstraction decouples workload specification, resource management, and task execution via job placeholders and late binding. As such, suitable implementations of the Pilot abstraction can support the collective execution of a large number of tasks on supercomputers. We introduce RADICAL-Pilot (RP) as a portable, modular, and extensible Pilot-enabled runtime system. We describe RP's design, architecture, and implementation. We characterize its performance and show its ability to scalably execute workloads composed of tens of thousands of heterogeneous tasks on DOE and NSF leadership-class HPC platforms. Specifically, we investigate RP's weak and strong scaling with CPU and GPU, single-core and multi-core, and MPI and non-MPI tasks, as well as Python functions, when using most of ORNL Summit and TACC Frontera. RADICAL-Pilot can be used stand-alone or as the runtime for third-party workflow systems.

Research Organization:
Brookhaven National Laboratory (BNL), Upton, NY (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research
Grant/Contract Number:
SC0012704
OSTI ID:
1830194
Report Number(s):
BNL-222357-2021-JAAM
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 33, Issue 4; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
