Checkpointing Strategies for Shared High-Performance Computing Platforms
Journal Article · International Journal of Networking and Computing
- Univ. of Tennessee, Knoxville, TN (United States)
- ENS Lyon (France); Univ. of Tennessee, Knoxville, TN (United States)
- Emory Univ., Atlanta, GA (United States)
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Univ. of Manchester (United Kingdom); Univ. of Tennessee, Knoxville, TN (United States)
Input/output (I/O) from various sources often contends for scarce bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments. However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale. We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Altogether, these results provide critical analysis and direct guidance on how to design efficient, CR-ready, large-scale platforms without a large investment in the I/O subsystem.
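For readers unfamiliar with the Young/Daly interval mentioned in the abstract, the following is a minimal sketch of its commonly cited first-order form, T = sqrt(2 * C * mu), where C is the time to write one checkpoint and mu is the mean time between failures. The function names, parameter names, and the simplified waste expression C/T + T/(2*mu) (which ignores downtime and recovery) are illustrative assumptions and are not taken from the article; the article's own model of shared-platform waste is not reproduced here.

```python
import math


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order Young/Daly checkpointing period: sqrt(2 * C * mu).

    checkpoint_cost_s (C) and mtbf_s (mu) are hypothetical parameter names;
    both are expressed in seconds and must be positive.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Simplified first-order waste fraction: C/T + T/(2*mu).

    The first term is time spent checkpointing; the second is the expected
    re-execution after a failure. Downtime and recovery cost are ignored
    (an assumption made to keep the sketch short).
    """
    if period_s <= 0:
        raise ValueError("period must be positive")
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    C, MU = 50.0, 10_000.0  # illustrative values, not from the article
    T = young_daly_period(C, MU)
    print(f"period = {T:.1f} s, waste = {first_order_waste(T, C, MU):.3f}")
```

In this single-application view, C is treated as a fixed constant. On a shared platform, the time to write a checkpoint depends on the bandwidth the application actually obtains, which is one way to read the abstract's claim that the formula alone does not address contention at scale.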
- Research Organization:
- Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
- Grant/Contract Number:
- AC04-94AL85000
- OSTI ID:
- 1492861
- Report Number(s):
- SAND--2018-12751J; 669702
- Journal Information:
- International Journal of Networking and Computing, Vol. 9, Issue 1; ISSN 2185-2839
- Publisher:
- IJNC Editorial Committee
- Country of Publication:
- United States
- Language:
- English
Similar Records
Checkpointing Shared Memory Programs at the Application-level
Conference · Wed Sep 08 00:00:00 EDT 2004 · OSTI ID:15014797
A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?
Journal Article · Wed Jul 17 20:00:00 EDT 2024 · Future Generation Computer Systems · OSTI ID:2406527
A checkpoint compression study for high-performance computing systems
Journal Article · Mon Feb 16 23:00:00 EST 2015 · International Journal of High Performance Computing Applications · OSTI ID:1426906