Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems
Journal Article · Journal of Parallel and Distributed Computing
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Tennessee, Knoxville, TN (United States)
- Univ. of Tennessee, Knoxville, TN (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Non-volatile devices, such as SSDs, will be an integral part of the deepening storage hierarchy on large-scale HPC systems. These devices can sit on the compute nodes as part of a distributed burst buffer service, or they can be external. Wherever they are located in the hierarchy, one critical design issue is SSD endurance under write-heavy workloads, such as the checkpoint I/O of scientific applications. For these environments, it is widely assumed that checkpoint operations can occur once every 60 min and that each checkpoint step can write out as much as half of the system memory. Unfortunately, for large-scale HPC applications, the burst buffer SSDs can wear out much more quickly given the extensive amount of data written at every checkpoint step. One possible solution is to control the amount of data written by reducing the checkpoint frequency. However, a direct effect of reduced checkpoint frequency is a longer window of vulnerability to system failures and therefore potentially wasted computation time, especially for large-scale compute jobs. In this paper, we propose a new checkpoint placement optimization model which collaboratively utilizes both the burst buffer and the parallel file system to store the checkpoints, with the design goals of maximizing computation efficiency while guaranteeing the SSD endurance requirements. Moreover, we present an adaptive algorithm which can dynamically adjust the checkpoint placement based on the system’s dynamic runtime characteristics and continuously optimize the burst buffer utilization. The evaluation results show that by using our adaptive checkpoint placement algorithm we can guarantee the burst buffer endurance with at most 5% performance degradation per application and less than 3% for the entire system.
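The endurance concern in the abstract can be made concrete with a back-of-the-envelope calculation. The sketch below is not the paper's optimization model; it only estimates how much checkpoint data a node-local burst buffer SSD would absorb per day under the abstract's stated assumptions (one checkpoint every 60 min, half of memory per checkpoint). The function names and the example hardware figures (512 GB of node memory, a 1600 GB SSD rated at 1 drive write per day) are hypothetical and chosen purely for illustration.

```python
def daily_checkpoint_writes_gb(node_memory_gb, memory_fraction=0.5, interval_min=60):
    """GB written per day if every checkpoint goes to the burst buffer."""
    checkpoints_per_day = 24 * 60 / interval_min
    return checkpoints_per_day * memory_fraction * node_memory_gb


def drive_writes_per_day(node_memory_gb, ssd_capacity_gb,
                         memory_fraction=0.5, interval_min=60):
    """Checkpoint write volume expressed in full-drive writes per day (DWPD)."""
    daily_gb = daily_checkpoint_writes_gb(node_memory_gb, memory_fraction, interval_min)
    return daily_gb / ssd_capacity_gb


def min_interval_min(node_memory_gb, ssd_capacity_gb, rated_dwpd, memory_fraction=0.5):
    """Shortest checkpoint interval (minutes) that stays within the rated DWPD
    when all checkpoints are written to the SSD."""
    gb_per_checkpoint = memory_fraction * node_memory_gb
    max_checkpoints_per_day = rated_dwpd * ssd_capacity_gb / gb_per_checkpoint
    return 24 * 60 / max_checkpoints_per_day


if __name__ == "__main__":
    # Hypothetical node: 512 GB memory, 1600 GB SSD rated at 1 DWPD.
    print(daily_checkpoint_writes_gb(512))
    print(drive_writes_per_day(512, 1600))
    print(min_interval_min(512, 1600, rated_dwpd=1.0))
```

With these illustrative numbers, hourly checkpoints demand several times the assumed endurance rating, so the SSD-only options are to checkpoint far less often or to exceed the rating. This is the tension that motivates placing some checkpoints on the parallel file system instead.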
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- Grant/Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1648989
- Alternate ID(s):
- OSTI ID: 1410852; OSTI ID: 1565566
- Journal Information:
- Journal Name: Journal of Parallel and Distributed Computing; Journal Volume: 100; Journal Issue: 2; ISSN 0743-7315
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
Similar Records
Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems
Journal Article · Tue Jan 31 23:00:00 EST 2017 · Journal of Parallel and Distributed Computing · OSTI ID: 1565566
Orchestrating Fault Prediction with Live Migration and Checkpointing
Conference · Mon Jun 01 00:00:00 EDT 2020 · OSTI ID: 1648858
SPARC: Demonstrate burst-buffer-based checkpoint/restart on ATS-1.
Technical Report · Sun Dec 31 23:00:00 EST 2017 · OSTI ID: 1417577