Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning
With the emergence of versatile storage systems, multi-level checkpointing (MLC) has become a common approach to gain efficiency. However, multi-level checkpoint/restart can cause enormous I/O traffic on HPC systems. To use multi-level checkpointing efficiently, it is important to optimize checkpoint/restart configurations. Current approaches, namely modeling and simulation, are either inaccurate or slow in determining the optimal configuration for a large-scale system. In this paper, we show that machine learning models can be used in combination with accurate simulation to determine the optimal checkpoint configurations. We also demonstrate that more advanced techniques such as neural networks can further improve performance in optimizing checkpoint configurations.
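As a rough illustration of combining simulation with a learned model, the sketch below runs a toy simulator on a few checkpoint intervals, fits a small nearest-neighbor surrogate to the results, and uses the surrogate to rank a denser set of candidate intervals. Everything in it is an assumption made for illustration: the single-level simulator, the exponential failure model, the parameter values, the k-nearest-neighbor surrogate, and the function names `simulate_efficiency`, `knn_predict`, and `best_interval`. It is not the simulator or the models used in the paper.

```python
import random


def simulate_efficiency(interval, ckpt_cost=5.0, restart_cost=10.0,
                        mtbf=500.0, work=5_000.0, seed=0):
    """Toy single-level simulator (hypothetical): useful work / wall time."""
    rng = random.Random(seed)
    done = 0.0    # useful work that has been safely checkpointed
    clock = 0.0   # elapsed wall-clock time
    next_failure = rng.expovariate(1.0 / mtbf)
    while done < work:
        segment = min(interval, work - done)
        end = clock + segment + ckpt_cost
        if end <= next_failure:
            # The segment and its checkpoint completed before the failure.
            clock = end
            done += segment
        else:
            # Failure: the segment in progress is lost, then we restart.
            clock = next_failure + restart_cost
            next_failure = clock + rng.expovariate(1.0 / mtbf)
    return work / clock


def knn_predict(samples, x, k=3):
    """Average the efficiency of the k sampled intervals closest to x."""
    nearest = sorted(samples, key=lambda s: abs(s[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def best_interval(train_intervals, candidates, k=3, runs=5):
    """Simulate a few intervals, then rank all candidates with the surrogate."""
    samples = []
    for t in train_intervals:
        mean_eff = sum(simulate_efficiency(t, seed=s) for s in range(runs)) / runs
        samples.append((t, mean_eff))
    return max(candidates, key=lambda t: knn_predict(samples, t, k))


if __name__ == "__main__":
    train = [10, 20, 40, 80, 160, 320]   # intervals that are actually simulated
    candidates = list(range(10, 321, 10))  # intervals scored by the surrogate only
    print("suggested interval:", best_interval(train, candidates))
```

The point of the design is that only six intervals are simulated, while the surrogate scores all candidates cheaply. A real multi-level setting would have several configuration parameters per level and a far more expensive simulator, which is where a learned model would save the most time.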
- Research Organization: Argonne National Laboratory (ANL)
- Sponsoring Organization: U.S. Department of Energy (Office not specified); USDOE Office of Science; National Science Foundation (NSF)
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1770373
- Country of Publication: United States
- Language: English