Checkpoint/Restart Vision and Strategies for NERSC’s Production Workloads
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) improves scientific productivity for users, provides scheduling flexibility for computing centers, and protects against system failures. While both application-specific (or application-level) and transparent C/R are used in practice, we are interested in transparent checkpointing, which is vital for system-level checkpointing. Developing and maintaining transparent C/R tools for HPC applications, however, is labor-intensive and highly complex due to ever-changing HPC systems and diverse production workloads. Existing C/R tools are often research-oriented, so there is a gap to close before they can be used reliably with production workloads, especially on cutting-edge HPC systems. In this position paper, we present our journey to prepare a production-ready MPI-Agnostic Network-Agnostic (MANA) transparent checkpointing tool for NERSC, and share our vision and strategies to bring transparent C/R capabilities to NERSC’s production workloads on current and future systems.
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Basic Energy Sciences (BES). Scientific User Facilities Division
- DOE Contract Number:
- AC02-05CH11231
- OSTI ID:
- 1814161
- Country of Publication:
- United States
- Language:
- English
Similar Records
- Requirements for Linux Checkpoint/Restart · Technical Report · Mon Feb 25 23:00:00 EST 2002 · OSTI ID:793773
- Affinity-aware checkpoint restart · Journal Article · Sun Dec 07 19:00:00 EST 2014 · ACM Digital Library · OSTI ID:1342535
- Checkpoint/restart-enabled parallel debugging · Conference · Thu Nov 11 23:00:00 EST 2010 · OSTI ID:1407087