U.S. Department of Energy
Office of Scientific and Technical Information

Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems

Conference ·
DOI: https://doi.org/10.1109/DSN.2014.101 · OSTI ID: 1130431

The continuing increase in the computational power of supercomputers has enabled large-scale scientific applications in astrophysics, fusion, climate, and combustion to run larger and longer simulations, facilitating deeper scientific insights. However, these long-running simulations are often interrupted by multiple system failures. These applications therefore rely on "checkpointing" as a resilience mechanism to store application state to permanent storage and recover from failures.

Unfortunately, checkpointing incurs excessive I/O overhead on supercomputers because of the large size of checkpoints, resulting in suboptimal performance and resource utilization. In this paper, we devise novel mechanisms that show how checkpointing overhead can be mitigated significantly by exploiting the temporal characteristics of system failures. We provide new insights into, and a detailed quantitative understanding of, checkpointing overheads and trade-offs on large-scale machines. Our prototype implementation shows the viability of our approach on extreme-scale machines.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1130431
Country of Publication:
United States
Language:
English

Similar Records

Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems, In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Conference · Sun Jun 01 00:00:00 EDT 2014 · 2014 44TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN) · OSTI ID:1567365

Understanding checkpointing overheads on massive-scale systems: analysis of the IBM Blue Gene/P system
Journal Article · Sun May 01 00:00:00 EDT 2011 · Int. J. High Perform. Comput. Appl. · OSTI ID:1015548

McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
Journal Article · Mon Dec 31 23:00:00 EST 2012 · Scientific Programming · OSTI ID:1197891

Related Subjects