Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems

Tiwari, Devesh; Gupta, Saurabh; Vazhkudai, Sudharshan S

doi:10.1109/DSN.2014.101

Conference · Jan 1, 2014

DOI: https://doi.org/10.1109/DSN.2014.101 · OSTI ID: 1130431

Tiwari, Devesh [1]; Gupta, Saurabh [1]; Vazhkudai, Sudharshan S [1]

[1] ORNL

The continuing increase in the computational power of supercomputers has enabled large-scale scientific applications in astrophysics, fusion, climate, and combustion to run larger and longer simulations, facilitating deeper scientific insights. However, these long-running simulations are often interrupted by multiple system failures. These applications therefore rely on "checkpointing" as a resilience mechanism: they store application state to permanent storage and recover from it after a failure.

Unfortunately, checkpointing incurs excessive I/O overhead on supercomputers because of the large size of checkpoints, resulting in sub-optimal performance and resource utilization. In this paper, we devise novel mechanisms that show how checkpointing overhead can be mitigated significantly by exploiting the temporal characteristics of system failures. We provide new insights and a detailed quantitative understanding of checkpointing overheads and trade-offs on large-scale machines. Our prototype implementation shows the viability of our approach on extreme-scale machines.

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)

Sponsoring Organization: USDOE

DOE Contract Number: DE-AC05-00OR22725

OSTI ID: 1130431

Resource Relation: Conference: The 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2014), Atlanta, GA, USA, June 23-26, 2014

Country of Publication: United States

Language: English

Similar Records

Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems, In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

Conference · Sun Jun 01 00:00:00 EDT 2014 · 2014 44TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN) · OSTI ID:1130431

Tiwari, Devesh; Gupta, Saurabh; Vazhkudai, Sudharshan S.

A case for Virtual Machine based Fault Injection in a High-Performance Computing Environment

Conference · Sat Jan 01 00:00:00 EST 2011 · OSTI ID:1130431

Vallee, Geoffroy R; Engelmann, Christian; Scott, Stephen L

Resiliency in numerical algorithm design for extreme scale simulations

Journal Article · Fri Dec 10 00:00:00 EST 2021 · International Journal of High Performance Computing Applications · OSTI ID:1130431

Agullo, Emmanuel; Altenbernd, Mirco; Anzt, Hartwig; +33 more
