Understanding checkpointing overheads on massive-scale systems: analysis of the IBM Blue Gene/P system.
- Mathematics and Computer Science
Providing fault tolerance in high-end petascale systems, consisting of millions of hardware components and complex software stacks, is becoming an increasingly challenging task. Checkpointing continues to be the most prevalent technique for providing fault tolerance in such high-end systems. Considerable research has focused on optimizing checkpointing; in practice, however, checkpointing still imposes a high overhead on users. In this paper, we study the checkpointing overhead seen by various applications running on leadership-class machines such as the IBM Blue Gene/P at Argonne National Laboratory. In addition to studying popular applications, we design a methodology to help users understand and intelligently choose an optimal checkpointing frequency that reduces the overall checkpointing overhead incurred. In particular, we study the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, the Nek5000 computational fluid dynamics application, and the Parallel Ocean Program application, and we analyze their memory usage and possible checkpointing trends on 65,536 processors of the Blue Gene/P system.
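The trade-off behind choosing a checkpointing frequency can be illustrated with the classic first-order model attributed to Young, in which the optimal compute time between checkpoints is roughly sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint and MTBF is the system's mean time between failures. The Python sketch below applies that textbook model with purely illustrative numbers; it is not the methodology developed in this paper, and the checkpoint cost and MTBF values are assumptions chosen only for the example.

```python
import math


def young_optimal_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order (Young) approximation of the optimal compute time
    between checkpoints, in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_overhead_fraction(interval_s: float,
                               checkpoint_cost_s: float,
                               mtbf_s: float) -> float:
    """Rough fraction of wall time lost to writing checkpoints plus the
    expected rework after a failure (half an interval on average)."""
    checkpoint_waste = checkpoint_cost_s / interval_s
    rework_waste = (interval_s / 2.0 + checkpoint_cost_s) / mtbf_s
    return checkpoint_waste + rework_waste


if __name__ == "__main__":
    # Assumed, illustrative numbers only: a 30-minute checkpoint write
    # (e.g., a large memory footprint drained to a shared parallel file
    # system) and a 24-hour system mean time between failures.
    c = 30 * 60.0        # checkpoint cost, seconds (assumption)
    mtbf = 24 * 3600.0   # mean time between failures, seconds (assumption)

    tau = young_optimal_interval(c, mtbf)
    print(f"optimal checkpoint interval ~ {tau / 3600.0:.2f} h")
    print(f"overhead at that interval ~ "
          f"{100 * expected_overhead_fraction(tau, c, mtbf):.1f} %")
```

With these assumed inputs the sketch suggests checkpointing roughly every 4.9 hours; the point of the example is only that the best interval grows with checkpoint cost and shrinks as failures become more frequent, which is the kind of trade-off the paper's methodology is meant to expose to users.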
- Research Organization:
- Argonne National Laboratory (ANL)
- Sponsoring Organization:
- SC
- DOE Contract Number:
- AC02-06CH11357
- OSTI ID:
- 1015548
- Report Number(s):
- ANL/MCS/JA-67132
- Journal Information:
- Int. J. High Perform. Comput. Appl., Vol. 25, Issue 2, May 2011; ISSN 1094-3420
- Country of Publication:
- United States
- Language:
- English