Analyzing checkpointing trends for applications on the IBM Blue Gene/P system.
Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability that a fault occurs during the lifetime of a process. Checkpointing has been a popular method of providing fault tolerance in high-end systems. While considerable research has been done to optimize checkpointing, in practice the method still imposes a high overhead on users. In this paper, we study the checkpointing overhead seen by applications running on leadership-class machines such as the IBM Blue Gene/P at Argonne National Laboratory. We study various applications and design a methodology to help users understand and choose a checkpointing frequency and reduce the overhead incurred. In particular, we study three popular applications -- the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, and a Nek5000 computational fluid dynamics application -- and analyze their memory usage and possible checkpointing trends on 32,768 processors of the Blue Gene/P system.
- Research Organization: Argonne National Laboratory (ANL)
- Sponsoring Organization: SC
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 982646
- Report Number(s): ANL/MCS/CP-64770
- Country of Publication: United States
- Language: English