Skip to main content
U.S. Department of Energy
Office of Scientific and Technical Information

Analyzing checkpointing trends for applications on the IBM Blue Gene/P system.

Conference ·
OSTI ID:982646

Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime of a process. Checkpointing has been a popular method of providing fault tolerance in high-end systems. While considerable research has been done to optimize checkpointing, in practice the method still involves a high-cost overhead for users. In this paper, we study the checkpointing overhead seen by applications running on leadership-class machines such as the IBM Blue Gene/P at Argonne National Laboratory. We study various applications and design a methodology to assist users in understanding and choosing checkpointing frequency and reducing the overhead incurred. In particular, we study three popular applications -- the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, and a Nek5000 computational fluid dynamics application -- and analyze their memory usage and possible checkpointing trends on 32,768 processors of the Blue Gene/P system.

Research Organization:
Argonne National Laboratory (ANL)
Sponsoring Organization:
SC
DOE Contract Number:
AC02-06CH11357
OSTI ID:
982646
Report Number(s):
ANL/MCS/CP-64770
Country of Publication:
United States
Language:
ENGLISH