
Title: Analyzing checkpointing trends for applications on the IBM Blue Gene/P system.

Abstract

Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime of a process. Checkpointing has been a popular method of providing fault tolerance in high-end systems. While considerable research has been done to optimize checkpointing, in practice the method still involves a high-cost overhead for users. In this paper, we study the checkpointing overhead seen by applications running on leadership-class machines such as the IBM Blue Gene/P at Argonne National Laboratory. We study various applications and design a methodology to assist users in understanding and choosing checkpointing frequency and reducing the overhead incurred. In particular, we study three popular applications -- the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, and a Nek5000 computational fluid dynamics application -- and analyze their memory usage and possible checkpointing trends on 32,768 processors of the Blue Gene/P system.

Authors:
Naik, H; Gupta, R; Beckman, P
Publication Date:
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
982646
Report Number(s):
ANL/MCS/CP-64770
TRN: US201015%%1256
DOE Contract Number:  
DE-AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Conference: 2nd International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2); Sep. 22, 2009 - Sep. 25, 2009; Vienna, Austria
Country of Publication:
United States
Language:
ENGLISH
Subject:
97 MATHEMATICAL METHODS AND COMPUTING; 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ANL; COMPUTERIZED SIMULATION; DESIGN; DYNAMICS; FAULT TOLERANT COMPUTERS; FLUID MECHANICS; LIFETIME; PROBABILITY; PROGRAMMING; SUPERCOMPUTERS

Citation Formats

Naik, H, Gupta, R, Beckman, P, and Mathematics and Computer Science. Analyzing checkpointing trends for applications on the IBM Blue Gene/P system. United States: N. p., 2009. Web.
Naik, H, Gupta, R, Beckman, P, & Mathematics and Computer Science. Analyzing checkpointing trends for applications on the IBM Blue Gene/P system. United States.
Naik, H, Gupta, R, Beckman, P, and Mathematics and Computer Science. 2009. "Analyzing checkpointing trends for applications on the IBM Blue Gene/P system." United States.
@article{osti_982646,
title = {Analyzing checkpointing trends for applications on the IBM Blue Gene/P system.},
author = {Naik, H and Gupta, R and Beckman, P and Mathematics and Computer Science},
abstractNote = {Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime of a process. Checkpointing has been a popular method of providing fault tolerance in high-end systems. While considerable research has been done to optimize checkpointing, in practice the method still involves a high-cost overhead for users. In this paper, we study the checkpointing overhead seen by applications running on leadership-class machines such as the IBM Blue Gene/P at Argonne National Laboratory. We study various applications and design a methodology to assist users in understanding and choosing checkpointing frequency and reducing the overhead incurred. In particular, we study three popular applications -- the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, and a Nek5000 computational fluid dynamics application -- and analyze their memory usage and possible checkpointing trends on 32,768 processors of the Blue Gene/P system.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2009},
month = {1}
}

