OSTI.GOV: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Understanding checkpointing overheads on massive-scale systems: analysis of the IBM Blue Gene/P system

Abstract

Providing fault tolerance in high-end petascale systems, consisting of millions of hardware components and complex software stacks, is becoming an increasingly challenging task. Checkpointing continues to be the most prevalent technique for providing fault tolerance in such high-end systems. Considerable research has focussed on optimizing checkpointing; however, in practice, checkpointing still involves a high-cost overhead for users. In this paper, we study the checkpointing overhead seen by various applications running on leadership-class machines like the IBM Blue Gene/P at Argonne National Laboratory. In addition to studying popular applications, we design a methodology to help users understand and intelligently choose an optimal checkpointing frequency to reduce the overall checkpointing overhead incurred. In particular, we study the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, the Nek5000 computational fluid dynamics application, and the Parallel Ocean Program application, and analyze their memory usage and possible checkpointing trends on 65,536 processors of the Blue Gene/P system.

Authors:
Gupta, R; Naik, H; Beckman, P [1]
  1. Mathematics and Computer Science
Publication Date:
May 2011
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1015548
Report Number(s):
ANL/MCS/JA-67132
Journal ID: 1094-3420; TRN: US201111%%658
DOE Contract Number:  
DE-AC02-06CH11357
Resource Type:
Journal Article
Journal Name:
Int. J. High Perform. Comput. Appl.
Additional Journal Information:
Journal Volume: 25; Journal Issue: 2; May 2011
Country of Publication:
United States
Language:
ENGLISH
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ANL; COMPUTERIZED SIMULATION; DESIGN; FLUID MECHANICS; TOLERANCE

Citation Formats

Gupta, R, Naik, H, and Beckman, P. Understanding checkpointing overheads on massive-scale systems: analysis of the IBM Blue Gene/P system. United States: N. p., 2011. Web. doi:10.1177/1094342010369118.
Gupta, R, Naik, H, & Beckman, P. (2011). Understanding checkpointing overheads on massive-scale systems: analysis of the IBM Blue Gene/P system. United States. doi:10.1177/1094342010369118.
Gupta, R, Naik, H, and Beckman, P. 2011. "Understanding checkpointing overheads on massive-scale systems: analysis of the IBM Blue Gene/P system". United States. doi:10.1177/1094342010369118.
@article{osti_1015548,
title = {Understanding checkpointing overheads on massive-scale systems: analysis of the IBM Blue Gene/P system},
author = {Gupta, R and Naik, H and Beckman, P},
abstractNote = {Providing fault tolerance in high-end petascale systems, consisting of millions of hardware components and complex software stacks, is becoming an increasingly challenging task. Checkpointing continues to be the most prevalent technique for providing fault tolerance in such high-end systems. Considerable research has focussed on optimizing checkpointing; however, in practice, checkpointing still involves a high-cost overhead for users. In this paper, we study the checkpointing overhead seen by various applications running on leadership-class machines like the IBM Blue Gene/P at Argonne National Laboratory. In addition to studying popular applications, we design a methodology to help users understand and intelligently choose an optimal checkpointing frequency to reduce the overall checkpointing overhead incurred. In particular, we study the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, the Nek5000 computational fluid dynamics application, and the Parallel Ocean Program application, and analyze their memory usage and possible checkpointing trends on 65,536 processors of the Blue Gene/P system.},
doi = {10.1177/1094342010369118},
journal = {Int. J. High Perform. Comput. Appl.},
number = {2},
volume = 25,
place = {United States},
year = {2011},
month = {5}
}