U.S. Department of Energy
Office of Scientific and Technical Information

Affinity-aware checkpoint restart

Journal Article · ACM Digital Library
 [1];  [1];  [1];  [2];  [2]
  1. North Carolina State Univ., Raleigh, NC (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Current checkpointing techniques used to overcome faults in HPC applications result in inferior performance after restart from a checkpoint for a number of applications. This is due to a lack of page and core affinity awareness in the checkpoint/restart (C/R) mechanism: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory architectures (NUMA), which are quite common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness demonstrate significant performance benefits of 37%-73% for the NAS Parallel Benchmark codes and 6%-12% for NAMD, with negligible overheads. Without affinity-aware restarts, execution times on 16 cores are up to nearly four times longer.
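
The core-affinity half of this idea can be illustrated with a minimal sketch. It is not the BLCR implementation described in the article. It assumes a Linux host, where Python exposes `os.sched_getaffinity` and `os.sched_setaffinity`, and it stores the task-to-core map in a small JSON file next to the checkpoint. The function names `plan_restore`, `save_affinity`, and `restore_affinity`, and the JSON layout, are hypothetical and chosen only for illustration. NUMA page affinity is not covered here; on Linux that would presumably involve system calls such as `mbind` or `move_pages`.

```python
import json
import os


def plan_restore(saved_cores, available_cores):
    """Pick the cores to pin to on restart (hypothetical helper).

    Prefer the cores recorded at checkpoint time; if none of them is
    available any more, fall back to everything currently allowed.
    """
    usable = set(saved_cores) & set(available_cores)
    return usable if usable else set(available_cores)


def save_affinity(path, pid=0):
    """At checkpoint time: record the cores the task is pinned to."""
    cores = sorted(os.sched_getaffinity(pid))  # Linux-only call
    with open(path, "w") as fh:
        json.dump({"cores": cores}, fh)
    return cores


def restore_affinity(path, available_cores, pid=0):
    """At restart time: re-pin the task to its recorded cores."""
    with open(path) as fh:
        saved = json.load(fh)["cores"]
    target = plan_restore(saved, available_cores)
    os.sched_setaffinity(pid, target)  # Linux-only call
    return sorted(os.sched_getaffinity(pid))
```

A checkpoint wrapper would call `save_affinity` just before the checkpoint is taken and `restore_affinity` right after the process image is restored, so that a task pinned to a given core before the fault is pinned to the same core afterwards whenever that core is still available.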

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
Computational Research Division; USDOE
OSTI ID:
1342535
Report Number(s):
LBNL--1006168; ir:1006168
Journal Information:
Journal Name: ACM Digital Library
Country of Publication:
United States
Language:
English
