Title: Affinity-aware checkpoint restart

Journal Article · ACM Digital Library
 [1];  [1];  [1];  [2];  [2]
  1. North Carolina State Univ., Raleigh, NC (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Current checkpointing techniques used to overcome faults in HPC applications result in inferior performance after restart from a checkpoint for a number of applications. This is due to a lack of page and core affinity awareness in the checkpoint/restart (C/R) mechanism: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory access (NUMA) architectures, which are quite common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness demonstrate significant performance benefits of 37-73% for the NAS Parallel Benchmark codes and 6-12% for NAMD, with negligible overheads, whereas execution times without affinity-aware restarts are up to nearly four times longer on 16 cores.
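
As an illustration of the task-to-core half of this idea, the following is a minimal Python sketch, assuming a Linux host. It is not BLCR's implementation: the helper names (capture_affinity, to_checkpoint_metadata, from_checkpoint_metadata, restore_affinity) and the JSON layout are hypothetical. It records the cores a task is pinned to, stores that map as checkpoint metadata, and re-applies it after a restart. NUMA page affinity is not covered here.

```python
import json
import os


def capture_affinity(pid=0):
    """Return the sorted list of cores the task may run on (Linux only).

    pid=0 means the calling process.
    """
    return sorted(os.sched_getaffinity(pid))


def to_checkpoint_metadata(task_cores):
    """Serialize a {task_id: [cores]} map to a JSON string.

    Hypothetical layout: the string would be stored next to the checkpoint image.
    """
    return json.dumps({str(task): cores for task, cores in task_cores.items()})


def from_checkpoint_metadata(blob):
    """Inverse of to_checkpoint_metadata: rebuild the {task_id: [cores]} map."""
    return {int(task): cores for task, cores in json.loads(blob).items()}


def restore_affinity(cores, pid=0):
    """Re-pin the task to the cores recorded at checkpoint time.

    Raises OSError if none of the saved cores is usable on this host.
    """
    os.sched_setaffinity(pid, cores)


if __name__ == "__main__":
    # At checkpoint time: record the current pinning for (hypothetical) task 0.
    blob = to_checkpoint_metadata({0: capture_affinity()})

    # At restart time: read the map back and re-apply it.
    saved = from_checkpoint_metadata(blob)
    restore_affinity(saved[0])
    print("task 0 pinned to cores", capture_affinity())
```

A restart that skips the restore step leaves placement to the scheduler, which is the situation the abstract describes as harmful to post-restart performance.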

Research Organization:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
Computational Research Division; USDOE
OSTI ID:
1342535
Report Number(s):
LBNL-1006168; ir:1006168
Journal Information:
ACM Digital Library, Conference: Proceedings of the 15th International Middleware Conference, Bordeaux (France), 8-12 Dec 2014
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 1 work
Citation information provided by
Web of Science
