Affinity-aware checkpoint restart
- North Carolina State Univ., Raleigh, NC (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Current checkpointing techniques used to overcome faults in HPC applications cause a number of applications to perform worse after restarting from a checkpoint. The cause is that the checkpoint/restart (C/R) mechanism lacks page and core affinity awareness: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory architectures (NUMA), which are common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness demonstrate significant performance benefits of 37-73% for the NAS Parallel Benchmark codes and 6-12% for NAMD with negligible overheads, whereas execution times without affinity-aware restarts are up to nearly four times longer on 16 cores.
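The core-affinity half of the idea can be illustrated with a minimal sketch. It is not the BLCR extension described in the abstract: it assumes a Linux host, uses only Python's standard `os.sched_getaffinity` and `os.sched_setaffinity`, and the helper names `capture_affinity` and `restore_affinity` are hypothetical. Preserving NUMA page affinity would additionally require kernel interfaces such as `mbind` or `move_pages`, which are not shown here.

```python
import json
import os


def capture_affinity(pid=0):
    """Return a JSON string recording the cores `pid` may run on (0 = caller).

    Hypothetical helper: stands in for the affinity data a C/R mechanism
    would store alongside the checkpoint image.
    """
    return json.dumps({"cores": sorted(os.sched_getaffinity(pid))})


def restore_affinity(saved, pid=0):
    """Re-pin `pid` to the cores recorded by capture_affinity.

    Returns the affinity mask in effect after the call.
    """
    cores = set(json.loads(saved)["cores"])
    # Raises OSError if none of the recorded cores is available to the process.
    os.sched_setaffinity(pid, cores)
    return os.sched_getaffinity(pid)
```

In this sketch, `capture_affinity()` would be called when the checkpoint is taken and `restore_affinity(saved)` immediately after the task is restarted, so the task resumes on the cores it was originally pinned to rather than wherever the scheduler places it.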
- Research Organization: Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Organization: Computational Research Division; USDOE
- OSTI ID: 1342535
- Report Number(s): LBNL-1006168; ir:1006168
- Journal Information: ACM Digital Library, Conference: Proceedings of the 15th International Middleware Conference, Bordeaux (France), 8-12 Dec 2014
- Country of Publication: United States
- Language: English
- Cooperative checkpointing: a robust approach to large-scale systems reliability (conference, January 2006)
- AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing (conference, January 2013)
- A 'cool' way of improving the reliability of HPC machines (conference, January 2013)
- The NAS Parallel Benchmarks (journal, September 1991)
- Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach (conference, January 2013)
- Adaptive incremental checkpointing for massively parallel systems (conference, January 2004)
- CHARM++: a portable concurrent object oriented system based on C++ (conference, January 1993)
- Scalable molecular dynamics with NAMD (journal, January 2005)
Similar Records
SCR-Exa: Enhanced Scalable Checkpoint Restart (SCR) Library for Next Generation Exascale Computing
A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance