Affinity-aware checkpoint restart

Journal Article · December 8, 2014 · ACM Digital Library

DOI: https://doi.org/10.1145/2663165.2663325 · OSTI ID: 1342535

Saini, Ajay [1]; Rezaei, Arash [1]; Mueller, Frank [1]; Hargrove, Paul [2]; Roman, Eric [2]

[1] North Carolina State Univ., Raleigh, NC (United States)
[2] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Current checkpointing techniques used to overcome faults in HPC applications cause a number of applications to perform worse after restarting from a checkpoint. The cause is that the checkpoint/restart (C/R) mechanism lacks page and core affinity awareness: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory access (NUMA) architectures, which are quite common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness demonstrate significant performance benefits of 37%-73% for the NAS Parallel Benchmark codes and 6%-12% for NAMD with negligible overheads, whereas execution times without affinity-aware restarts are up to nearly four times longer on 16 cores.
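
The idea of preserving a task-to-core map across a restart can be illustrated with a minimal Python sketch. It is not the BLCR implementation described in the article; it assumes a Linux host (where `os.sched_getaffinity` and `os.sched_setaffinity` are available), and the function names and the JSON side file are hypothetical choices made for illustration. NUMA page affinity, the second half of the technique, is not covered here.

```python
import json
import os
import tempfile


def capture_affinity(pid=0):
    """Return the cores the task may run on as a JSON-friendly record."""
    # pid=0 means "the calling process" for the sched_* calls.
    return {"cpus": sorted(os.sched_getaffinity(pid))}


def save_affinity(path, pid=0):
    """Store the task-to-core map in a side file (hypothetical format)."""
    with open(path, "w") as f:
        json.dump(capture_affinity(pid), f)


def restore_affinity(path, pid=0):
    """Re-pin the task to the cores recorded at checkpoint time."""
    with open(path) as f:
        record = json.load(f)
    os.sched_setaffinity(pid, record["cpus"])
    return record


if __name__ == "__main__":
    # Simulate one checkpoint/restart cycle for the current process.
    side_file = os.path.join(tempfile.mkdtemp(), "affinity.json")
    save_affinity(side_file)
    print("restored core set:", restore_affinity(side_file)["cpus"])
```

In a real C/R mechanism the saved map would travel with the checkpoint image and be applied before the restarted task resumes execution, so that the task does not first run on, and allocate memory near, a different core.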

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: Computational Research Division; USDOE

OSTI ID: 1342535

Report Number(s): LBNL--1006168; ir:1006168

Journal Information: ACM Digital Library

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
NUMA
checkpoint and restart
fault tolerance
multi-core
system software