Title: Affinity-aware checkpoint restart

Abstract

For a number of HPC applications, current checkpointing techniques used to tolerate faults result in inferior performance after restart from a checkpoint. The cause is a lack of page and core affinity awareness in the checkpoint/restart (C/R) mechanism: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory access (NUMA) architectures, which are quite common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness show significant performance benefits of 37%-73% for the NAS Parallel Benchmark codes and 6%-12% for NAMD with negligible overhead; without affinity-aware restarts, execution times on 16 cores are up to nearly four times longer.
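
The idea of preserving a task-to-core map across a restart can be illustrated with a minimal sketch. This is not the BLCR implementation described in the paper; it only assumes a Linux host, where Python exposes `os.sched_getaffinity` and `os.sched_setaffinity`, and it uses hypothetical helper names (`capture_affinity`, `save_affinity`, `restore_affinity`) and a hypothetical JSON side file stored next to the checkpoint image. NUMA page affinity is not covered here.

```python
import json
import os


def capture_affinity(pid=0):
    """Return the sorted CPU ids the task may currently run on (Linux only)."""
    return sorted(os.sched_getaffinity(pid))


def save_affinity(path, pid=0):
    """At checkpoint time: record the CPU mask in a side file (hypothetical format)."""
    record = {"cpus": capture_affinity(pid)}
    with open(path, "w") as f:
        json.dump(record, f)
    return record


def restore_affinity(path, pid=0):
    """At restart time: re-pin the task to the recorded CPUs if the host allows it."""
    with open(path) as f:
        wanted = set(json.load(f)["cpus"])
    try:
        os.sched_setaffinity(pid, wanted)
    except OSError:
        # The recorded CPUs are not usable on this host; keep the current mask.
        pass
    return capture_affinity(pid)
```

In this sketch, `save_affinity` would be called just before the checkpoint is taken and `restore_affinity` right after the process is restarted, so that a task pinned to given cores before the fault ends up on the same cores afterwards when they are available.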

Authors:
Saini, Ajay [1]; Rezaei, Arash [1]; Mueller, Frank [1]; Hargrove, Paul [2]; Roman, Eric [2]
  1. North Carolina State Univ., Raleigh, NC (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
2014-12-08
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
Computational Research Division; USDOE
OSTI Identifier:
1342535
Report Number(s):
LBNL-1006168
ir:1006168
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
ACM Digital Library
Additional Journal Information:
Conference: Proceedings of the 15th International Middleware Conference, Bordeaux (France), 8-12 Dec 2014
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; checkpoint and restart; fault tolerance; multi-core; NUMA; system software

Citation Formats

Saini, Ajay, Rezaei, Arash, Mueller, Frank, Hargrove, Paul, and Roman, Eric. Affinity-aware checkpoint restart. United States: N. p., 2014. Web. doi:10.1145/2663165.2663325.
Saini, Ajay, Rezaei, Arash, Mueller, Frank, Hargrove, Paul, & Roman, Eric. Affinity-aware checkpoint restart. United States. https://doi.org/10.1145/2663165.2663325
Saini, Ajay, Rezaei, Arash, Mueller, Frank, Hargrove, Paul, and Roman, Eric. 2014. "Affinity-aware checkpoint restart". United States. https://doi.org/10.1145/2663165.2663325. https://www.osti.gov/servlets/purl/1342535.
@article{osti_1342535,
title = {Affinity-aware checkpoint restart},
author = {Saini, Ajay and Rezaei, Arash and Mueller, Frank and Hargrove, Paul and Roman, Eric},
abstractNote = {For a number of HPC applications, current checkpointing techniques used to tolerate faults result in inferior performance after restart from a checkpoint. The cause is a lack of page and core affinity awareness in the checkpoint/restart (C/R) mechanism: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory access (NUMA) architectures, which are quite common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness show significant performance benefits of 37%-73% for the NAS Parallel Benchmark codes and 6%-12% for NAMD with negligible overhead; without affinity-aware restarts, execution times on 16 cores are up to nearly four times longer.},
doi = {10.1145/2663165.2663325},
url = {https://www.osti.gov/biblio/1342535},
journal = {ACM Digital Library},
place = {United States},
year = {2014},
month = {12}
}
