Affinity-aware checkpoint restart

Saini, Ajay; Rezaei, Arash; Mueller, Frank; Hargrove, Paul; Roman, Eric

doi:10.1145/2663165.2663325

Abstract

Current checkpointing techniques employed to overcome faults in HPC applications result in inferior performance after restart from a checkpoint for a number of applications. This is due to a lack of page and core affinity awareness in the checkpoint/restart (C/R) mechanism: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory access (NUMA) architectures, which are quite common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness demonstrate significant performance benefits of 37%-73% for the NAS Parallel Benchmark codes and 6%-12% for NAMD with negligible overheads, whereas execution times without affinity-aware restarts are up to nearly four times longer on 16 cores.
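
The core-affinity half of the idea in the abstract can be illustrated with a minimal sketch. It is not the BLCR implementation and is not taken from the paper: the function names and the JSON record format are hypothetical, chosen only for illustration. It assumes a Linux host, where Python exposes `os.sched_getaffinity` and `os.sched_setaffinity`; on other platforms the sketch falls back to doing nothing. NUMA page placement is not covered here; that would need kernel interfaces such as `mbind(2)` or `move_pages(2)`.

```python
import json
import os


def current_affinity():
    """Return the set of CPU ids this process may run on.

    os.sched_getaffinity is Linux-specific; elsewhere assume all CPUs.
    """
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    return set(range(os.cpu_count() or 1))


def apply_affinity(cpus):
    """Pin this process to `cpus` if the platform allows it; return success."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, cpus)
        return True
    except OSError:
        # For example, a recorded CPU does not exist on the restart host.
        return False


def checkpoint_affinity():
    """Serialize the task-to-core map so it can be stored with a checkpoint.

    The JSON layout is a hypothetical format used only in this sketch.
    """
    return json.dumps({"cpus": sorted(current_affinity())})


def restart_with_affinity(record):
    """Re-apply a recorded task-to-core map and return the resulting CPU set."""
    cpus = set(json.loads(record)["cpus"])
    apply_affinity(cpus)
    return current_affinity()
```

A restart path would call `checkpoint_affinity()` when the checkpoint is taken and `restart_with_affinity(record)` once the process image has been restored, so the task returns to the cores it was originally pinned to rather than wherever the scheduler first places it.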

Authors:

Saini, Ajay [1]; Rezaei, Arash [1]; Mueller, Frank [1]; Hargrove, Paul [2]; Roman, Eric [2]

[1] North Carolina State Univ., Raleigh, NC (United States)
[2] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Publication Date: 2014-12-08

Research Org.: Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Sponsoring Org.: Computational Research Division; USDOE

OSTI Identifier: 1342535

Report Number(s): LBNL-1006168; ir:1006168

Resource Type: Journal Article: Accepted Manuscript

Journal Name: ACM Digital Library

Additional Journal Information: Conference: Proceedings of the 15th International Middleware Conference, Bordeaux (France), 8-12 Dec 2014

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; checkpoint and restart; fault tolerance; multi-core; NUMA; system software

Citation Formats

Saini, Ajay, Rezaei, Arash, Mueller, Frank, Hargrove, Paul, and Roman, Eric. Affinity-aware checkpoint restart. United States: N. p., 2014. Web. doi:10.1145/2663165.2663325.

Saini, Ajay, Rezaei, Arash, Mueller, Frank, Hargrove, Paul, & Roman, Eric. Affinity-aware checkpoint restart. United States. https://doi.org/10.1145/2663165.2663325

Saini, Ajay, Rezaei, Arash, Mueller, Frank, Hargrove, Paul, and Roman, Eric. 2014. "Affinity-aware checkpoint restart". United States. https://doi.org/10.1145/2663165.2663325. https://www.osti.gov/servlets/purl/1342535.

@article{osti_1342535,
  title        = {Affinity-aware checkpoint restart},
  author       = {Saini, Ajay and Rezaei, Arash and Mueller, Frank and Hargrove, Paul and Roman, Eric},
  abstractNote = {Current checkpointing techniques employed to overcome faults in HPC applications result in inferior performance after restart from a checkpoint for a number of applications. This is due to a lack of page and core affinity awareness in the checkpoint/restart (C/R) mechanism: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory access (NUMA) architectures, which are quite common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness demonstrate significant performance benefits of 37%-73% for the NAS Parallel Benchmark codes and 6%-12% for NAMD with negligible overheads, whereas execution times without affinity-aware restarts are up to nearly four times longer on 16 cores.},
  doi          = {10.1145/2663165.2663325},
  url          = {https://www.osti.gov/biblio/1342535},
  journal      = {ACM Digital Library},
  place        = {United States},
  year         = {2014},
  month        = {12}
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1145/2663165.2663325

Citation Metrics:

Cited by: 1 work

Citation information provided by Web of Science

Works referenced in this record:

Cooperative checkpointing: a robust approach to large-scale systems reliability
conference, January 2006

Oliner, Adam J.; Rudolph, Larry; Sahoo, Ramendra K.
Proceedings of the 20th annual international conference on Supercomputing - ICS '06
https://doi.org/10.1145/1183401.1183406

AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing
conference, January 2013

Nicolae, Bogdan; Cappello, Franck
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing - HPDC '13
https://doi.org/10.1145/2493123.2462918

A 'cool' way of improving the reliability of HPC machines
conference, January 2013

Sarood, Osman; Meneses, Esteban; Kale, Laxmikant V.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13
https://doi.org/10.1145/2503210.2503228

The NAS Parallel Benchmarks
journal, September 1991

Bailey, D. H.; Barszcz, E.; Barton, J. T.
The International Journal of Supercomputing Applications, Vol. 5, Issue 3
https://doi.org/10.1177/109434209100500306

Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
conference, January 2013

Li, Dong; Chen, Zizhong; Wu, Panruo
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13
https://doi.org/10.1145/2503210.2503226

Adaptive incremental checkpointing for massively parallel systems
conference, January 2004

Agarwal, Saurabh; Garg, Rahul; Gupta, Meeta S.
Proceedings of the 18th annual international conference on Supercomputing - ICS '04
https://doi.org/10.1145/1006209.1006248

CHARM++: a portable concurrent object oriented system based on C++
conference, January 1993

Kale, Laxmikant V.; Krishnan, Sanjeev
Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications - OOPSLA '93
https://doi.org/10.1145/165854.165874

Scalable molecular dynamics with NAMD
journal, January 2005

Phillips, James C.; Braun, Rosemary; Wang, Wei
Journal of Computational Chemistry, Vol. 26, Issue 16, p. 1781-1802
https://doi.org/10.1002/jcc.20289

Similar records in OSTI.GOV collections:

Berkeley lab checkpoint/restart (BLCR) for Linux clusters

Journal Article Hargrove, Paul; Duell, Jason - Journal of Physics. Conference Series

This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to fault precursors (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling; for instance ...
Cited by 165
https://doi.org/10.1088/1742-6596/46/1/067

SCR-Exa: Enhanced Scalable Checkpoint Restart (SCR) Library for Next Generation Exascale Computing

Technical Report Dai, Donglai

As the field of High-Performance Computing (HPC) heads towards exascale with modern processing, networking and storage technologies, it is increasingly important to provide support for fast I/O operations and scalable checkpoint-restart for users of these systems. Fast I/O support is critical for applications handling large-scale data and for visualizing the results. Checkpoint-restart enables users to tolerate failures in the underlying commodity components (processors, memory, interconnect, and storage) of HPC systems and run applications on a continuous basis without productivity loss. The Scalable Checkpoint-Restart (SCR) project, funded by DOE and developed by researchers from the Lawrence Livermore National Laboratory (LLNL), has ...

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance

Conference Wang, Chao; Mueller, Frank; Engelmann, Christian; ...

Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a mean-time-to-failure (MTTF) on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since all but one node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism ...
https://doi.org/10.1109/IPDPS.2007.370307

Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters

Journal Article Hargrove, Paul; Duell, Jason - Journal of Physics: Conference Series

This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling; for instance reducing idle cycles (by allowing for shutdown without any queue draining period ...
https://doi.org/10.1088/1742-6596/46/1/067

Scalable Transparent Checkpoint-Restart of Global Address Space Applications on Virtual Machines over Infiniband

Conference Villa, Oreste; Krishnamoorthy, Sriram; Nieplocha, Jaroslaw; ...

Checkpoint-Restart is one of the most used software approaches to achieve fault-tolerance in high-end clusters. While standard techniques typically focus on user-level solutions, the advent of virtualization software has enabled efficient and transparent system-level approaches. In this paper, we present a scalable transparent system-level solution to address fault-tolerance for applications based on global address space (GAS) programming models on Infiniband clusters. In addition to handling communication, the solution addresses transparent checkpoint of user-generated files. We exploit the support for the Infiniband network in the Xen virtual machine environment. We have developed a version of the Aggregate Remote Memory Copy Interface ...
https://doi.org/10.1145/1531743.1531776
