Implementing Software Resiliency in HPX for Extreme Scale Computing
- Computer Science and Engineering, IIT Roorkee (India)
- Sandia National Lab. (SNL-CA), Livermore, CA (United States)
- Louisiana State Univ., Baton Rouge, LA (United States)
The DOE Office of Science Exascale Computing Project (ECP) outlines the next milestones in the supercomputing domain. The target computing systems under the project will deliver 10x performance while keeping the power budget under 30 megawatts. On machines of this scale, making applications resilient becomes paramount. The benefits of adding resiliency to mission-critical and scientific applications include a lower cost, in both time and power, of restarting a failed simulation. Most current software-level implementations of resiliency use Coordinated Checkpoint and Restart (C/R). This technique generates a consistent global snapshot, also called a checkpoint. Generating a snapshot requires global communication and coordination, which is achieved by synchronizing all running processes. The checkpoint is then written to some form of persistent storage. When a failure is detected, the runtime initiates a global rollback to the most recent saved checkpoint: it aborts all running processes, rolls them back to the saved state, and restarts them.
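The C/R cycle described above can be illustrated with a small, single-process simulation. This is a minimal sketch under stated assumptions, not the report's implementation and not the HPX API: the names `Worker`, `SimulatedFailure`, `write_checkpoint`, `restore_checkpoint`, and `run` are hypothetical, the "processes" are plain Python objects, the synchronization point is simply the end of a step, failures are injected deterministically and assumed transient, and a JSON file stands in for persistent storage.

```python
import json
import os
import tempfile


class SimulatedFailure(Exception):
    """Hypothetical stand-in for a detected process failure."""


class Worker:
    """One simulated process holding part of the application state."""

    def __init__(self, rank):
        self.rank = rank
        self.step = 0
        self.value = float(rank)

    def advance(self):
        # Stand-in for one unit of real computation.
        self.value += 1.0
        self.step += 1


def write_checkpoint(workers, path):
    """Store a consistent global snapshot of every worker."""
    # Coordination: a snapshot is only taken when all workers are at the same step.
    assert len({w.step for w in workers}) == 1
    snapshot = [{"rank": w.rank, "step": w.step, "value": w.value} for w in workers]
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(snapshot, f)
    # Rename into place so a crash mid-write cannot corrupt the last good checkpoint.
    os.replace(tmp_path, path)


def restore_checkpoint(workers, path):
    """Global rollback: reset every worker to the most recent snapshot."""
    with open(path) as f:
        snapshot = json.load(f)
    for worker, saved in zip(workers, snapshot):
        worker.step = saved["step"]
        worker.value = saved["value"]


def run(num_workers, total_steps, checkpoint_every, fail_at, path):
    """Advance all workers in lockstep, checkpointing periodically.

    fail_at is a set of (step, rank) pairs at which a failure is injected once.
    Returns the workers and the number of global rollbacks performed.
    """
    workers = [Worker(rank) for rank in range(num_workers)]
    write_checkpoint(workers, path)  # initial checkpoint at step 0
    pending = set(fail_at)
    rollbacks = 0
    while workers[0].step < total_steps:
        try:
            for w in workers:
                if (w.step, w.rank) in pending:
                    pending.discard((w.step, w.rank))  # assumed transient failure
                    raise SimulatedFailure(f"rank {w.rank} failed at step {w.step}")
                w.advance()
        except SimulatedFailure:
            # Abort the step for everyone and roll all workers back together.
            restore_checkpoint(workers, path)
            rollbacks += 1
            continue
        if workers[0].step % checkpoint_every == 0:
            write_checkpoint(workers, path)
    return workers, rollbacks


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    final_workers, n_rollbacks = run(4, 10, 3, {(4, 2)}, ckpt)
    print(n_rollbacks, [w.value for w in final_workers])
```

In this sketch a failure on any single worker discards the progress of all workers since the last checkpoint, which is the restart cost in time and power that the abstract refers to.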
- Research Organization:
- Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- DOE Contract Number:
- AC04-94AL85000; NA0003525
- OSTI ID:
- 1614897
- Report Number(s):
- SAND--2020-3975R; 685292
- Country of Publication:
- United States
- Language:
- English