Implementing Software Resiliency in HPX for Extreme Scale Computing
- Computer Science and Engineering, IIT Roorkee (India)
- Sandia National Lab. (SNL-CA), Livermore, CA (United States)
- Louisiana State Univ., Baton Rouge, LA (United States)
The DOE Office of Science Exascale Computing Project (ECP) outlines the next milestones in the supercomputing domain. The target computing systems under the project will deliver 10x performance while keeping the power budget under 30 megawatts. With machines this large, making applications resilient has become paramount. The benefits of adding resiliency to mission-critical and scientific applications include the reduced cost, in both time and power, of restarting a failed simulation. Most current implementations of resiliency at the software level use Coordinated Checkpoint and Restart (C/R). This technique generates a consistent global snapshot, also called a checkpoint. Generating a snapshot requires global communication and coordination, which is achieved by synchronizing all running processes. The generated checkpoint is then stored in some form of persistent storage. On failure detection, the runtime initiates a global rollback to the most recently saved checkpoint. This involves aborting all running processes, rolling them back to the previous state, and restarting them.
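The coordinated C/R cycle described above can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the report's implementation and not HPX code: the `Worker` class, the `take_checkpoint`, `rollback`, and `run` functions, and all parameters are hypothetical names introduced for illustration, an in-memory list stands in for persistent storage, and a failure is injected once at a fixed step rather than detected by a runtime.

```python
import copy


class Worker:
    """Hypothetical stand-in for one running process with local state."""

    def __init__(self, rank):
        self.rank = rank
        self.step = 0
        self.value = 0

    def advance(self):
        # One unit of simulated work.
        self.step += 1
        self.value += self.rank + 1


def take_checkpoint(workers):
    """Coordinated checkpoint: every worker must be at the same step
    (the synchronization point) before the global snapshot is copied."""
    assert len({w.step for w in workers}) == 1, "workers must be synchronized"
    # Assumption: an in-memory copy stands in for persistent storage.
    return copy.deepcopy([(w.step, w.value) for w in workers])


def rollback(workers, checkpoint):
    """Global rollback: restore all workers, not only the failed one."""
    for w, (step, value) in zip(workers, checkpoint):
        w.step, w.value = step, value


def run(num_workers=3, total_steps=10, checkpoint_every=4, fail_at_step=6):
    workers = [Worker(r) for r in range(num_workers)]
    checkpoint = take_checkpoint(workers)
    failed_once = False
    rollbacks = 0
    steps_executed = 0  # per-worker work, including recomputed steps

    while workers[0].step < total_steps:
        for w in workers:
            w.advance()
        steps_executed += 1

        # Injected failure (assumption): happens once, at a fixed step.
        if workers[0].step == fail_at_step and not failed_once:
            failed_once = True
            rollback(workers, checkpoint)
            rollbacks += 1
            continue

        if workers[0].step % checkpoint_every == 0:
            checkpoint = take_checkpoint(workers)

    return [w.value for w in workers], rollbacks, steps_executed


if __name__ == "__main__":
    values, rollbacks, steps_executed = run()
    print(values, rollbacks, steps_executed)
```

With the default parameters, the run reaches the same final state as a failure-free run, but each worker executes more steps than `total_steps` because the work done between the last checkpoint and the failure is repeated by every worker. That repeated work is the restart cost in time and power that the abstract refers to.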
- Research Organization:
- Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- DOE Contract Number:
- AC04-94AL85000; NA0003525
- OSTI ID:
- 1614897
- Report Number(s):
- SAND--2020-3975R; 685292
- Country of Publication:
- United States
- Language:
- English
Similar Records
Resiliency in numerical algorithm design for extreme scale simulations
Towards Low-Overhead Resilience for Data Parallel Deep Learning