U.S. Department of Energy
Office of Scientific and Technical Information

Implementing Software Resiliency in HPX for Extreme Scale Computing

Technical Report · DOI: https://doi.org/10.2172/1614897 · OSTI ID: 1614897
 [1];  [2];  [3];  [3]
  1. Computer Science and Engineering, IIT Roorkee (India)
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
  3. Louisiana State Univ., Baton Rouge, LA (United States)

The DOE Office of Science Exascale Computing Project (ECP) outlines the next milestones in the supercomputing domain. The target computing systems under the project will deliver 10x performance while keeping the power budget under 30 megawatts. With machines this large, making applications resilient has become paramount. The benefits of adding resiliency to mission-critical and scientific applications include a reduced cost, in both time and power, of restarting a failed simulation. Most current software-level implementations of resiliency use Coordinated Checkpoint and Restart (C/R). This technique generates a consistent global snapshot, also called a checkpoint. Generating a snapshot requires global communication and coordination, achieved by synchronizing all running processes. The checkpoint is then written to persistent storage. When a failure is detected, the runtime initiates a global rollback to the most recently saved checkpoint: it aborts all running processes, rolls them back to that saved state, and restarts them.
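
The coordinated C/R cycle described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the report's implementation and not HPX code: the function name `run_with_coordinated_cr`, its parameters, the toy per-worker state (a running sum), and the injected failures are all hypothetical. Workers advance in lockstep, a global snapshot is taken every `checkpoint_every` steps, and a detected failure rolls every worker back to the last snapshot.

```python
import copy


def run_with_coordinated_cr(num_workers, num_steps, checkpoint_every, fail_at_steps):
    """Simulate lockstep workers with coordinated checkpoint/restart.

    fail_at_steps is a set of step indices at which a simulated failure is
    detected the first time that step is attempted (hypothetical fault model).
    Returns (final_states, rollbacks, steps_executed).
    """
    states = [0] * num_workers                 # toy state: one running sum per worker
    checkpoint = (0, copy.deepcopy(states))    # stand-in for persistent storage
    pending_failures = set(fail_at_steps)
    rollbacks = 0
    executed = 0
    step = 0

    while step < num_steps:
        if step in pending_failures:
            # Failure detected: abort the current step and roll ALL workers
            # back to the most recent global checkpoint, then resume.
            pending_failures.discard(step)
            step, saved = checkpoint
            states = copy.deepcopy(saved)
            rollbacks += 1
            continue

        # One lockstep step: every worker updates its own state.
        for w in range(num_workers):
            states[w] += (w + 1) * (step + 1)
        executed += 1
        step += 1

        # All workers have reached the same step, so the snapshot taken
        # here is globally consistent (the "coordination" in coordinated C/R).
        if step % checkpoint_every == 0:
            checkpoint = (step, copy.deepcopy(states))

    return states, rollbacks, executed
```

With 3 workers, 10 steps, a checkpoint every 4 steps, and one failure injected at step 6, the run ends in the same state as a failure-free run but executes 12 steps instead of 10, because the two steps completed after the last checkpoint are redone by every worker. That repeated work, paid by all processes for a single fault, is the restart cost the abstract refers to.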

Research Organization:
Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
AC04-94AL85000; NA0003525
OSTI ID:
1614897
Report Number(s):
SAND--2020-3975R; 685292
Country of Publication:
United States
Language:
English
