Implementing Software Resiliency in HPX for Extreme Scale Computing

Gupta, Nikunj; Mayo, Jackson R.; Lemoine, Adrian S.; Kaiser, Hartmut

doi:10.2172/1614897

Technical Report · Wed Apr 15 04:00:00 EDT 2020

DOI: https://doi.org/10.2172/1614897 · OSTI ID: 1614897

Gupta, Nikunj ^[1]; Mayo, Jackson R. ^[2]; Lemoine, Adrian S. ^[3]; Kaiser, Hartmut ^[3]

[1] Computer Science and Engineering, IIT Roorkee (India)
[2] Sandia National Lab. (SNL-CA), Livermore, CA (United States)
[3] Louisiana State Univ., Baton Rouge, LA (United States)

The DOE Office of Science Exascale Computing Project (ECP) outlines the next milestones in the supercomputing domain. The target computing systems under the project will deliver 10x performance while keeping the power budget under 30 megawatts. With machines of this size, making applications resilient has become paramount. The benefits of adding resiliency to mission-critical and scientific applications include a reduced cost, in both time and power, of restarting a failed simulation. Most current software-level implementations of resiliency rely on Coordinated Checkpoint and Restart (C/R). This technique generates a consistent global snapshot, also called a checkpoint. Generating the snapshot requires global communication and coordination, which is achieved by synchronizing all running processes. The checkpoint is then written to some form of persistent storage. When a failure is detected, the runtime initiates a global rollback to the most recently saved checkpoint: all running processes are aborted, rolled back to that saved state, and restarted.
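
The coordinated C/R cycle described in the abstract can be illustrated with a small single-machine simulation. This is only a sketch under stated assumptions, not the report's implementation and not HPX code: the class `CoordinatedCheckpointing`, the function `run`, the per-process state fields, and the use of an in-memory list in place of persistent storage are all hypothetical names and simplifications introduced for illustration.

```python
import copy


class CoordinatedCheckpointing:
    """Toy model of coordinated checkpoint/restart over simulated processes.

    Hypothetical illustration only; no real runtime or HPX API is used.
    """

    def __init__(self, num_procs):
        # Each simulated process holds its own local state.
        self.states = [
            {"rank": r, "step": 0, "value": 0.0} for r in range(num_procs)
        ]
        # In-memory list standing in for persistent storage (assumption).
        self.checkpoints = []

    def advance(self):
        # One lock-step unit of work on every process.
        for s in self.states:
            s["step"] += 1
            s["value"] += s["rank"] + 1

    def checkpoint(self):
        # All processes are at the same step here, so a deep copy of every
        # local state forms a consistent global snapshot.
        self.checkpoints.append(copy.deepcopy(self.states))

    def rollback(self):
        # Global rollback: every process, failed or not, is reset to the
        # most recently saved checkpoint.
        if not self.checkpoints:
            raise RuntimeError("no checkpoint available")
        self.states = copy.deepcopy(self.checkpoints[-1])


def run(total_steps, checkpoint_every, fail_at_step, num_procs=4):
    """Run the simulation with one injected failure; return final states
    and the number of steps that had to be recomputed after rollback."""
    sim = CoordinatedCheckpointing(num_procs)
    sim.checkpoint()  # initial snapshot at step 0
    failed = False
    redone = 0
    while sim.states[0]["step"] < total_steps:
        sim.advance()
        step = sim.states[0]["step"]
        if step == fail_at_step and not failed:
            # Simulated failure detection triggers the global rollback.
            failed = True
            sim.rollback()
            redone = step - sim.states[0]["step"]
            continue
        if step % checkpoint_every == 0:
            sim.checkpoint()
    return sim.states, redone


if __name__ == "__main__":
    states, redone = run(total_steps=10, checkpoint_every=4, fail_at_step=7)
    print("steps recomputed after rollback:", redone)
    print("final step on every process:", {s["step"] for s in states})
```

In this toy model a single failure forces every process to repeat the work done since the last checkpoint, including processes that did not fail, which is the restart cost in time and power that the abstract refers to.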

Research Organization: Sandia National Laboratories (SNL-CA), Livermore, CA (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)

DOE Contract Number: AC04-94AL85000; NA0003525

OSTI ID: 1614897

Report Number(s): SAND--2020-3975R; 685292

Country of Publication: United States

Language: English

Similar Records

Node failure resiliency for Uintah without checkpointing

Journal Article · Sat Jun 01 20:00:00 EDT 2019 · Concurrency and Computation. Practice and Experience · OSTI ID:1637354

Resiliency in numerical algorithm design for extreme scale simulations

Journal Article · Thu Dec 09 19:00:00 EST 2021 · International Journal of High Performance Computing Applications · OSTI ID:1855669

Towards Low-Overhead Resilience for Data Parallel Deep Learning

Conference · Fri Dec 31 23:00:00 EST 2021 · OSTI ID:1887187

Related Subjects

97 MATHEMATICS AND COMPUTING
HPX
asynchronous many task systems
parallel and distributed computing
software resilience
