skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

Thesis/Dissertation ·
OSTI ID:1226922
 [1]
  1. Univ. of New Mexico, Albuquerque, NM (United States)

High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the rst supercomputer capable executing more than an exa op, 1018 oating point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extremescale systems. Coordinated checkpointing is currently the most widely-used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains e ective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniqes include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and softwarebased memory fault correction. In this thesis, we examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, we evaluate the potential impact of rollback avoidance on these systems. We then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, we examine the feasibility of using this technique to protect against memory faults in kernel memory.

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1226922
Report Number(s):
SAND2015-10197T; 609792
Country of Publication:
United States
Language:
English

Similar Records

Resiliency in numerical algorithm design for extreme scale simulations
Journal Article · Fri Dec 10 00:00:00 EST 2021 · International Journal of High Performance Computing Applications · OSTI ID:1226922

Combining Partial Redundancy and Checkpointing for HPC
Conference · Sun Jan 01 00:00:00 EST 2012 · OSTI ID:1226922

A case for Virtual Machine based Fault Injection in a High-Performance Computing Environment
Conference · Sat Jan 01 00:00:00 EST 2011 · OSTI ID:1226922

Related Subjects