U.S. Department of Energy
Office of Scientific and Technical Information

Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

Thesis/Dissertation
OSTI ID:1226922
 [1]
  1. Univ. of New Mexico, Albuquerque, NM (United States)
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable of executing more than an exaflop, 10^18 floating-point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, we examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, we evaluate the potential impact of rollback avoidance on these systems. We then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, we examine the feasibility of using this technique to protect against memory faults in kernel memory.
Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1226922
Report Number(s):
SAND2015--10197T; 609792
Country of Publication:
United States
Language:
English

Similar Records

Mini-Ckpts: Surviving OS Failures in Persistent Memory
Conference · Thu Dec 31 23:00:00 EST 2015 · OSTI ID:1260089

A case for Virtual Machine based Fault Injection in a High-Performance Computing Environment
Conference · Fri Dec 31 23:00:00 EST 2010 · OSTI ID:1037028

Reverse Computation for Rollback-based Fault Tolerance in Large Parallel Systems
Journal Article · Mon Dec 31 23:00:00 EST 2012 · Cluster Computing · OSTI ID:1088141
