U.S. Department of Energy
Office of Scientific and Technical Information

A simulation infrastructure for examining the performance of resilience strategies at scale

Technical Report ·
DOI: https://doi.org/10.2172/1088091 · OSTI ID: 1088091
 [1];  [2];  [2]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  2. Univ. of New Mexico, Albuquerque, NM (United States)
Fault tolerance is a major challenge for many current and future extreme-scale systems, with many studies showing it to be the key limiter to application scalability. While a number of studies have investigated the performance of various resilience mechanisms, they are typically limited to simple benchmark problems and to scales orders of magnitude smaller than those expected for next-generation systems. In this paper we show how, with very minor changes, a previously published and validated simulation framework for investigating the effect of OS noise on application performance can be used to simulate the overheads of various resilience mechanisms at scale. Using this framework, we compare the simulator's failure-free performance against an analytic model to validate it, and we demonstrate its ability to simulate the performance of two popular rollback recovery methods on traces from real.
Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); University of New Mexico, Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1088091
Report Number(s):
SAND--2013-3180; 456237
Country of Publication:
United States
Language:
English

Similar Records

Cooperative Application/OS DRAM Fault Recovery
Technical Report · Mon Apr 30 20:00:00 EDT 2012 · OSTI ID:1044954

Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0)
Technical Report · Sat Oct 01 00:00:00 EDT 2016 · OSTI ID:1338552

Using Simulation to Evaluate the Performance of Resilience Strategies and Process Failures
Technical Report · Tue Dec 31 23:00:00 EST 2013 · OSTI ID:1204092
