A simulation infrastructure for examining the performance of resilience strategies at scale
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Univ. of New Mexico, Albuquerque, NM (United States)
Fault tolerance is a major challenge for many current and future extreme-scale systems, with many studies showing it to be the key limiter to application scalability. While a number of studies have investigated the performance of various resilience mechanisms, they are typically limited to simple benchmark problems and to scales orders of magnitude smaller than those expected for next-generation systems. In this paper we show how, with very minor changes, a previously published and validated simulation framework for investigating the effect of OS noise on application performance can be used to simulate the overheads of various resilience mechanisms at scale. Using this framework, we validate the simulator by comparing its failure-free performance against an analytic model, and we demonstrate its ability to simulate the performance of two popular rollback recovery methods on traces from real.
- Research Organization:
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); University of New Mexico, Albuquerque, NM (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- DOE Contract Number:
- AC04-94AL85000
- OSTI ID:
- 1088091
- Report Number(s):
- SAND--2013-3180; 456237
- Country of Publication:
- United States
- Language:
- English