OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A simulation infrastructure for examining the performance of resilience strategies at scale.

Technical Report · DOI: https://doi.org/10.2172/1088091 · OSTI ID: 1088091

Fault-tolerance is a major challenge for many current and future extreme-scale systems, with many studies showing it to be the key limiter to application scalability. While there are a number of studies investigating the performance of various resilience mechanisms, these are typically limited to scales orders of magnitude smaller than expected for next-generation systems and simple benchmark problems. In this paper we show how, with very minor changes, a previously published and validated simulation framework for investigating application performance of OS noise can be used to simulate the overheads of various resilience mechanisms at scale. Using this framework, we compare the failure-free performance of this simulator against an analytic model to validate its performance and demonstrate its ability to simulate the performance of two popular rollback recovery methods on traces from real HPC workloads, showing how performance can vary dramatically both with scale and the communication behavior of the application.

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); University of New Mexico, Albuquerque, NM
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1088091
Report Number(s):
SAND2013-3180; 456237
Country of Publication:
United States
Language:
English