OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Model-Based Case for Redundant Computation

Abstract

Despite its seemingly nonsensical cost, we show through modeling and simulation that redundant computation merits full consideration as a resilience strategy for next-generation systems. Without revolutionary breakthroughs in failure rates, part counts, or stable-storage bandwidths, it has been shown that the utility of Exascale systems will be crushed by the overheads of traditional checkpoint/restart mechanisms. Alternate resilience strategies must be considered, and redundancy is a proven unrivaled approach in many domains. We develop a distribution-independent model for job interrupts on systems of arbitrary redundancy, adapt Daly’s model for total application runtime, and find that his estimate for optimal checkpoint interval remains valid for redundant systems. We then identify conditions where redundancy is more cost effective than non-redundancy. These are done in the context of the number one supercomputers of the last decade, showing that thorough consideration of redundant computation is timely - if not overdue.
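The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the report's own model: it uses Daly's published higher-order estimate of the optimal checkpoint interval and his expected wall-time formula, plus a simple Monte Carlo estimate of the job mean time to interrupt (MTTI) under redundancy. The function names, the parameter values, and the assumption of exponentially distributed node lifetimes are illustrative choices made here (the report's interrupt model is distribution-independent).

```python
import math
import random


def daly_optimal_interval(delta, mtti):
    """Daly's higher-order estimate of the optimal compute time between
    checkpoints, given checkpoint write time `delta` and job MTTI `mtti`."""
    if delta >= 2.0 * mtti:
        return mtti
    x = delta / (2.0 * mtti)
    return math.sqrt(2.0 * delta * mtti) * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - delta


def daly_wall_time(solve_time, tau, delta, restart, mtti):
    """Daly's expected total wall-clock time for a job needing `solve_time`
    of failure-free compute, checkpointing every `tau` units of compute."""
    return (mtti * math.exp(restart / mtti)
            * (math.exp((tau + delta) / mtti) - 1.0) * solve_time / tau)


def simulate_job_mtti(node_mtbf, ranks, replicas, trials, seed=0):
    """Monte Carlo estimate of job MTTI (assumption: exponential node
    lifetimes, no repair before restart). The job is interrupted when every
    replica of any single rank has failed."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(
            max(rng.expovariate(1.0 / node_mtbf) for _ in range(replicas))
            for _ in range(ranks)
        )
    return total / trials


if __name__ == "__main__":
    # Hypothetical parameters, in hours; chosen only for illustration.
    node_mtbf, ranks = 1000.0, 100
    delta, restart, solve_time = 0.1, 0.1, 100.0
    for replicas in (1, 2):
        mtti = simulate_job_mtti(node_mtbf, ranks, replicas, trials=2000)
        tau = daly_optimal_interval(delta, mtti)
        wall = daly_wall_time(solve_time, tau, delta, restart, mtti)
        # Node-hours consumed: redundancy multiplies the node count.
        cost = wall * ranks * replicas
        print(f"replicas={replicas} MTTI={mtti:.1f} tau={tau:.2f} "
              f"wall={wall:.1f} node-hours={cost:.0f}")
```

Running the script compares wall time and node-hours with and without dual redundancy for one hypothetical configuration. Whether redundancy wins depends on the node count, node reliability, and checkpoint cost, which is the kind of condition the abstract says the report identifies.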

Authors:
 Stearley, Jon R. [1]; Robinson, David Gerald [1]; Ferreira, Kurt Brian [1]; Riesen, Rolf [2]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  2. IBM Research, Dublin (Ireland)
Publication Date:
2011-08
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1113872
Report Number(s):
SAND2011-5909
464290
DOE Contract Number:  
AC04-94AL85000
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Stearley, Jon R., Robinson, David Gerald, Ferreira, Kurt Brian, and Riesen, Rolf. A Model-Based Case for Redundant Computation. United States: N. p., 2011. Web. doi:10.2172/1113872.
Stearley, Jon R., Robinson, David Gerald, Ferreira, Kurt Brian, & Riesen, Rolf. A Model-Based Case for Redundant Computation. United States. doi:10.2172/1113872.
Stearley, Jon R., Robinson, David Gerald, Ferreira, Kurt Brian, and Riesen, Rolf. 2011. "A Model-Based Case for Redundant Computation". United States. doi:10.2172/1113872. https://www.osti.gov/servlets/purl/1113872.
@techreport{osti_1113872,
title = {A Model-Based Case for Redundant Computation},
author = {Stearley, Jon R. and Robinson, David Gerald and Ferreira, Kurt Brian and Riesen, Rolf},
abstractNote = {Despite its seemingly nonsensical cost, we show through modeling and simulation that redundant computation merits full consideration as a resilience strategy for next-generation systems. Without revolutionary breakthroughs in failure rates, part counts, or stable-storage bandwidths, it has been shown that the utility of Exascale systems will be crushed by the overheads of traditional checkpoint/restart mechanisms. Alternate resilience strategies must be considered, and redundancy is a proven unrivaled approach in many domains. We develop a distribution-independent model for job interrupts on systems of arbitrary redundancy, adapt Daly’s model for total application runtime, and find that his estimate for optimal checkpoint interval remains valid for redundant systems. We then identify conditions where redundancy is more cost effective than non-redundancy. These are done in the context of the number one supercomputers of the last decade, showing that thorough consideration of redundant computation is timely - if not overdue.},
doi = {10.2172/1113872},
institution = {Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)},
number = {SAND2011-5909},
place = {United States},
year = {2011},
month = aug
}
