A Model-Based Case for Redundant Computation

Stearley, Jon R.; Robinson, David Gerald; Ferreira, Kurt Brian; Riesen, Rolf

doi:10.2172/1113872

Technical Report · Mon Aug 01 04:00:00 EDT 2011

DOI: https://doi.org/10.2172/1113872 · OSTI ID: 1113872

Stearley, Jon R. ^[1]; Robinson, David Gerald ^[1]; Ferreira, Kurt Brian ^[1]; Riesen, Rolf ^[2]

[1] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
[2] IBM Research, Dublin (Ireland)

Despite its seemingly nonsensical cost, we show through modeling and simulation that redundant computation merits full consideration as a resilience strategy for next-generation systems. Without revolutionary breakthroughs in failure rates, part counts, or stable-storage bandwidths, it has been shown that the utility of Exascale systems will be crushed by the overheads of traditional checkpoint/restart mechanisms. Alternative resilience strategies must be considered, and redundancy is a proven, unrivaled approach in many domains. We develop a distribution-independent model for job interrupts on systems of arbitrary redundancy, adapt Daly’s model for total application runtime, and find that his estimate for the optimal checkpoint interval remains valid for redundant systems. We then identify conditions under which redundancy is more cost-effective than non-redundancy. These analyses are set in the context of the number-one supercomputers of the last decade, showing that thorough consideration of redundant computation is timely, if not overdue.
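As a rough illustration of the kind of calculation the abstract refers to, the sketch below evaluates Daly's published higher-order estimate of the optimal checkpoint interval and his expression for total wall-clock time. It is not taken from this report: the function names, parameter names, and sample numbers are hypothetical, and the report's own interrupt model for redundant systems is not reproduced here. The mean time to interrupt (`mtti`) is simply treated as an input, which for a redundant system would presumably come from a model such as the one the report develops.

```python
import math


def daly_optimal_interval(delta, mtti):
    """Daly's higher-order estimate of the optimal compute time between checkpoints.

    delta: time to write one checkpoint
    mtti:  mean time to interrupt of the whole system (same time unit as delta)
    """
    if delta >= 2.0 * mtti:
        # Daly's estimate falls back to the MTTI when checkpoints are very costly.
        return mtti
    x = delta / (2.0 * mtti)
    return math.sqrt(2.0 * delta * mtti) * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - delta


def daly_wall_time(solve_time, tau, delta, restart, mtti):
    """Daly's expected total wall-clock time for a job needing `solve_time` of
    failure-free computation, checkpointing every `tau` units of compute time."""
    return (mtti * math.exp(restart / mtti)
            * (math.exp((tau + delta) / mtti) - 1.0)
            * solve_time / tau)


if __name__ == "__main__":
    # Illustrative values only (minutes); these are assumptions, not data from the report.
    delta, restart, mtti, solve_time = 5.0, 10.0, 1440.0, 10000.0
    tau = daly_optimal_interval(delta, mtti)
    print("estimated optimal interval (min):", tau)
    print("expected wall time (min):", daly_wall_time(solve_time, tau, delta, restart, mtti))
```

Under these assumptions, comparing a redundant and a non-redundant configuration would amount to calling the two functions with different `mtti` values and weighing the resulting wall times against the extra nodes that redundancy consumes.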

Research Organization: Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)

DOE Contract Number: AC04-94AL85000

OSTI ID: 1113872

Report Number(s): SAND2011--5909; 464290

Country of Publication: United States

Language: English

Similar Records

Redundant computing for exascale systems.

Technical Report · Tue Nov 30 23:00:00 EST 2010 · OSTI ID:1011662

Combining Partial Redundancy and Checkpointing for HPC

Conference · Sat Dec 31 23:00:00 EST 2011 · OSTI ID:1081906

Implementing Software Resiliency in HPX for Extreme Scale Computing

Technical Report · Wed Apr 15 00:00:00 EDT 2020 · OSTI ID:1614897

Related Subjects

97 MATHEMATICS AND COMPUTING
