HPC application fault-tolerance using transparent redundant computation.

Riesen, Rolf E; Laros, III, James H; Pedretti, Kevin Thomas Tauke; Oldfield, Ron A; Ferreira, Kurt Brian; Brightwell, Ronald Brian

Conference · Sat Aug 01 04:00:00 EDT 2009

OSTI ID: 971418

As the core counts of HPC machines continue to grow, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability that uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.

OSTI does not have a digital full text copy available.

Research Organization: Sandia National Laboratories

Sponsoring Organization: USDOE

DOE Contract Number: AC04-94AL85000

OSTI ID: 971418

Report Number(s): SAND2009-5267C

Country of Publication: United States

Language: English

Similar Records

rMPI

Software · Tue Aug 24 00:00:00 EDT 2010 · OSTI ID:1231486

rMPI : increasing fault resiliency in a message-passing environment.

Technical Report · Fri Apr 01 00:00:00 EDT 2011 · OSTI ID:1012733

Fault-tolerance for exascale systems.

Conference · Sun Aug 01 00:00:00 EDT 2010 · OSTI ID:1028416

Related Subjects

97 MATHEMATICS AND COMPUTING
99 GENERAL AND MISCELLANEOUS
COMPUTER CALCULATIONS
COMPUTERS
DATA TRANSMISSION
DESIGN
ERRORS
IMPLEMENTATION
PERFORMANCE
RELIABILITY
