U.S. Department of Energy
Office of Scientific and Technical Information

HPC application fault-tolerance using transparent redundant computation.

Conference · OSTI ID:971418
As the core counts of HPC machines continue to grow, fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques for ensuring progress across faults, such as coordinated checkpoint-restart, are unsuitable for machines of this scale because of their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability that uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.
Research Organization:
Sandia National Laboratories
Sponsoring Organization:
USDOE
DOE Contract Number:
AC04-94AL85000
OSTI ID:
971418
Report Number(s):
SAND2009-5267C
Country of Publication:
United States
Language:
English

Similar Records

rMPI
Software · Tue Aug 24 00:00:00 EDT 2010 · OSTI ID:1231486

rMPI : increasing fault resiliency in a message-passing environment.
Technical Report · Fri Apr 01 00:00:00 EDT 2011 · OSTI ID:1012733

Fault-tolerance for exascale systems.
Conference · Sun Aug 01 00:00:00 EDT 2010 · OSTI ID:1028416