Skip to main content
U.S. Department of Energy
Office of Scientific and Technical Information

Fault-tolerance for exascale systems.

Conference ·
OSTI ID:1028416

Periodic, coordinated, checkpointing to disk is the most prevalent fault tolerance method used in modern large-scale, capability class, high-performance computing (HPC) systems. Previous work has shown that as the system grows in size, the inherent synchronization of coordinated checkpoint/restart (CR) limits application scalability; at large node counts the application spends most of its time checkpointing instead of executing useful work. Furthermore, a single component failure forces an application restart from the last correct checkpoint. Suggested alternatives to coordinated CR include uncoordinated CR with message logging, redundant computation, and RAID-inspired, in-memory distributed checkpointing schemes. Each of these alternatives have differing overheads that are dependent on both the scale and communication characteristics of the application. In this work, using the Structural Simulation Toolkit (SST) simulator, we compare the performance characteristics of each of these resilience methods for a number of HPC application patterns on a number of proposed exascale machines. The result of this work provides valuable guidance on the most efficient resilience methods for exascale systems.

Research Organization:
Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1028416
Report Number(s):
SAND2010-5637C; TRN: US201122%%284
Resource Relation:
Conference: Proposed for presentation at the IEEE International Conference on Cluster Computing 2010 held September 20-24, 2010 in Herakloin, Crete, Greece.
Country of Publication:
United States
Language:
English