2009 fault tolerance for extreme-scale computing workshop, Albuquerque, NM - March 19-20, 2009.

Katz, D S; Daly, J; DeBardeleben, N; Elnozahy, M; Kramer, B; Lathrop, S; Nystrom, N; Milfeld, K; Sanielevici, S; Scott, S; Votta, L; LANL; IBM; Shodor Foundation; ORNL

doi:10.2172/971988

Technical Report · February 1, 2009

DOI: https://doi.org/10.2172/971988 · OSTI ID: 971988

This is a report on the third in a series of petascale workshops co-sponsored by Blue Waters and TeraGrid to address challenges and opportunities for making effective use of emerging extreme-scale computing. The workshop was held to discuss fault tolerance on large systems running large, possibly long-running applications. Its main purpose was to have systems people, middleware people (including fault-tolerance experts), and applications people discuss the issues and determine what needs to be done, mostly at the middleware and application levels, to run such applications on the emerging petascale systems without having faults cause large numbers of application failures. The workshop found that there is considerable interest in fault tolerance, resilience, and reliability of high-performance computing (HPC) systems in general, at all levels of HPC. The only way to recover from faults is through some form of redundancy, either in space or in time. Redundancy in time, in the form of writing checkpoints to disk and restarting from the most recent checkpoint after a fault that causes an application to crash or halt, is the most common tool used in applications today, but there are questions about how long this can remain a good solution as systems and memories grow faster than I/O bandwidth to disk. There is interest both in modifications to this approach, such as checkpoints to memory, partial checkpoints, and message logging, and in alternative ideas, such as in-memory recovery using residues. We believe that systematic exploration of these ideas holds the most promise for the scientific applications community. Fault tolerance has been a topic of discussion in the HPC community for at least the past 10 years, but, much like other issues, the community has managed to put off addressing it during this period. There is a growing recognition that as systems continue to grow to petascale and beyond, the field is approaching the point where we have no choice but to address this through R&D efforts.

Research Organization: Argonne National Lab. (ANL), Argonne, IL (United States)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: DE-AC02-06CH11357

OSTI ID: 971988

Report Number(s): ANL/MCS-TM-312; TRN: US201006%%806

Country of Publication: United States

Language: English

Similar Records

PETASCALE DATA STORAGE INSTITUTE (PDSI) Final Report

Technical Report · November 26, 2012

Gibson, Garth

Fault-tolerance for exascale systems.

Conference · August 1, 2010

Riesen, Rolf E; Varela, Maria Ruiz; Ferreira, Kurt Brian

CIFTS : A coordinated infrastructure for fault-tolerant systems.

Conference · January 1, 2009

Gupta, R; Beckman, P; Park, B H; +8 more

Related Subjects

97 MATHEMATICAL METHODS AND COMPUTING
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
SUPERCOMPUTERS
ERRORS
MITIGATION
RELIABILITY
