2009 fault tolerance for extreme-scale computing workshop, Albuquerque, NM - March 19-20, 2009.
This is a report on the third in a series of petascale workshops co-sponsored by Blue Waters and TeraGrid to address challenges and opportunities in making effective use of emerging extreme-scale computing. The workshop was held to discuss fault tolerance on large systems for running large, possibly long-running applications. Its main purpose was to bring together systems people, middleware people (including fault-tolerance experts), and applications people to discuss the issues and determine what needs to be done, mostly at the middleware and application levels, to run such applications on the emerging petascale systems without having faults cause large numbers of application failures.

The workshop found considerable interest in fault tolerance, resilience, and reliability of high-performance computing (HPC) systems in general, at all levels of HPC. The only way to recover from faults is through some form of redundancy, either in space or in time. Redundancy in time, in the form of writing checkpoints to disk and restarting from the most recent checkpoint after a fault causes an application to crash or halt, is the most common tool used in applications today, but there are questions about how long this approach can remain viable.
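To make the checkpoint/restart idea concrete, below is a minimal sketch in C of redundancy in time: a long-running loop periodically writes its full state to disk, and after a crash the program reloads the most recent checkpoint and resumes from that iteration instead of starting over. This is an illustration, not a method from the report; the file name "state.ckpt", the State struct, and the interval constants are all hypothetical stand-ins for a real application's state and I/O strategy.

```c
/* Minimal checkpoint/restart sketch (illustrative; all names hypothetical). */
#include <stdio.h>
#include <stdlib.h>

#define CKPT_FILE "state.ckpt"   /* hypothetical checkpoint path */
#define TOTAL_STEPS 1000000L
#define CKPT_INTERVAL 10000L     /* steps between checkpoints */

typedef struct {
    long step;       /* how far the computation has progressed */
    double value;    /* stand-in for the real application state */
} State;

/* Write the state to a temp file, then rename, so a crash mid-write
 * never corrupts the previous good checkpoint. */
static void write_checkpoint(const State *s) {
    FILE *f = fopen(CKPT_FILE ".tmp", "wb");
    if (!f) { perror("checkpoint open"); return; }
    fwrite(s, sizeof *s, 1, f);
    fclose(f);
    rename(CKPT_FILE ".tmp", CKPT_FILE);
}

/* Returns 1 if a checkpoint was found and loaded, 0 otherwise. */
static int read_checkpoint(State *s) {
    FILE *f = fopen(CKPT_FILE, "rb");
    if (!f) return 0;
    size_t n = fread(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1;
}

int main(void) {
    State s = { .step = 0, .value = 0.0 };

    /* Redundancy in time: on restart, resume from the last saved state. */
    if (read_checkpoint(&s))
        printf("restarting from step %ld\n", s.step);

    for (; s.step < TOTAL_STEPS; s.step++) {
        s.value += 1.0 / (double)(s.step + 1);   /* the "computation" */
        if (s.step % CKPT_INTERVAL == 0)
            write_checkpoint(&s);                /* periodic checkpoint */
    }
    printf("done: value = %f\n", s.value);
    return 0;
}
```

The tension the abstract points to is visible even in this toy: checkpointing more often shrinks the work lost to a fault but adds I/O cost, and as machines scale, both the fault rate and the time to write a full checkpoint grow, squeezing the viable checkpoint interval from both sides.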
OSTI Identifier: 971988
DOE Contract Number: DE-AC02-06CH11357
Resource Type: Technical Report
Research Org.: Argonne National Laboratory (ANL)
Country of Publication: United States
Subject: 97 MATHEMATICAL METHODS AND COMPUTING; 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; SUPERCOMPUTERS; ERRORS; MITIGATION; RELIABILITY