Evaluating Application Resilience with XRay
- Louisiana State Univ., Baton Rouge, LA (United States)
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Barcelona Supercomputing Center (Spain)
The rising count and shrinking feature size of transistors within modern computers is making them increasingly vulnerable to various types of soft faults. This problem is especially acute in high-performance computing (HPC) systems used for scientific computing, because these systems include many thousands of compute cores and nodes, all of which may be utilized in a single large-scale run. The increasing vulnerability of HPC applications to errors induced by soft faults is motivating extensive work on techniques to make these applications more resiilent to such faults, ranging from generic techniques such as replication or checkpoint/restart to algorithmspecific error detection and tolerance techniques. Effective use of such techniques requires a detailed understanding of how a given application is affected by soft faults to ensure that (i) efforts to improve application resilience are spent in the code regions most vulnerable to faults and (ii) the appropriate resilience technique is applied to each code region. This paper presents XRay, a tool to view the application vulnerability to soft errors, and illustrates how XRay can be used in the context of a representative application. In addition to providing actionable insights into application behavior XRay automatically selects the number of fault injection experiments required to provide an informative view of application behavior, ensuring that the information is statistically well-grounded without performing unnecessary experiments.
- Research Organization:
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC52-07NA27344
- OSTI ID:
- 1184175
- Report Number(s):
- LLNL-TR-670535
- Country of Publication:
- United States
- Language:
- English
Similar Records
Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact. Final report
Resiliency in numerical algorithm design for extreme scale simulations