SciTech Connect

Title: Evaluating Application Resilience with XRay

The rising count and shrinking feature size of transistors within modern computers are making them increasingly vulnerable to various types of soft faults. This problem is especially acute in high-performance computing (HPC) systems used for scientific computing, because these systems include many thousands of compute cores and nodes, all of which may be utilized in a single large-scale run. The increasing vulnerability of HPC applications to errors induced by soft faults is motivating extensive work on techniques to make these applications more resilient to such faults, ranging from generic techniques such as replication or checkpoint/restart to algorithm-specific error detection and tolerance techniques. Effective use of such techniques requires a detailed understanding of how a given application is affected by soft faults to ensure that (i) efforts to improve application resilience are spent in the code regions most vulnerable to faults and (ii) the appropriate resilience technique is applied to each code region. This paper presents XRay, a tool for viewing an application's vulnerability to soft errors, and illustrates how XRay can be used in the context of a representative application. In addition to providing actionable insights into application behavior, XRay automatically selects the number of fault injection experiments required to provide an informative view of application behavior, ensuring that the information is statistically well-grounded without performing unnecessary experiments.
 [1] ;  [2] ;  [1] ;  [3] ;  [1]
  1. Louisiana State Univ., Baton Rouge, LA (United States)
  2. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  3. Barcelona Supercomputing Center (Spain)
Publication Date:
OSTI Identifier:
Report Number(s):
DOE Contract Number:
Resource Type:
Technical Report
Research Org:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Org:
Country of Publication:
United States