OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Online Monitoring System for Performance Fault Detection

Journal Article · Parallel Processing Letters
 [1];  [1];  [1]
  1. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

To achieve exaFLOPS performance within a contained power budget, next-generation supercomputers will feature hundreds of millions of components operating at low- and near-threshold voltage. As the probability that at least one of these components fails during the execution of an application approaches certainty, it seems unrealistic to expect that any run of a scientific application will not experience some performance faults. We believe that there is a need for a new generation of lightweight performance and debugging tools that can be used online, even during production runs of parallel applications, and that can identify performance anomalies during application execution. In this work we propose the design and implementation of such a monitoring system.

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1225167
Report Number(s):
PNNL-SA-105705; KJ0402000
Journal Information:
Parallel Processing Letters, Vol. 24, Issue 4; ISSN 0129-6264
Publisher:
World Scientific
Country of Publication:
United States
Language:
English