SciTech Connect

Title: Online Monitoring System for Performance Fault Detection

To achieve exaFLOPS performance within a contained power budget, next-generation supercomputers will feature hundreds of millions of components operating at low- and near-threshold voltage. As the probability that at least one of these components fails during the execution of an application approaches certainty, it seems unrealistic to expect that any run of a scientific application will not experience some performance faults. We believe that there is a need for a new generation of lightweight performance and debugging tools that can be used online, even during production runs of parallel applications, and that can identify performance anomalies during application execution. In this work we propose the design and implementation of a monitoring system that continuously inspects the evolution of run
Authors:
; ;
Publication Date:
OSTI Identifier:
1197092
Report Number(s):
PNNL-SA-101741
KJ0402000
DOE Contract Number:
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW 2014), May 19-23, 2014, Phoenix, Arizona, 1475-1484
Publisher:
IEEE, Piscataway, NJ, United States (US).
Research Org:
Pacific Northwest National Laboratory (PNNL), Richland, WA (US)
Sponsoring Org:
USDOE
Country of Publication:
United States
Language:
English
Subject:
HPC; Operating System; runtime; tools; reliability