
Title: Statistical Fault Detection for Parallel Applications with AutomaDeD

Abstract

Today's largest systems have over 100,000 cores, with million-core systems expected over the next few years. The large component count means that these systems fail frequently and often in very complex ways, making them difficult to use and maintain. While prior work on fault detection and diagnosis has focused on faults that significantly reduce system functionality, the wide variety of failure modes in modern systems makes them likely to fail in complex ways that impair system performance but are difficult to detect and diagnose. This paper presents AutomaDeD, a statistical tool that models the timing behavior of each application task and tracks its behavior to identify any abnormalities. If any are observed, AutomaDeD can immediately detect them and report to the system administrator the task where the problem began. This identification of the fault's initial manifestation can provide administrators with valuable insight into the fault's root causes, making it significantly easier and cheaper for them to understand and repair it. Our experimental evaluation shows that AutomaDeD detects a wide range of faults immediately after they occur 80% of the time, with a low false-positive rate. Further, it identifies weaknesses of the current approach that motivate future research.
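The abstract describes modeling each task's timing behavior and reporting the task where an abnormality first appears. This record does not describe AutomaDeD's actual statistical model, so the following is only a generic sketch of that idea, not the authors' method. It assumes a swapped-in technique (a per-task mean and standard deviation with a z-score threshold), and the function name, input layout, and threshold value are all hypothetical.

```python
from statistics import mean, stdev


def first_anomalous_task(baseline, observed, threshold=3.0):
    """Return (task_id, step, z_score) for the earliest timing outlier, or None.

    Hypothetical inputs, chosen for illustration only:
      baseline: dict mapping task_id -> list of durations from fault-free runs
      observed: list of (step, task_id, duration) tuples in time order
    """
    # Fit a simple per-task timing model: mean and sample standard deviation.
    models = {task: (mean(d), stdev(d)) for task, d in baseline.items()}

    # Scan observations in time order and stop at the first outlier, so the
    # result points at where the problem began rather than where it spread.
    for step, task, duration in observed:
        mu, sigma = models[task]
        if sigma == 0:
            continue  # baseline has no variation; a z-score is undefined
        z = (duration - mu) / sigma
        if abs(z) > threshold:
            return task, step, z
    return None


# Made-up timings (seconds) for two tasks.
baseline = {
    0: [1.00, 1.02, 0.98, 1.01, 0.99],
    1: [2.00, 2.05, 1.95, 2.02, 1.98],
}
observed = [
    (0, 0, 1.01),
    (0, 1, 2.01),
    (1, 0, 1.00),
    (1, 1, 2.90),  # task 1 slows down first
    (2, 0, 1.60),  # task 0 is affected afterwards
]
result = first_anomalous_task(baseline, observed)
print(result)
```

Returning at the first outlier, rather than collecting all of them, mirrors the abstract's emphasis on the fault's initial manifestation: later slowdowns in other tasks may only be downstream effects.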

Authors:
Bronevetsky, G; Laguna, I; Bagchi, S; de Supinski, B R; Ahn, D; Schulz, M
Publication Date:
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
974392
Report Number(s):
LLNL-CONF-426254
TRN: US201007%%695
DOE Contract Number:  
W-7405-ENG-48
Resource Type:
Conference
Resource Relation:
Conference: Presented at: IEEE Workshop on Silicon Errors in Logic - System Effects, Stanford, CA, United States, Mar 23 - Mar 24, 2010
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS; DETECTION; DIAGNOSIS; EVALUATION; PERFORMANCE; REPAIR; SILICON

Citation Formats

Bronevetsky, G, Laguna, I, Bagchi, S, de Supinski, B R, Ahn, D, and Schulz, M. Statistical Fault Detection for Parallel Applications with AutomaDeD. United States: N. p., 2010. Web.
Bronevetsky, G, Laguna, I, Bagchi, S, de Supinski, B R, Ahn, D, & Schulz, M. Statistical Fault Detection for Parallel Applications with AutomaDeD. United States.
Bronevetsky, G, Laguna, I, Bagchi, S, de Supinski, B R, Ahn, D, and Schulz, M. Tue . "Statistical Fault Detection for Parallel Applications with AutomaDeD". United States. https://www.osti.gov/servlets/purl/974392.
@article{osti_974392,
title = {Statistical Fault Detection for Parallel Applications with AutomaDeD},
author = {Bronevetsky, G and Laguna, I and Bagchi, S and de Supinski, B R and Ahn, D and Schulz, M},
place = {United States},
year = {2010},
month = {3}
}
