OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Automatic Fault Characterization via Abnormality-Enhanced Classification

Abstract

Enterprise and high-performance computing systems are growing extremely large and complex, employing hundreds to hundreds of thousands of processors and software/hardware stacks built by many people across many organizations. As the growing scale of these machines increases the frequency of faults, system complexity makes these faults difficult to detect and to diagnose. Current system management techniques, which focus primarily on efficient data access and query mechanisms, require system administrators to examine the behavior of various system services manually. Growing system complexity is making this manual process unmanageable: administrators require more effective management tools that can detect faults and help to identify their root causes. System administrators need timely notification when a fault is manifested that includes the type of fault, the time period in which it occurred and the processor on which it originated. Statistical modeling approaches can accurately characterize system behavior. However, the complex effects of system faults make these tools difficult to apply effectively. This paper investigates the application of classification and clustering algorithms to fault detection and characterization. We show experimentally that naively applying these methods achieves poor accuracy. Further, we design novel techniques that combine classification algorithms with information on the abnormality of application behavior to improve detection and characterization accuracy. Our experiments demonstrate that these techniques can detect and characterize faults with 65% accuracy, compared to just 5% accuracy for naive approaches.

Authors:
Bronevetsky, G; Laguna, I; de Supinski, B R
Publication Date:
December 2010
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1018832
Report Number(s):
LLNL-CONF-464273
TRN: US201114%%362
DOE Contract Number:  
W-7405-ENG-48
Resource Type:
Conference
Resource Relation:
Conference: Presented at: Conference on Dependable Systems and Networks, Hong Kong, China, Jun 27, 2011
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ACCURACY; ALGORITHMS; CLASSIFICATION; DESIGN; DETECTION; MANAGEMENT; SIMULATION

Citation Formats

Bronevetsky, G, Laguna, I, and de Supinski, B R. Automatic Fault Characterization via Abnormality-Enhanced Classification. United States: N. p., 2010. Web. doi:10.1109/DSN.2012.6263926.
Bronevetsky, G, Laguna, I, & de Supinski, B R. Automatic Fault Characterization via Abnormality-Enhanced Classification. United States. https://doi.org/10.1109/DSN.2012.6263926
Bronevetsky, G, Laguna, I, and de Supinski, B R. 2010. "Automatic Fault Characterization via Abnormality-Enhanced Classification". United States. https://doi.org/10.1109/DSN.2012.6263926. https://www.osti.gov/servlets/purl/1018832.
@article{osti_1018832,
title = {Automatic Fault Characterization via Abnormality-Enhanced Classification},
author = {Bronevetsky, G and Laguna, I and de Supinski, B R},
doi = {10.1109/DSN.2012.6263926},
url = {https://www.osti.gov/biblio/1018832},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2010},
month = {12}
}
