Title: MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection

Abstract

Future HPC systems with ever-increasing resource capacity (such as compute cores, memory and storage) may face significantly higher reliability risks. Silent data corruptions (SDCs), or silent errors, are a major source of corrupted HPC execution results. Unlike fail-stop errors, SDCs are particularly harmful because hardware cannot detect them. We propose an online machine-learning-based silent data corruption detection framework (abbreviated as MACORD) for detecting SDCs in HPC applications. In particular, we comprehensively investigate the prediction ability of a multitude of machine-learning algorithms, and enable the detector to automatically select the best-fit algorithms at runtime to adapt to the data dynamics. Our learning framework exhibits low memory overhead (less than 1%), since it uses only spatial features (i.e., neighboring data values for each data point in the current time step) as training data. Experiments based on real-world scientific applications/benchmarks show that our framework achieves a detection sensitivity (i.e., recall) of up to 99% while the false positive rate is limited to 0.1% in most cases, which is an order-of-magnitude improvement over the latest state-of-the-art spatial technique.
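The idea of predicting each data point from its spatial neighbors, choosing the best-fit predictor at runtime, and flagging large deviations can be illustrated with a minimal sketch. This is not MACORD's implementation: the three trivial predictors stand in for the machine-learning algorithms the abstract mentions, and every name here (`CANDIDATES`, `select_predictor`, `detect`, the `slack` factor, the 1-D sine field) is an assumption made for illustration only.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A predictor maps the two spatial neighbors (left, right) to an estimate of
# the point between them. These candidates are hypothetical stand-ins for
# trained machine-learning models.
Predictor = Callable[[float, float], float]

CANDIDATES: Dict[str, Predictor] = {
    "neighbor_mean": lambda left, right: 0.5 * (left + right),
    "left_copy": lambda left, right: left,
    "right_copy": lambda left, right: right,
}


def residuals(field: Sequence[float], predict: Predictor) -> List[float]:
    """Absolute prediction error for every interior point of a 1-D field."""
    return [
        abs(field[i] - predict(field[i - 1], field[i + 1]))
        for i in range(1, len(field) - 1)
    ]


def select_predictor(trusted: Sequence[float]) -> Tuple[str, float]:
    """Pick the candidate with the smallest worst-case error on trusted data."""
    scored = {name: max(residuals(trusted, p)) for name, p in CANDIDATES.items()}
    best = min(scored, key=scored.get)
    return best, scored[best]


def detect(field: Sequence[float], name: str, max_err: float,
           slack: float = 3.0) -> List[int]:
    """Return indices whose residual exceeds slack * max_err (assumed rule)."""
    threshold = slack * max_err
    errors = residuals(field, CANDIDATES[name])
    # Residual k belongs to field index k + 1 because the edges are skipped.
    return [i for i, err in enumerate(errors, start=1) if err > threshold]


# Toy data: a smooth field at one time step, then the next step with one
# value perturbed to mimic a silent corruption.
trusted = [math.sin(0.1 * i) for i in range(50)]
name, max_err = select_predictor(trusted)

current = [math.sin(0.1 * i + 0.05) for i in range(50)]
corrupted = list(current)
corrupted[20] += 0.5

print("selected predictor:", name)
print("flags on clean step:", detect(current, name, max_err))
print("flags on corrupted step:", detect(corrupted, name, max_err))
```

In this toy setup the corrupted index is flagged together with its two adjacent points, because the corrupted value also serves as a feature when predicting its neighbors. Only the previous snapshot's worst-case error is kept between steps, which loosely mirrors the low memory overhead that a spatial-only approach allows.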

Authors:
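 Subasi, Omer [1];  Di, Sheng [2];  Balaprakash, Prasanna [2];  Unsal, Osman [3];  Labarta, Jesus [3];  Cristal, Adrian [4];  Krishnamoorthy, Sriram [1];  Cappello, Franck [2]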
  1. BATTELLE (PACIFIC NW LAB)
  2. Argonne National Laboratory
  3. Barcelona Supercomputing Center
  4. IIIA - CSIC - Spanish National Research Council
Publication Date:
2017-09
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1526315
Report Number(s):
PNNL-SA-128115
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: IEEE International Conference on Cluster Computing (CLUSTER 2017), September 5-8, 2017, Honolulu, HI
Country of Publication:
United States
Language:
English

Citation Formats

Subasi, Omer, Di, Sheng, Balaprakash, Prasanna, Unsal, Osman, Labarta, Jesus, Cristal, Adrian, Krishnamoorthy, Sriram, and Cappello, Franck. MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection. United States: N. p., 2017. Web. doi:10.1109/CLUSTER.2017.128.
Subasi, Omer, Di, Sheng, Balaprakash, Prasanna, Unsal, Osman, Labarta, Jesus, Cristal, Adrian, Krishnamoorthy, Sriram, & Cappello, Franck. MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection. United States. doi:10.1109/CLUSTER.2017.128.
Subasi, Omer, Di, Sheng, Balaprakash, Prasanna, Unsal, Osman, Labarta, Jesus, Cristal, Adrian, Krishnamoorthy, Sriram, and Cappello, Franck. 2017. "MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection". United States. doi:10.1109/CLUSTER.2017.128.
@article{osti_1526315,
title = {MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection},
author = {Subasi, Omer and Di, Sheng and Balaprakash, Prasanna and Unsal, Osman and Labarta, Jesus and Cristal, Adrian and Krishnamoorthy, Sriram and Cappello, Franck},
doi = {10.1109/CLUSTER.2017.128},
place = {United States},
year = {2017},
month = {9}
}
