Title: Exploring the capabilities of support vector machines in detecting silent data corruptions

Abstract

As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs), or silent errors, are a major cause of corrupted execution results in HPC applications, and they go undetected. In this paper, we explore a set of novel SDC detectors, built on epsilon-insensitive support vector machine regression, to detect SDCs that occur in HPC applications. The key contributions are threefold. (1) Our exploration takes temporal, spatial, and spatiotemporal features into account and analyzes different detectors based on different features. (2) We provide an in-depth study of the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that support-vector-machine-based detectors can achieve detection sensitivity (i.e., recall) of up to 99% while keeping the false positive rate below 1% in most cases. Our detectors incur low performance overhead, 5% on average, for all benchmarks studied in this work.

Authors:
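Subasi, Omer [1]; Di, Sheng [2]; Bautista-Gomez, Leonardo [3]; Balaprakash, Prasanna [2]; Unsal, Osman [3]; Labarta, Jesus [3]; Cristal, Adrian [4]; Krishnamoorthy, Sriram [1]; Cappello, Franck [2]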
  1. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
  2. Argonne National Lab. (ANL), Argonne, IL (United States)
  3. Barcelona Supercomputing Center (Spain)
  4. Barcelona Supercomputing Center (Spain); Spanish National Research Council (CSIC), Madrid (Spain). IIIA - Artificial Intelligence Research Inst.
Publication Date:
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE; National Science Foundation (NSF); European Union (EU)
OSTI Identifier:
1422782
Report Number(s):
PNNL-SA-131767
Journal ID: ISSN 2210-5379; PII: S2210537917300896
Grant/Contract Number:  
1619253; AC05-76RL01830; AC02-06CH11357; TIN2015-65316-P
Resource Type:
Accepted Manuscript
Journal Name:
Sustainable Computing
Additional Journal Information:
Journal Volume: 19; Journal ID: ISSN 2210-5379
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Silent data corruptions; Support vector machines; HPC applications

Citation Formats

Subasi, Omer, Di, Sheng, Bautista-Gomez, Leonardo, Balaprakash, Prasanna, Unsal, Osman, Labarta, Jesus, Cristal, Adrian, Krishnamoorthy, Sriram, and Cappello, Franck. Exploring the capabilities of support vector machines in detecting silent data corruptions. United States: N. p., 2018. Web. doi:10.1016/J.SUSCOM.2018.01.004.
Subasi, Omer, Di, Sheng, Bautista-Gomez, Leonardo, Balaprakash, Prasanna, Unsal, Osman, Labarta, Jesus, Cristal, Adrian, Krishnamoorthy, Sriram, & Cappello, Franck. Exploring the capabilities of support vector machines in detecting silent data corruptions. United States. doi:10.1016/J.SUSCOM.2018.01.004.
Subasi, Omer, Di, Sheng, Bautista-Gomez, Leonardo, Balaprakash, Prasanna, Unsal, Osman, Labarta, Jesus, Cristal, Adrian, Krishnamoorthy, Sriram, and Cappello, Franck. "Exploring the capabilities of support vector machines in detecting silent data corruptions". United States. doi:10.1016/J.SUSCOM.2018.01.004. https://www.osti.gov/servlets/purl/1422782.
@article{osti_1422782,
title = {Exploring the capabilities of support vector machines in detecting silent data corruptions},
author = {Subasi, Omer and Di, Sheng and Bautista-Gomez, Leonardo and Balaprakash, Prasanna and Unsal, Osman and Labarta, Jesus and Cristal, Adrian and Krishnamoorthy, Sriram and Cappello, Franck},
doi = {10.1016/J.SUSCOM.2018.01.004},
journal = {Sustainable Computing},
volume = {19},
place = {United States},
year = {2018},
month = {2}
}

Figures / Tables:

Fig. 1: SVM classification compared with other linear classifiers.