
Title: Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era

Abstract

As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs) or silent errors are one of the major sources that corrupt the execution results of HPC applications without being detected. In this work, we explore a low-memory-overhead SDC detector, by leveraging epsilon-insensitive support vector machine regression, to detect SDCs that occur in HPC applications that can be characterized by an impact error bound. The key contributions are threefold. (1) Our design takes spatial features (i.e., neighbouring data values for each data point in a snapshot) into training data, such that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that our detector can achieve the detection sensitivity (i.e., recall) up to 99% yet suffer a less than 1% of false positive rate for most cases. Our detector incurs low performance overhead, 5% on average, for all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best tradeoff considering the detection ability and overheads.
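The idea in the abstract can be illustrated with a minimal sketch: fit a regressor that predicts each data point from its spatial neighbours, then flag points whose value falls outside a detection range around the prediction. Everything concrete here is an assumption made for illustration and is not taken from the paper: the 1-D sine snapshot, the two-neighbour feature set, the plain linear model trained by subgradient descent on the epsilon-insensitive loss (a stripped-down stand-in for support vector regression, with the regularisation term and any kernel omitted), and the values of `epsilon`, `lr`, and `detection_range`.

```python
import math

def make_samples(snapshot):
    """Pair each interior point with its left and right neighbours (assumed features)."""
    return [((snapshot[i - 1], snapshot[i + 1]), snapshot[i])
            for i in range(1, len(snapshot) - 1)]

def predict(model, features):
    w_left, w_right, bias = model
    return w_left * features[0] + w_right * features[1] + bias

def train(samples, epsilon=0.01, lr=0.001, max_epochs=500):
    """Subgradient descent on the epsilon-insensitive loss max(0, |r| - epsilon)."""
    w_left = w_right = bias = 0.0
    for _ in range(max_epochs):
        updated = False
        for features, target in samples:
            residual = target - predict((w_left, w_right, bias), features)
            if abs(residual) <= epsilon:
                continue  # inside the epsilon tube: no penalty, no update
            step = lr if residual > 0 else -lr
            w_left += step * features[0]
            w_right += step * features[1]
            bias += step
            updated = True
        if not updated:
            break  # every training point already lies inside the tube
    return (w_left, w_right, bias)

def detect(model, snapshot, detection_range):
    """Indices whose value differs from the prediction by more than detection_range."""
    return [i for i, (features, target) in enumerate(make_samples(snapshot), start=1)
            if abs(target - predict(model, features)) > detection_range]

# Hypothetical smooth snapshot standing in for application data.
clean = [math.sin(2 * math.pi * i / 200) for i in range(200)]
model = train(make_samples(clean))

corrupted = list(clean)
corrupted[50] += 1.0  # hypothetical silent corruption injected at index 50

clean_flags = detect(model, clean, detection_range=0.1)
corrupt_flags = detect(model, corrupted, detection_range=0.1)
```

In this sketch the trained model is only three numbers, which loosely mirrors the low memory overhead the abstract emphasises. A corrupted value can also disturb the predictions of its immediate neighbours, so flagged indices may include the points adjacent to the corruption.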

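Authors:
Subasi, Omer; Di, Sheng; Bautista-Gomez, Leonardo; Balaprakash, Prasanna; Unsal, Osman; Labarta, Jesus; Cristal, Adrian; Cappello, Franck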
Publication Date:
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science - Office of Advanced Scientific Computing Research
OSTI Identifier:
1336035
DOE Contract Number:  
AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Conference: 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 05/16/16 - 05/19/16, Cartagena, CO
Country of Publication:
United States
Language:
English

Citation Formats

Subasi, Omer, Di, Sheng, Bautista-Gomez, Leonardo, Balaprakash, Prasanna, Unsal, Osman, Labarta, Jesus, Cristal, Adrian, and Cappello, Franck. Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era. United States: N. p., 2016. Web. doi:10.1109/CCGrid.2016.33.
Subasi, Omer, Di, Sheng, Bautista-Gomez, Leonardo, Balaprakash, Prasanna, Unsal, Osman, Labarta, Jesus, Cristal, Adrian, & Cappello, Franck. Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era. United States. https://doi.org/10.1109/CCGrid.2016.33
Subasi, Omer, Di, Sheng, Bautista-Gomez, Leonardo, Balaprakash, Prasanna, Unsal, Osman, Labarta, Jesus, Cristal, Adrian, and Cappello, Franck. 2016. "Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era". United States. https://doi.org/10.1109/CCGrid.2016.33.
@article{osti_1336035,
title = {Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era},
author = {Subasi, Omer and Di, Sheng and Bautista-Gomez, Leonardo and Balaprakash, Prasanna and Unsal, Osman and Labarta, Jesus and Cristal, Adrian and Cappello, Franck},
abstractNote = {As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs) or silent errors are one of the major sources that corrupt the execution results of HPC applications without being detected. In this work, we explore a low-memory-overhead SDC detector, by leveraging epsilon-insensitive support vector machine regression, to detect SDCs that occur in HPC applications that can be characterized by an impact error bound. The key contributions are threefold. (1) Our design takes spatial features (i.e., neighbouring data values for each data point in a snapshot) into training data, such that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that our detector can achieve the detection sensitivity (i.e., recall) up to 99% yet suffer a less than 1% of false positive rate for most cases. Our detector incurs low performance overhead, 5% on average, for all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best tradeoff considering the detection ability and overheads.},
doi = {10.1109/CCGrid.2016.33},
url = {https://www.osti.gov/biblio/1336035},
place = {United States},
year = {2016},
month = {1}
}

