
Title: Exploring the capabilities of support vector machines in detecting silent data corruptions

Journal Article · Sustainable Computing
 [1];  [2];  [3];  [2];  [3];  [3];  [4];  [1];  [2]
  1. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
  2. Argonne National Lab. (ANL), Argonne, IL (United States)
  3. Barcelona Supercomputing Center (Spain)
  4. Barcelona Supercomputing Center (Spain); Spanish National Research Council (CSIC), Madrid (Spain). IIIA - Artificial Intelligence Research Inst.

As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs), or silent errors, are among the major causes of corrupted execution results in HPC applications, and they go undetected. In this paper, we explore a set of novel SDC detectors, built on epsilon-insensitive support vector machine regression, to detect SDCs that occur in HPC applications. The key contributions are threefold. (1) Our exploration takes temporal, spatial, and spatiotemporal features into account and analyzes different detectors based on different features. (2) We provide an in-depth study of the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that support-vector-machine-based detectors can achieve detection sensitivity (i.e., recall) of up to 99% while incurring a false positive rate below 1% in most cases. Our detectors incur low performance overhead, 5% on average, for all benchmarks studied in this work.
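
To make the general idea concrete, the following is a minimal sketch of a detector that uses temporal features and an epsilon-insensitive regression loss. It is not the implementation from the paper. The linear model, the subgradient-descent training loop, the synthetic sine-wave data, the `theta` multiplier, and the rule used to set the detection range (`radius`) are all assumptions chosen for illustration.

```python
import math

def fit_linear_svr(X, y, eps=0.01, C=10.0, lr=0.01, epochs=500):
    """Subgradient descent on 0.5*||w||^2 + (C/n) * sum(max(0, |f(x) - y| - eps))."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0      # gradient of the regulariser is w itself
        for xi, yi in zip(X, y):
            r = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            if abs(r) > eps:               # only points outside the eps-tube contribute
                s = (C / n) * (1.0 if r > 0 else -1.0)
                for j in range(d):
                    grad_w[j] += s * xi[j]
                grad_b += s
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Linear prediction f(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def temporal_features(series, lag=2):
    """Pair each value x[t] with its `lag` predecessors x[t-lag..t-1]."""
    X = [series[t - lag:t] for t in range(lag, len(series))]
    y = series[lag:]
    return X, y

def detect(series, w, b, radius, start, lag=2):
    """Return indices t >= start whose value falls outside prediction +/- radius."""
    return [t for t in range(start, len(series))
            if abs(series[t] - predict(w, b, series[t - lag:t])) > radius]

# Synthetic, smooth data: a hypothetical stand-in for one variable of an HPC application.
clean = [math.sin(0.1 * t) for t in range(200)]
train_len, eps, theta = 100, 0.01, 2.0     # illustrative values, not from the paper

X, y = temporal_features(clean[:train_len])
w, b = fit_linear_svr(X, y, eps=eps)

# Assumed detection range: eps plus theta times the worst residual on clean training data.
max_resid = max(abs(yi - predict(w, b, xi)) for xi, yi in zip(X, y))
radius = eps + theta * max_resid

corrupted = list(clean)
corrupted[150] += 1000.0                   # injected large-magnitude corruption
flagged = detect(corrupted, w, b, radius, start=train_len)
print(flagged)
```

In this sketch the injected index 150 is flagged, and the one or two steps right after it may be flagged as well because their features include the corrupted value. A wider `radius` lowers the false positive rate at the cost of missing smaller corruptions, which is the trade-off behind tuning the detection range.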

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE; National Science Foundation (NSF); European Union (EU)
Grant/Contract Number:
1619253; AC05-76RL01830; AC02-06CH11357; TIN2015-65316-P
OSTI ID:
1422782
Report Number(s):
PNNL-SA-131767; PII: S2210537917300896
Journal Information:
Sustainable Computing, Vol. 19; ISSN 2210-5379
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 9 works (citation information provided by Web of Science)


Similar Records

Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era
Conference · Jan 1, 2016 · OSTI ID:1422782

Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications
Journal Article · Oct 1, 2016 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1422782

Toward General Software Level Silent Data Corruption Detection for Parallel Applications
Journal Article · Aug 4, 2017 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1422782