OSTI.GOV, U.S. Department of Energy Office of Scientific and Technical Information

Title: Detecting Silent Data Corruption for Extreme-Scale Applications through Data Mining

Abstract

Supercomputers allow scientists to study natural phenomena by means of computer simulations. Next-generation machines are expected to have more components and, at the same time, consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption solely based on the behavior of the application datasets and is completely application-agnostic. We propose multiple corruption detectors, and we couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect the majority of the corruptions, while incurring negligible overhead. We show that with the help of these detectors, applications can have up to 80% of coverage against data corruption.

Authors:
 Bautista-Gomez, Leonardo [1]; Cappello, Franck [1]
  1. Argonne National Lab. (ANL), Argonne, IL (United States)
Publication Date:
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1177404
Report Number(s):
ANL/MCS-TM-346
109004
DOE Contract Number:  
AC02-06CH11357
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Bautista-Gomez, Leonardo, and Cappello, Franck. Detecting Silent Data Corruption for Extreme-Scale Applications through Data Mining. United States: N. p., 2014. Web. doi:10.2172/1177404.
Bautista-Gomez, Leonardo, & Cappello, Franck. (2014). Detecting Silent Data Corruption for Extreme-Scale Applications through Data Mining. United States. https://doi.org/10.2172/1177404
Bautista-Gomez, Leonardo, and Cappello, Franck. 2014. "Detecting Silent Data Corruption for Extreme-Scale Applications through Data Mining". United States. https://doi.org/10.2172/1177404. https://www.osti.gov/servlets/purl/1177404.
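@techreport{osti_1177404,
institution = {Argonne National Lab. (ANL), Argonne, IL (United States)},
number = {ANL/MCS-TM-346},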
title = {Detecting Silent Data Corruption for Extreme-Scale Applications through Data Mining},
author = {Bautista-Gomez, Leonardo and Cappello, Franck},
abstractNote = {Supercomputers allow scientists to study natural phenomena by means of computer simulations. Next-generation machines are expected to have more components and, at the same time, consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption solely based on the behavior of the application datasets and is completely application-agnostic. We propose multiple corruption detectors, and we couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect the majority of the corruptions, while incurring negligible overhead. We show that with the help of these detectors, applications can have up to 80% of coverage against data corruption.},
doi = {10.2172/1177404},
url = {https://www.osti.gov/biblio/1177404},
place = {United States},
year = {2014},
month = {1}
}