OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Toward General Software Level Silent Data Corruption Detection for Parallel Applications

Abstract

Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. Mechanisms have been proposed that are able to detect SDC in HPC applications by using the peculiarities of the data (more specifically, its “smoothness” in time and space) to make predictions. However, these data-analytic solutions are still far from fully protecting applications to a level comparable with more expensive solutions such as full replication. In this work, we propose partial replication to overcome this limitation. More specifically, we have observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, we can smartly choose and replicate only those processes for which the lightweight data-analytic detectors would perform poorly. In addition, we propose a new evaluation method based on the probability that a corruption will pass unnoticed by a particular detector (instead of just reporting overall single-bit precision and recall). Here in our experiments, we use four applications dealing with different explosions. Finally, our results indicate that our new approach can protect the MPI applications analyzed with 7–70% less overhead (depending on the application) than that of full duplication with similar detection recall.
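
The idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the linear-extrapolation predictor, the function names (`predict_next`, `is_suspicious`, `variability`, `choose_replicated`), and the thresholds are all assumptions chosen for illustration. It shows a smoothness-in-time detector that flags values far from a prediction, and a selection step that marks for replication only those ranks whose data is too variable for such a detector to work well.

```python
from statistics import mean

def predict_next(history):
    """Assumed predictor: linear extrapolation from the last two values."""
    return 2 * history[-1] - history[-2]

def is_suspicious(history, observed, tolerance):
    """Flag a value whose distance from the prediction exceeds the tolerance."""
    return abs(observed - predict_next(history)) > tolerance

def variability(history):
    """Mean absolute prediction error over one rank's own history."""
    errors = [
        abs(history[i] - predict_next(history[:i]))
        for i in range(2, len(history))
    ]
    return mean(errors)

def choose_replicated(histories, threshold):
    """Return the ranks whose data is too variable for the lightweight detector.

    `histories` maps a (hypothetical) rank id to its recent data values.
    """
    return sorted(
        rank for rank, h in histories.items() if variability(h) > threshold
    )

# Rank 0 evolves smoothly; rank 1 does not, so only rank 1 is replicated.
histories = {
    0: [1.0, 2.0, 3.0, 4.0, 5.0],
    1: [1.0, 5.0, 2.0, 9.0, 0.0],
}
replicated = choose_replicated(histories, threshold=1.0)
```

In this toy setup the smooth rank is left to the predictor-based check, while the erratic rank would be protected by replication instead. A real detector would also have to choose the tolerance adaptively, which this sketch does not attempt.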

Authors:
Berrocal, Eduardo [1]; Bautista-Gomez, Leonardo [2]; Di, Sheng [3]; Lan, Zhiling [1]; Cappello, Franck [3]
  1. Illinois Inst. of Technology, Chicago, IL (United States). Dept. of Computer Science
  2. Barcelona Supercomputing Center, Barcelona (Spain)
  3. Argonne National Lab. (ANL), Argonne, IL (United States)
Publication Date:
2017-08-04
Research Org.:
Bettis Atomic Power Laboratory (BAPL), West Mifflin, PA (United States)
Sponsoring Org.:
National Science Foundation (NSF); Institut national de recherche en informatique et en automatique (INRIA); Agence Nationale de la recherche (ANR); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1413980
Grant/Contract Number:  
AC02-06CH11357
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
Additional Journal Information:
Journal Volume: 28; Journal Issue: 12; Journal ID: ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Data Analysis; High-Performance Computing; Parallel Applications; Partial Replication; Silent Data Corruption Detection

Citation Formats

Berrocal, Eduardo, Bautista-Gomez, Leonardo, Di, Sheng, Lan, Zhiling, and Cappello, Franck. Toward General Software Level Silent Data Corruption Detection for Parallel Applications. United States: N. p., 2017. Web. doi:10.1109/TPDS.2017.2735971.
Berrocal, Eduardo, Bautista-Gomez, Leonardo, Di, Sheng, Lan, Zhiling, & Cappello, Franck. Toward General Software Level Silent Data Corruption Detection for Parallel Applications. United States. doi:10.1109/TPDS.2017.2735971.
Berrocal, Eduardo, Bautista-Gomez, Leonardo, Di, Sheng, Lan, Zhiling, and Cappello, Franck. 2017. "Toward General Software Level Silent Data Corruption Detection for Parallel Applications". United States. doi:10.1109/TPDS.2017.2735971. https://www.osti.gov/servlets/purl/1413980.
@article{osti_1413980,
title = {Toward General Software Level Silent Data Corruption Detection for Parallel Applications},
author = {Berrocal, Eduardo and Bautista-Gomez, Leonardo and Di, Sheng and Lan, Zhiling and Cappello, Franck},
abstractNote = {Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. Mechanisms have been proposed that are able to detect SDC in HPC applications by using the peculiarities of the data (more specifically, its “smoothness” in time and space) to make predictions. However, these data-analytic solutions are still far from fully protecting applications to a level comparable with more expensive solutions such as full replication. In this work, we propose partial replication to overcome this limitation. More specifically, we have observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, we can smartly choose and replicate only those processes for which the lightweight data-analytic detectors would perform poorly. In addition, we propose a new evaluation method based on the probability that a corruption will pass unnoticed by a particular detector (instead of just reporting overall single-bit precision and recall). Here in our experiments, we use four applications dealing with different explosions. Finally, our results indicate that our new approach can protect the MPI applications analyzed with 7–70% less overhead (depending on the application) than that of full duplication with similar detection recall.},
doi = {10.1109/TPDS.2017.2735971},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 12,
volume = 28,
place = {United States},
year = {2017},
month = aug
}

