
Title: Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications

Abstract

Analyzing application fault behavior on large-scale systems is time-consuming and resource-demanding. Currently, researchers need to perform fault injection campaigns at full scale to understand the effects of soft errors on applications and whether these faults result in silent data corruption. Both time and resource requirements greatly limit the scope of the resilience studies that can currently be performed. In this work, we propose a methodology to model application fault behavior at large scale based on a reduced set of experiments performed at small scale. We employ machine learning techniques to accurately model application fault behavior using a set of experiments that can be executed in parallel at small scale. Our methodology drastically reduces the number and the scale of the fault injection experiments to be performed and provides a validated approach to studying application fault behavior at large scale. We show that our methodology can accurately model application fault behavior at large scale using only small-scale experiments. In some cases, we can model the fault behavior of a parallel application running on 4,096 cores with about 90% accuracy based on experiments on a single core.

Authors:
Kestor Gioiosa, Gokcen [1]; Peng, Ivy Bo [1]; Gioiosa, Roberto [1]; Krishnamoorthy, Sriram [2]
  1. Oak Ridge National Laboratory (ORNL)
  2. Pacific Northwest National Laboratory (PNNL)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1474572
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Washington, DC, United States of America, 5/1/2018 to 5/4/2018
Country of Publication:
United States
Language:
English

Citation Formats

Kestor Gioiosa, Gokcen, Peng, Ivy Bo, Gioiosa, Roberto, and Krishnamoorthy, Sriram. Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications. United States: N. p., 2018. Web. doi:10.1109/CCGRID.2018.00075.
Kestor Gioiosa, Gokcen, Peng, Ivy Bo, Gioiosa, Roberto, & Krishnamoorthy, Sriram. Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications. United States. doi:10.1109/CCGRID.2018.00075.
Kestor Gioiosa, Gokcen, Peng, Ivy Bo, Gioiosa, Roberto, and Krishnamoorthy, Sriram. "Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications". United States. doi:10.1109/CCGRID.2018.00075. https://www.osti.gov/servlets/purl/1474572.
@article{osti_1474572,
title = {Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications},
author = {Kestor Gioiosa, Gokcen and Peng, Ivy Bo and Gioiosa, Roberto and Krishnamoorthy, Sriram},
abstractNote = {Analyzing application fault behavior on large-scale systems is time-consuming and resource-demanding. Currently, researchers need to perform fault injection campaigns at full scale to understand the effects of soft errors on applications and whether these faults result in silent data corruption. Both time and resource requirements greatly limit the scope of the resilience studies that can be currently performed. In this work, we propose a methodology to model application fault behavior at large scale based on a reduced set of experiments performed at small scale. We employ machine learning techniques to accurately model application fault behavior using a set of experiments that can be executed in parallel at small scale. Our methodology drastically reduces the set and the scale of the fault injection experiments to be performed and provides a validated methodology to study application fault behavior at large scale. We show that our methodology can accurately model application fault behavior at large scale by using only small scale experiments. In some cases, we can model the fault behavior of a parallel application running on 4,096 cores with about 90% accuracy based on experiments on a single core.},
doi = {10.1109/CCGRID.2018.00075},
place = {United States},
year = {2018},
month = {5}
}

