Understanding scale-dependent soft-error behavior of scientific applications

Kestor, Gokcen G.; Peng, Ivy B.; Gioiosa, Roberto; Krishnamoorthy, Sriram

doi:10.1109/CCGRID.2018.00075

Conference · Fri Jul 13 2018

DOI: https://doi.org/10.1109/CCGRID.2018.00075 · OSTI ID: 1525479

Kestor, Gokcen G. ^[1]; Peng, Ivy B. ^[1]; Gioiosa, Roberto ^[1]; Krishnamoorthy, Sriram ^[2]

[1] Oak Ridge National Laboratory
[2] BATTELLE (PACIFIC NW LAB)

Analyzing application fault behavior on large-scale systems is time-consuming and resource-demanding. Currently, researchers need to perform fault injection campaigns at full scale to understand the effects of soft errors on applications and whether these faults result in silent data corruption. Both time and resource requirements greatly limit the scope of the resilience studies that can be currently performed. In this work, we propose a methodology to model application fault behavior at large scale based on a reduced set of experiments performed at small scale. We employ machine learning techniques to accurately model application fault behavior using a set of experiments that can be executed in parallel at small scale. Our methodology drastically reduces the set and the scale of the fault injection experiments to be performed and provides a validated methodology to study application fault behavior at large scale. We show that our methodology can accurately model application fault behavior at large scale by using only small scale experiments. In some cases, we can model the fault behavior of a parallel application running on 4,096 cores with about 90% accuracy based on experiments on a single core.

Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-76RL01830

OSTI ID: 1525479

Report Number(s): PNNL-SA-132744

Resource Relation: Conference: Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2018), May 1-4, 2018, Washington DC

Country of Publication: United States

Language: English

Similar Records

Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications

Conference · Tue May 01 2018 · OSTI ID:1525479

Kestor Gioiosa, Gokcen; Peng, Ivy Bo; Gioiosa, Roberto; +1 more

Center for Technology for Advanced Scientific Component Software (TASCS)

Technical Report · Sun Oct 31 2010 · OSTI ID:1525479

Govindaraju, Madhusudhan

A Case for Soft Error Detection and Correction in Computational Chemistry

Journal Article · Tue Sep 10 2013 · Journal of Chemical Theory and Computation, 9(9):3995-4005 · OSTI ID:1525479

van Dam, Hubertus JJ; Vishnu, Abhinav; De Jong, Wibe A.
