Title: Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications

Conference

Analyzing application fault behavior on large-scale systems is time-consuming and resource-demanding. Currently, researchers need to perform fault injection campaigns at full scale to understand the effects of soft errors on applications and whether these faults result in silent data corruption. These time and resource requirements greatly limit the scope of the resilience studies that can currently be performed. In this work, we propose a methodology to model application fault behavior at large scale based on a reduced set of experiments performed at small scale. We employ machine learning techniques to accurately model application fault behavior using a set of experiments that can be executed in parallel at small scale. Our methodology drastically reduces both the number and the scale of the fault injection experiments to be performed and provides a validated way to study application fault behavior at large scale. We show that it can accurately model application fault behavior at large scale using only small-scale experiments. In some cases, we can model the fault behavior of a parallel application running on 4,096 cores with about 90% accuracy based on experiments on a single core.

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1474572
Resource Relation:
Conference: 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Washington, D.C., United States of America, May 1-4, 2018
Country of Publication:
United States
Language:
English

Similar Records

Understanding scale-dependent soft-error behavior of scientific applications
Conference · July 13, 2018 · OSTI ID:1474572

Center for Technology for Advanced Scientific Component Software (TASCS)
Technical Report · October 31, 2010 · OSTI ID:1474572

A Case for Soft Error Detection and Correction in Computational Chemistry
Journal Article · September 10, 2013 · Journal of Chemical Theory and Computation, 9(9):3995-4005 · OSTI ID:1474572