DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

Abstract

In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance.

Authors:
Hukerikar, Saurabh [1]; Teranishi, Keita [2]; Diniz, Pedro C. [1]; Lucas, Robert F. [1]
  1. Univ. of Southern California, Marina del Rey, CA (United States)
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Publication Date:
2017-02-11
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1411881
Report Number(s):
SAND-2017-1596J
Journal ID: ISSN 0885-7458; PII: 492; TRN: US1800281
Grant/Contract Number:  
AC04-94AL85000
Resource Type:
Accepted Manuscript
Journal Name:
International Journal of Parallel Programming
Additional Journal Information:
Journal Volume: 46; Journal Issue: 2; Journal ID: ISSN 0885-7458
Publisher:
Springer
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Resilience; Exascale; Redundant multithreading; Programming models; Runtime systems; Fault tolerance

Citation Formats

Hukerikar, Saurabh, Teranishi, Keita, Diniz, Pedro C., and Lucas, Robert F. RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading. United States: N. p., 2017. Web. doi:10.1007/s10766-017-0492-3.
Hukerikar, Saurabh, Teranishi, Keita, Diniz, Pedro C., & Lucas, Robert F. RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading. United States. https://doi.org/10.1007/s10766-017-0492-3
Hukerikar, Saurabh, Teranishi, Keita, Diniz, Pedro C., and Lucas, Robert F. 2017. "RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading". United States. https://doi.org/10.1007/s10766-017-0492-3. https://www.osti.gov/servlets/purl/1411881.
@article{osti_1411881,
title = {RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading},
author = {Hukerikar, Saurabh and Teranishi, Keita and Diniz, Pedro C. and Lucas, Robert F.},
abstractNote = {In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance.},
doi = {10.1007/s10766-017-0492-3},
journal = {International Journal of Parallel Programming},
number = 2,
volume = 46,
place = {United States},
year = {2017},
month = {2}
}

Citation Metrics:
Cited by: 11 works
Citation information provided by
Web of Science

Works referenced in this record:

The International Exascale Software Project roadmap
journal, January 2011

  • Dongarra, Jack; Beckman, Pete; Moore, Terry
  • The International Journal of High Performance Computing Applications, Vol. 25, Issue 1
  • DOI: 10.1177/1094342010391989

Transient-fault recovery using simultaneous multithreading
conference, January 2002

  • Vijaykumar, T. N.; Pomeranz, I.; Cheng, K.
  • Proceedings 29th Annual International Symposium on Computer Architecture
  • DOI: 10.1109/ISCA.2002.1003565

Rolex: resilience-oriented language extensions for extreme-scale systems
journal, May 2016


Multicore soft error rate stabilization using adaptive dual modular redundancy
conference, March 2010

  • Vadlamani, Ramakrishna; Burleson, Wayne
  • 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010)
  • DOI: 10.1109/DATE.2010.5457242

Opportunistic Transient-Fault Detection
conference, January 2005

  • Gomaa, M. A.; Vijaykumar, T. N.
  • 32nd International Symposium on Computer Architecture (ISCA'05)
  • DOI: 10.1109/ISCA.2005.38

SWIFT: Software Implemented Fault Tolerance
conference, January 2005

  • Reis, G. A.; Chang, J.; Vachharajani, N.
  • International Symposium on Code Generation and Optimization
  • DOI: 10.1109/CGO.2005.34

IBM's S/390 G5 microprocessor design
journal, January 1999

  • Slegel, T. J.; Averill, R. M.; Check, M. A.
  • IEEE Micro, Vol. 19, Issue 2
  • DOI: 10.1109/40.755464

Self-stabilizing iterative solvers
conference, January 2013

  • Sao, Piyush; Vuduc, Richard
  • Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '13
  • DOI: 10.1145/2530268.2530272

NonStop® Advanced Architecture
conference, January 2005

  • Bernick, D.; Bruckert, B.; Vigna, P. D.
  • 2005 International Conference on Dependable Systems and Networks (DSN'05)
  • DOI: 10.1109/DSN.2005.70

Error detection by duplicated instructions in super-scalar processors
journal, March 2002

  • Oh, N.; Shirvani, P. P.; McCluskey, E. J.
  • IEEE Transactions on Reliability, Vol. 51, Issue 1
  • DOI: 10.1109/24.994913

An evaluation of lazy fault detection based on Adaptive Redundant Multithreading
conference, September 2014

  • Hukerikar, Saurabh; Teranishi, Keita; Diniz, Pedro C.
  • 2014 IEEE High Performance Extreme Computing Conference (HPEC)
  • DOI: 10.1109/HPEC.2014.7040999

Transient fault detection via simultaneous multithreading
conference, January 2000

  • Reinhardt, Steven K.; Mukherjee, Shubhendu S.
  • Proceedings of the 27th annual international symposium on Computer architecture - ISCA '00
  • DOI: 10.1145/339647.339652

Clear: cross-layer exploration for architecting resilience combining hardware and software techniques to tolerate soft errors in processor cores
conference, January 2016

  • Cheng, Eric; Bose, Pradip; Mitra, Subhasish
  • Proceedings of the 53rd Annual Design Automation Conference - DAC '16
  • DOI: 10.1145/2897937.2897996

Detailed design and evaluation of redundant multithreading alternatives
journal, May 2002

  • Mukherjee, Shubhendu S.; Kontz, Michael; Reinhardt, Steven K.
  • ACM SIGARCH Computer Architecture News, Vol. 30, Issue 2
  • DOI: 10.1145/545214.545227

Opportunistic application-level fault detection through adaptive redundant multithreading
conference, July 2014

  • Hukerikar, Saurabh; Diniz, Pedro C.; Lucas, Robert F.
  • 2014 International Conference on High Performance Computing & Simulation (HPCS)
  • DOI: 10.1109/HPCSim.2014.6903692

DIVA: a reliable substrate for deep submicron microarchitecture design
conference, January 1999

  • Austin, T. M.
  • MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture
  • DOI: 10.1109/MICRO.1999.809458

Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection
conference, March 2007

  • Wang, Cheng; Kim, Ho-seop; Wu, Youfeng
  • International Symposium on Code Generation and Optimization (CGO'07)
  • DOI: 10.1109/CGO.2007.7

ROSE::FTTransform - A source-to-source translation framework for exascale fault-tolerance research
conference, June 2012

  • Lidman, Jacob; Quinlan, Daniel J.; Liao, Chunhua
  • 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W)
  • DOI: 10.1109/DSNW.2012.6264672

DAFT: decoupled acyclic fault tolerance
conference, January 2010

  • Zhang, Yun; Lee, Jae W.; Johnson, Nick P.
  • Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10
  • DOI: 10.1145/1854273.1854289

Evaluating the viability of process replication reliability for exascale systems
conference, January 2011

  • Ferreira, Kurt; Stearley, Jon; Laros, James H.
  • Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11
  • DOI: 10.1145/2063384.2063443

Does partial replication pay off?
conference, June 2012

  • Stearley, Jon; Ferreira, Kurt; Robinson, David
  • 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W)
  • DOI: 10.1109/DSNW.2012.6264669

PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures
journal, April 2009

  • Shye, A.; Blomstedt, J.; Moseley, T.
  • IEEE Transactions on Dependable and Secure Computing, Vol. 6, Issue 2
  • DOI: 10.1109/TDSC.2008.62

Error Correction Coding
book, May 2005


SlicK: slice-based locality exploitation for efficient redundant multithreading
journal, October 2006

  • Parashar, Angshuman; Sivasubramaniam, Anand; Gurumurthi, Sudhanva
  • ACM SIGOPS Operating Systems Review, Vol. 40, Issue 5
  • DOI: 10.1145/1168917.1168870

Opportunistic Transient-Fault Detection
journal, May 2005

  • Gomaa, Mohamed A.; Vijaykumar, T. N.
  • ACM SIGARCH Computer Architecture News, Vol. 33, Issue 2
  • DOI: 10.1145/1080695.1069985

Transient-fault recovery using simultaneous multithreading
journal, May 2002

  • Vijaykumar, T. N.; Pomeranz, Irith; Cheng, Karl
  • ACM SIGARCH Computer Architecture News, Vol. 30, Issue 2
  • DOI: 10.1145/545214.545226

Balancing soft error coverage with lifetime reliability in redundantly multithreaded processors
conference, September 2009

  • Siddiqua, T.; Gurumurthi, S.
  • 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS)
  • DOI: 10.1109/mascot.2009.5363142

Works referencing / citing this record:

Multi-Threaded Mitigation of Radiation-Induced Soft Errors in Bare-Metal Embedded Systems
journal, December 2019

  • Serrano-Cases, Alejandro; Restrepo-Calle, Felipe; Cuenca-Asensi, Sergio
  • Journal of Electronic Testing, Vol. 36, Issue 1
  • DOI: 10.1007/s10836-019-05846-4