U.S. Department of Energy
Office of Scientific and Technical Information

RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

Journal Article · International Journal of Parallel Programming
 [1];  [2];  [1];  [1]
  1. Univ. of Southern California, Marina del Rey, CA (United States)
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance.
Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1411881
Report Number(s):
SAND--2017-1596J; PII: 492
Journal Information:
International Journal of Parallel Programming, Vol. 46, Issue 2; ISSN 0885-7458
Publisher:
Springer
Country of Publication:
United States
Language:
English
