OSTI.GOV: U.S. Department of Energy, Office of Scientific and Technical Information

Title: RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

Journal Article · International Journal of Parallel Programming

In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance.
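
As a rough illustration of the general RMT idea described in the abstract (replicated execution streams inside one process, with detection and correction of disagreeing results), a minimal Python sketch might look like the following. It is not the RedThreads interface: the names `run_redundant`, `checked_call`, and `FlakyDot` are hypothetical, the fault is simulated by deliberately corrupting one call, and the choice of a third tie-breaking run with majority voting is an assumption made only for this example.

```python
import threading
from collections import Counter


def run_redundant(fn, args, replicas=2):
    """Run fn(*args) on `replicas` threads of this process; return all results."""
    results = [None] * replicas

    def worker(slot):
        results[slot] = fn(*args)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(replicas)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def checked_call(fn, args):
    """Detect a mismatch between two replicas and correct it by majority vote.

    Returns (value, corrected), where `corrected` is True if a mismatch was seen.
    Results must be hashable and comparable with ==.
    """
    first, second = run_redundant(fn, args, replicas=2)
    if first == second:
        return first, False
    # Detection: the replicas disagree, so run a third execution as tie-breaker.
    third = fn(*args)
    value, votes = Counter([first, second, third]).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("replicas disagree and no majority exists")
    return value, True


class FlakyDot:
    """Integer dot product that corrupts one chosen call (a simulated fault)."""

    def __init__(self, faulty_call):
        self._lock = threading.Lock()
        self._calls = 0
        self._faulty_call = faulty_call  # 1-based call index to corrupt; 0 = never

    def __call__(self, xs, ys):
        with self._lock:
            self._calls += 1
            call_no = self._calls
        total = sum(x * y for x, y in zip(xs, ys))
        return total + 1 if call_no == self._faulty_call else total


if __name__ == "__main__":
    print(checked_call(FlakyDot(faulty_call=1), ([1, 2, 3], [4, 5, 6])))
    print(checked_call(FlakyDot(faulty_call=0), ([1, 2, 3], [4, 5, 6])))
```

Every protected call in this sketch is executed at least twice, which reflects the overhead of complete redundancy that the abstract points out.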

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1411881
Report Number(s):
SAND-2017-1596J; PII: 492; TRN: US1800281
Journal Information:
International Journal of Parallel Programming, Vol. 46, Issue 2; ISSN 0885-7458
Publisher:
Springer
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 11 works
Citation information provided by Web of Science
