RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading
- Univ. of Southern California, Marina del Rey, CA (United States)
- Sandia National Lab. (SNL-CA), Livermore, CA (United States)
In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance.
- Research Organization:
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- Grant/Contract Number:
- AC04-94AL85000
- OSTI ID:
- 1411881
- Report Number(s):
- SAND-2017-1596J; PII: 492; TRN: US1800281
- Journal Information:
- International Journal of Parallel Programming, Vol. 46, Issue 2; ISSN 0885-7458
- Publisher:
- Springer
- Country of Publication:
- United States
- Language:
- English
Similar Records
File I/O for MPI Applications in Redundant Execution Scenarios
Concurrent detection of transient faults in microprocessors