DOE PAGES

Title: RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance.
Authors:
[1] ; [2] ; [1] ; [1]
  1. Univ. of Southern California, Marina del Rey, CA (United States)
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Publication Date:
Report Number(s):
SAND-2017-1596J
Journal ID: ISSN 0885-7458; PII: 492; TRN: US1800281
Grant/Contract Number:
AC04-94AL85000
Type:
Accepted Manuscript
Journal Name:
International Journal of Parallel Programming
Additional Journal Information:
Journal Volume: 46; Journal Issue: 2; Journal ID: ISSN 0885-7458
Publisher:
Springer
Research Org:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Resilience; Exascale; Redundant multithreading; Programming models; Runtime systems; Fault tolerance
OSTI Identifier:
1411881