
Title: Fault tolerance in an inner-outer solver: A GVR-enabled case study

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they account for much of the run time of many scientific applications. We show that single bit-flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both the inner and outer solvers, appropriate across a range of error rates. We implement them by extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable and rich error checking and recovery. Lastly, experimental results validate correct execution with low performance overhead under varied error conditions.
 [1] ;  [1] ;  [2]
  1. Univ. of Chicago, Chicago, IL (United States)
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
OSTI Identifier:
Report Number(s):
Journal ID: ISSN 0302-9743; 562108
Grant/Contract Number:
Accepted Manuscript
Journal Name:
Lecture Notes in Computer Science
Additional Journal Information:
Journal Volume: 8969; Journal ID: ISSN 0302-9743
Research Org:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org:
USDOE National Nuclear Security Administration (NNSA)
Country of Publication:
United States
Subject:
97 MATHEMATICS AND COMPUTING; resilience; numerical solver; high performance computing