DOE PAGES

Title: Fault tolerance in an inner-outer solver: A GVR-enabled case study

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, which account for much of the run time of many scientific applications. We show that single bit-flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several fault-tolerance strategies for both the inner and outer solvers, appropriate across a range of error rates. We implement them by extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable, rich error checking and recovery. Finally, experimental results validate correct execution with low performance overhead under varied error conditions.
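To make the notion of a single bit-flip error concrete, the following is a minimal Python sketch of flipping one bit in the IEEE 754 double-precision encoding of a value. It is an illustration only, not the fault-injection method used in the paper; the helper name `flip_bit` and the bit-numbering convention (0 is the least significant bit) are assumptions introduced here.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding inverted.

    Hypothetical helper for illustration: bit 0 is the least significant
    mantissa bit, bits 52-62 are the exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 64)")
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the selected bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    for bit in (0, 51, 62, 63):
        print(bit, flip_bit(1.0, bit))
```

The effect of a flip depends strongly on which bit is hit: a low mantissa bit changes the value only slightly, while an exponent or sign bit can change it drastically, which is one reason a single flip may cost anything from a few extra iterations to a wrong result.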
Authors:
 [1] ;  [1] ;  [2]
  1. Univ. of Chicago, Chicago, IL (United States)
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
OSTI Identifier:
1237365
Report Number(s):
SAND--2015-0174J
Journal ID: ISSN 0302-9743; 562108
Grant/Contract Number:
AC04-94AL85000
Type:
Accepted Manuscript
Journal Name:
Lecture Notes in Computer Science
Additional Journal Information:
Journal Volume: 8969; Journal ID: ISSN 0302-9743
Publisher:
Springer
Research Org:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org:
USDOE National Nuclear Security Administration (NNSA)
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; resilience; numerical solver; high performance computing