Title: Fault tolerance in an inner-outer solver: A GVR-enabled case study

Abstract

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Lastly, experimental results validate correct execution with low performance overhead under varied error conditions.

Authors:
Zhang, Ziming [1]; Chien, Andrew A. [1]; Teranishi, Keita [2]
  1. Univ. of Chicago, Chicago, IL (United States)
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1237365
Report Number(s):
SAND-2015-0174J
Journal ID: ISSN 0302-9743; 562108
Grant/Contract Number:  
AC04-94AL85000
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Lecture Notes in Computer Science
Additional Journal Information:
Journal Volume: 8969; Journal ID: ISSN 0302-9743
Publisher:
Springer
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; resilience; numerical solver; high performance computing

Citation Formats

Zhang, Ziming, Chien, Andrew A., and Teranishi, Keita. Fault tolerance in an inner-outer solver: A GVR-enabled case study. United States: N. p., 2015. Web. doi:10.1007/978-3-319-17353-5_11.
Zhang, Ziming, Chien, Andrew A., & Teranishi, Keita. Fault tolerance in an inner-outer solver: A GVR-enabled case study. United States. https://doi.org/10.1007/978-3-319-17353-5_11
Zhang, Ziming, Chien, Andrew A., and Teranishi, Keita. 2015. "Fault tolerance in an inner-outer solver: A GVR-enabled case study". United States. https://doi.org/10.1007/978-3-319-17353-5_11. https://www.osti.gov/servlets/purl/1237365.
@article{osti_1237365,
title = {Fault tolerance in an inner-outer solver: A GVR-enabled case study},
author = {Zhang, Ziming and Chien, Andrew A. and Teranishi, Keita},
abstractNote = {Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Lastly, experimental results validate correct execution with low performance overhead under varied error conditions.},
doi = {10.1007/978-3-319-17353-5_11},
url = {https://www.osti.gov/biblio/1237365},
journal = {Lecture Notes in Computer Science},
issn = {0302-9743},
volume = 8969,
place = {United States},
year = {2015},
month = {4}
}
