Fault tolerance in an inner-outer solver: A GVR-enabled case study
Abstract
Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, which account for much of the run time of many scientific applications. We show that single bit-flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several fault-tolerance strategies for both the inner and outer solvers that are appropriate across a range of error rates. We implement them by extending the Trilinos solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable, rich error checking and recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.
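To make the abstract's notion of a single bit-flip error concrete, the following minimal sketch shows how inverting one bit of a double-precision value can change it negligibly or catastrophically, depending on the bit position. It is an illustration only, not the fault-injection method used in the paper; the helper name `flip_bit` is hypothetical and the values are assumed to be IEEE 754 binary64.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding inverted.

    Hypothetical helper for illustration; bit 0 is the least significant
    mantissa bit and bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


low = flip_bit(1.0, 0)    # lowest mantissa bit: perturbation of 2**-52
sign = flip_bit(1.0, 63)  # sign bit: 1.0 becomes -1.0
high = flip_bit(1.0, 62)  # top exponent bit: 1.0 becomes infinity

print(low, sign, high)
```

A flip in a low mantissa bit is the kind of error an iterative method may absorb at the cost of extra iterations, while a flip in the exponent or sign can corrupt the iterate outright, which is consistent with the range of outcomes the abstract describes.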
- Authors:
- Zhang, Ziming; Chien, Andrew A.; Teranishi, Keita
- Univ. of Chicago, Chicago, IL (United States)
- Sandia National Lab. (SNL-CA), Livermore, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Publication Date:
- April 2015
- Research Org.:
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Org.:
- USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1237365
- Report Number(s):
- SAND-2015-0174J; Journal ID: ISSN 0302-9743; 562108
- Grant/Contract Number:
- AC04-94AL85000
- Resource Type:
- Journal Article: Accepted Manuscript
- Journal Name:
- Lecture Notes in Computer Science
- Additional Journal Information:
- Journal Volume: 8969; Journal ID: ISSN 0302-9743
- Publisher:
- Springer
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; resilience; numerical solver; high performance computing
Citation Formats
Zhang, Ziming, Chien, Andrew A., and Teranishi, Keita. Fault tolerance in an inner-outer solver: A GVR-enabled case study. United States: N. p., 2015. Web. doi:10.1007/978-3-319-17353-5_11.
Zhang, Ziming, Chien, Andrew A., & Teranishi, Keita. Fault tolerance in an inner-outer solver: A GVR-enabled case study. United States. https://doi.org/10.1007/978-3-319-17353-5_11
Zhang, Ziming, Chien, Andrew A., and Teranishi, Keita. 2015. "Fault tolerance in an inner-outer solver: A GVR-enabled case study". United States. https://doi.org/10.1007/978-3-319-17353-5_11. https://www.osti.gov/servlets/purl/1237365.
@article{osti_1237365,
title = {Fault tolerance in an inner-outer solver: A GVR-enabled case study},
author = {Zhang, Ziming and Chien, Andrew A. and Teranishi, Keita},
abstractNote = {Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Lastly, experimental results validate correct execution with low performance overhead under varied error conditions.},
doi = {10.1007/978-3-319-17353-5_11},
url = {https://www.osti.gov/biblio/1237365},
journal = {Lecture Notes in Computer Science},
issn = {0302-9743},
volume = {8969},
place = {United States},
year = {2015},
month = {4}
}
Works referenced in this record:
The future of microprocessors
journal, May 2011
- Borkar, Shekhar; Chien, Andrew A.
- Communications of the ACM, Vol. 54, Issue 5
Soft error vulnerability of iterative linear algebra methods
conference, January 2008
- Bronevetsky, Greg; de Supinski, Bronis
- Proceedings of the 22nd annual international conference on Supercomputing - ICS '08
Toward Exascale Resilience
journal, September 2009
- Cappello, Franck; Geist, Al; Gropp, Bill
- The International Journal of High Performance Computing Applications, Vol. 23, Issue 4
Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods
conference, January 2013
- Chen, Zizhong
- Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '13
High Performance Dense Linear System Solver with Soft Error Resilience
conference, September 2011
- Du, Peng; Luszczek, Piotr; Dongarra, Jack
- 2011 IEEE International Conference on Cluster Computing (CLUSTER)
Evaluating the Impact of SDC on the GMRES Iterative Solver
conference, May 2014
- Elliott, James; Hoemmen, Mark; Mueller, Frank
- 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
An overview of the Trilinos project
journal, September 2005
- Heroux, Michael A.; Phipps, Eric T.; Salinger, Andrew G.
- ACM Transactions on Mathematical Software, Vol. 31, Issue 3
Algorithm-Based Fault Tolerance for Matrix Operations
journal, June 1984
- Huang, Kuang-Hua; Abraham, Jacob A.
- IEEE Transactions on Computers, Vol. C-33, Issue 6
ROSE::FTTransform - A source-to-source translation framework for exascale fault-tolerance research
conference, June 2012
- Lidman, Jacob; Quinlan, Daniel J.; Liao, Chunhua
- 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W)
Fault tolerant preconditioned conjugate gradient for sparse linear system solution
conference, January 2012
- Shantharam, Manu; Srinivasmurthy, Sowmyalatha; Raghavan, Padma
- Proceedings of the 26th ACM international conference on Supercomputing - ICS '12
Works referencing / citing this record:
End-to-End Resilience for HPC Applications
book, May 2019
- Rezaei, Arash; Khetawat, Harsh; Patil, Onkar
- High Performance Computing: 34th International Conference, ISC High Performance 2019, Frankfurt/Main, Germany, June 16–20, 2019, Proceedings, p. 271-290