Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms

Ashraf, Rizwan; Engelmann, Christian

doi:10.1007/978-3-030-10549-5_63

Conference · Sat Dec 01 04:00:00 EST 2018

DOI: https://doi.org/10.1007/978-3-030-10549-5_63 · OSTI ID: 1493144

ORNL

In this paper, we address the design challenge of building multiresilient iterative high-performance computing (HPC) applications. Multiresilience in HPC applications is the ability to tolerate both soft errors and process failures while maintaining forward progress. We address the challenge by proposing performance models that support the design of performance-efficient, resilient iterative applications. The models consider the interaction between soft error and process failure resilience solutions. We experimented with a linear solver application using two distinct kinds of soft error detectors: one with high overhead and high accuracy, the other with low overhead and low accuracy. We show how both can be leveraged to verify the integrity of the checkpointed state used to recover from soft errors and process failures. Our results show the performance efficiency and resiliency benefit of running the low-overhead detector at high frequency within the checkpoint interval, so that soft error recovery happens promptly and less work has to be recomputed.
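The interplay described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the Jacobi solver, the magnitude-bound "cheap" detector, the residual-based "costly" detector, the in-memory checkpoint, and all names and parameters (`solve`, `ckpt_every`, `detect_every`, `faults`) are assumptions chosen only to show how detection frequency within a checkpoint interval affects the amount of recomputed work.

```python
import math

def jacobi_step(A, b, x):
    """One Jacobi sweep: x_new[i] = (b[i] - sum_{j != i} A[i][j] * x[j]) / A[i][i]."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def residual_norm(A, b, x):
    """Infinity norm of the residual b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))

def cheap_check(x, bound=1e6):
    """Hypothetical low-overhead, low-accuracy detector:
    flags only non-finite or implausibly large entries."""
    return all(math.isfinite(v) and abs(v) <= bound for v in x)

def solve(A, b, iters, ckpt_every, detect_every, faults=None):
    """Run `iters` Jacobi sweeps with rollback recovery.

    faults maps an iteration number to (index, corrupted value); each
    fault fires once, modelling a transient soft error.
    Returns (solution, number of iterations discarded by rollbacks).
    """
    faults = dict(faults or {})
    x = [0.0] * len(b)
    # Checkpoint = (iteration, state copy, residual of that state).
    ckpt = (0, list(x), residual_norm(A, b, x))
    k, wasted = 0, 0
    while k < iters:
        x = jacobi_step(A, b, x)
        k += 1
        if k in faults:                      # inject a transient corruption
            idx, val = faults.pop(k)
            x[idx] = val
        ok = True
        if k % detect_every == 0:            # frequent, cheap detection
            ok = cheap_check(x)
        if ok and k % ckpt_every == 0:       # costly verification before checkpointing
            res = residual_norm(A, b, x)
            # Assumed criterion: the residual must not exceed the one stored
            # with the last verified checkpoint (small slack for round-off).
            ok = res <= ckpt[2] + 1e-12
            if ok:
                ckpt = (k, list(x), res)
        if not ok:                           # roll back to the last verified state
            wasted += k - ckpt[0]
            k, x = ckpt[0], list(ckpt[1])
    return x, wasted

# Small diagonally dominant test system with known solution [1, 2, 3, 4].
A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
b = [2.0, 4.0, 6.0, 13.0]

fault = {3: (0, 1e30)}  # large corruption at iteration 3
_, wasted_frequent = solve(A, b, 40, ckpt_every=10, detect_every=2, faults=fault)
_, wasted_rare = solve(A, b, 40, ckpt_every=10, detect_every=10, faults=fault)
print("discarded iterations (frequent detection):", wasted_frequent)
print("discarded iterations (detection only at checkpoints):", wasted_rare)
```

In this toy setting, running the cheap detector more often within the checkpoint interval means a detectable corruption is caught sooner, so fewer iterations are discarded on rollback. A corruption small enough to pass the cheap check is only caught by the costly residual check at the next checkpoint, which is why the sketch verifies state before saving it.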

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1493144

Country of Publication: United States

Language: English
