OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms

Abstract

In this paper, we address the design challenge of building multiresilient iterative high-performance computing (HPC) applications. Multiresilience in HPC applications is the ability to tolerate and maintain forward progress in the presence of both soft errors and process failures. We address the challenge by proposing performance models which are useful to design performance efficient and resilient iterative applications. The models consider the interaction between soft error and process failure resilience solutions. We experimented with a linear solver application with two distinct kinds of soft error detectors: one detector has high overhead and high accuracy, whereas the second has low overhead and low accuracy. We show how both can be leveraged for verifying the integrity of checkpointed state used to recover from both soft errors and process failures. Our results show the performance efficiency and resiliency benefit of employing the low overhead detector with high frequency within the checkpoint interval, so that timely soft error recovery can take place, resulting in less re-computed work.

Authors:
Ashraf, Rizwan A. [1]; Engelmann, Christian [1]
  1. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1493144
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Journal Volume: 11339; Conference: 11th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids - Turin, Italy - 8/27/2018-8/31/2018
Country of Publication:
United States
Language:
English

Citation Formats

Ashraf, Rizwan A., and Engelmann, Christian. Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms. United States: N. p., 2018. Web. doi:10.1007/978-3-030-10549-5_63.
Ashraf, Rizwan A., & Engelmann, Christian. Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms. United States. doi:10.1007/978-3-030-10549-5_63.
Ashraf, Rizwan A., and Engelmann, Christian. Sat . "Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms". United States. doi:10.1007/978-3-030-10549-5_63. https://www.osti.gov/servlets/purl/1493144.
@article{osti_1493144,
title = {Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms},
author = {Ashraf, Rizwan A. and Engelmann, Christian},
abstractNote = {In this paper, we address the design challenge of building multiresilient iterative high-performance computing (HPC) applications. Multiresilience in HPC applications is the ability to tolerate and maintain forward progress in the presence of both soft errors and process failures. We address the challenge by proposing performance models which are useful to design performance efficient and resilient iterative applications. The models consider the interaction between soft error and process failure resilience solutions. We experimented with a linear solver application with two distinct kinds of soft error detectors: one detector has high overhead and high accuracy, whereas the second has low overhead and low accuracy. We show how both can be leveraged for verifying the integrity of checkpointed state used to recover from both soft errors and process failures. Our results show the performance efficiency and resiliency benefit of employing the low overhead detector with high frequency within the checkpoint interval, so that timely soft error recovery can take place, resulting in less re-computed work.},
doi = {10.1007/978-3-030-10549-5_63},
journal = {},
issn = {0302-9743},
number = {},
volume = 11339,
place = {United States},
year = {2018},
month = {12}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
