OSTI.GOV: U.S. Department of Energy, Office of Scientific and Technical Information

Title: Localized Fault Recovery for Nested Fork-Join Programs

Abstract

Nested fork-join programs scheduled using work stealing can automatically balance load and adapt to changes in the execution environment. In this paper, we design an approach to efficiently recover from faults encountered by these programs. Specifically, we focus on localized recovery of the task space in the presence of fail-stop failures. We present an approach to efficiently track, under work stealing, the relationships between the work executed by various threads. This information is used to identify and schedule the tasks to be re-executed without interfering with normal task execution. The algorithm precisely computes the work lost, incurs minimal re-execution overhead, and can recover from an arbitrary number of failures. Experimental evaluation demonstrates low overheads in the absence of failures, recovery overheads on the same order as the lost work, and much lower recovery costs than alternative strategies.

Authors:
Kestor, Gokcen; Krishnamoorthy, Sriram; Ma, Wenjing
Publication Date:
2017-07-03
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1379446
Report Number(s):
PNNL-SA-123481
KJ0402000
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS 2017), May 29-June 2, 2017, Orlando, Florida, pp. 397-408
Country of Publication:
United States
Language:
English

Citation Formats

Kestor, Gokcen, Krishnamoorthy, Sriram, and Ma, Wenjing. Localized Fault Recovery for Nested Fork-Join Programs. United States: N. p., 2017. Web. doi:10.1109/IPDPS.2017.75.
Kestor, Gokcen, Krishnamoorthy, Sriram, & Ma, Wenjing. (2017). Localized Fault Recovery for Nested Fork-Join Programs. United States. doi:10.1109/IPDPS.2017.75.
Kestor, Gokcen, Krishnamoorthy, Sriram, and Ma, Wenjing. 2017. "Localized Fault Recovery for Nested Fork-Join Programs". United States. doi:10.1109/IPDPS.2017.75.
@article{osti_1379446,
title = {Localized Fault Recovery for Nested Fork-Join Programs},
author = {Kestor, Gokcen and Krishnamoorthy, Sriram and Ma, Wenjing},
abstractNote = {Nested fork-join programs scheduled using work stealing can automatically balance load and adapt to changes in the execution environment. In this paper, we design an approach to efficiently recover from faults encountered by these programs. Specifically, we focus on localized recovery of the task space in the presence of fail-stop failures. We present an approach to efficiently track, under work stealing, the relationships between the work executed by various threads. This information is used to identify and schedule the tasks to be re-executed without interfering with normal task execution. The algorithm precisely computes the work lost, incurs minimal re-execution overhead, and can recover from an arbitrary number of failures. Experimental evaluation demonstrates low overheads in the absence of failures, recovery overheads on the same order as the lost work, and much lower recovery costs than alternative strategies.},
doi = {10.1109/IPDPS.2017.75},
place = {United States},
year = {2017},
month = {7}
}

