Localized Fault Recovery for Nested Fork-Join Programs
Nested fork-join programs scheduled using work stealing can automatically balance load and adapt to changes in the execution environment. In this paper, we design an approach to efficiently recover from faults encountered by these programs. Specifically, we focus on localized recovery of the task space in the presence of fail-stop failures. We present an approach to efficiently track, under work stealing, the relationships between the work executed by various threads. This information is used to identify and schedule the tasks to be re-executed without interfering with normal task execution. The algorithm precisely computes the work lost, incurs minimal re-execution overhead, and can recover from an arbitrary number of failures. Experimental evaluation demonstrates low overheads in the absence of failures, recovery overheads on the same order as the lost work, and much lower recovery costs than alternative strategies.
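The abstract does not spell out the tracking data structure, so the following is only a minimal illustrative sketch and not the paper's algorithm. It assumes that each steal is logged as a (victim, thief) pair keyed by a task identifier, and that a stolen task counts as lost if its thief fails before the result is returned. The names `StealTracker`, `record_steal`, `record_completion`, and `tasks_to_reexecute` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StealTracker:
    """Hypothetical log of steals whose results have not yet been returned."""

    # task id -> (victim worker, thief worker) for each outstanding steal
    pending: dict = field(default_factory=dict)

    def record_steal(self, task, victim, thief):
        # Remember which worker took the task and from whom.
        self.pending[task] = (victim, thief)

    def record_completion(self, task):
        # The result reached the victim, so the task can no longer be lost.
        self.pending.pop(task, None)

    def tasks_to_reexecute(self, failed):
        # Only tasks held by the failed worker are selected; work on
        # surviving workers is left untouched (localized recovery).
        return sorted(
            task
            for task, (_victim, thief) in self.pending.items()
            if thief == failed
        )


tracker = StealTracker()
tracker.record_steal("t1", victim=0, thief=1)
tracker.record_steal("t2", victim=0, thief=2)
tracker.record_steal("t3", victim=1, thief=2)
tracker.record_completion("t2")

lost_on_worker_2 = tracker.tasks_to_reexecute(failed=2)
print(lost_on_worker_2)
```

In this simplified model, a failure of worker 2 selects only `t3` for re-execution, because `t2` had already returned its result. Handling tasks stolen from the failed worker, or nested steals, would need additional bookkeeping that this sketch leaves out.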
- Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 1379446
- Report Number(s): PNNL-SA-123481; KJ0402000
- Resource Relation: Conference: Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS 2017), May 29-June 2, 2017, Orlando, Florida, 397-408
- Country of Publication: United States
- Language: English
Similar Records
Optimizing Data Locality for Fork/Join Programs Using Constrained Work Stealing
Fault-tolerant dynamic task graph scheduling