
Title: Data-driven Fault Tolerance for Work Stealing Computations

Conference

Checkpoint-restart approaches to fault tolerance typically roll back all the processes to the previous checkpoint in the event of a failure. Work stealing is a promising technique to dynamically tolerate variations in the execution environment, including faults, system noise, and energy constraints. In this paper, we present fault tolerance mechanisms for task parallel computations, a popular computation idiom, employing work stealing. The computation is organized as a collection of tasks with data in a global address space. The completion of data operations, rather than the actual messages, is tracked to derive an idempotent data store. This information is used to accurately identify the tasks to be re-executed, and therefore to recompute only the lost data, in the presence of random work stealing. We consider three recovery schemes that present distinct trade-offs: lazy recovery with potentially increased re-execution cost, immediate collective recovery with associated synchronization overheads, and noncollective recovery enabled by additional communication. We employ distributed work stealing to dynamically rebalance the tasks on the live processes and evaluate the three schemes using candidate application benchmarks. We demonstrate that the overheads (space and time) of the fault tolerance mechanism are low, the cost incurred due to failures is small, and the overheads decrease with per-process work at scale.
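
The idea of tracking completed data operations to decide which tasks to re-execute can be illustrated with a minimal sketch. This is not the paper's implementation: the class `IdempotentStore`, the function `tasks_to_reexecute`, the modulo ownership rule, and the integer keys are all hypothetical names and simplifications chosen for illustration, and the sketch runs in a single process rather than over a real global address space.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentStore:
    """Toy stand-in for a partitioned global address space (hypothetical)."""
    num_procs: int
    data: dict = field(default_factory=dict)     # key -> value
    completed: set = field(default_factory=set)  # (task_id, key) writes known complete

    def owner(self, key):
        # Assumption: integer keys are block-cyclically owned by processes.
        return key % self.num_procs

    def put(self, task_id, key, value):
        # A write already recorded as complete is skipped, so re-running
        # a task cannot apply the same update twice.
        if (task_id, key) in self.completed:
            return False
        self.data[key] = value
        self.completed.add((task_id, key))
        return True

    def fail(self, proc):
        # Simulate a failure: data owned by `proc` and the matching
        # completion records are lost.
        lost = {k for k in self.data if self.owner(k) == proc}
        for k in lost:
            del self.data[k]
        self.completed = {(t, k) for (t, k) in self.completed if k not in lost}
        return lost


def tasks_to_reexecute(task_outputs, store):
    """Return tasks with at least one output write not recorded as complete."""
    return sorted(
        t for t, keys in task_outputs.items()
        if any((t, k) not in store.completed for k in keys)
    )
```

In this sketch, if four tasks each write one key across two processes and one process fails, `tasks_to_reexecute` returns only the tasks whose outputs lived on the failed process; the others are left alone because their writes are still recorded as complete.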

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1239507
Report Number(s):
PNNL-SA-86484; KJ0402000
Resource Relation:
Conference: ICS 2012: Proceedings of the 26th ACM International Conference on Supercomputing, June 25-29, 2012, Venice, Italy, 79-90
Country of Publication:
United States
Language:
English