Title: Data-driven Fault Tolerance for Work Stealing Computations

Abstract

Checkpoint-restart approaches to fault tolerance typically roll back all the processes to the previous checkpoint in the event of a failure. Work stealing is a promising technique to dynamically tolerate variations in the execution environment, including faults, system noise, and energy constraints. In this paper, we present fault tolerance mechanisms for task parallel computations, a popular computation idiom, employing work stealing. The computation is organized as a collection of tasks with data in a global address space. The completion of data operations, rather than the actual messages, is tracked to derive an idempotent data store. This information is used to accurately identify the tasks to be re-executed, and therefore to recompute only the lost data, in the presence of random work stealing. We consider three recovery schemes that present distinct trade-offs: lazy recovery with potentially increased re-execution cost, immediate collective recovery with associated synchronization overheads, and noncollective recovery enabled by additional communication. We employ distributed work stealing to dynamically rebalance the tasks on the live processes and evaluate the three schemes using candidate application benchmarks. We demonstrate that the overheads (space and time) of the fault tolerance mechanism are low, the cost incurred due to failures is small, and the overheads decrease with per-process work at scale.
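The core idea in the abstract, tracking which data writes completed so that only the tasks behind lost data are re-run, can be illustrated with a toy sketch. This is not the paper's implementation: the `DataStore` class, the block-to-process placement rule, and the one-task-per-block model are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataStore:
    """Toy stand-in for a global address space (hypothetical, not the paper's API)."""
    blocks: dict = field(default_factory=dict)  # block id -> value of a completed write
    owner: dict = field(default_factory=dict)   # block id -> process holding the block

    def put(self, block_id, value, alive):
        # Assumed placement rule: spread blocks over the live processes.
        live = sorted(alive)
        self.owner[block_id] = live[block_id % len(live)]
        # Re-writing the same value is harmless, so a put is idempotent.
        self.blocks[block_id] = value

    def fail(self, proc):
        # A failed process takes every block it owned with it.
        lost = sorted(b for b, p in self.owner.items() if p == proc)
        for b in lost:
            del self.blocks[b]
            del self.owner[b]
        return lost


def run(tasks, store, alive):
    # Assumed task model: task t writes t * t into block t.
    for t in tasks:
        store.put(t, t * t, alive)


def tasks_to_reexecute(all_tasks, store):
    # Only tasks whose output has no completed write need to run again.
    return [t for t in all_tasks if t not in store.blocks]


all_tasks = list(range(8))
alive = {0, 1, 2, 3}
store = DataStore()
run(all_tasks, store, alive)

lost_blocks = store.fail(2)      # process 2 fails
alive.discard(2)
redo = tasks_to_reexecute(all_tasks, store)
run(redo, store, alive)          # recompute only the lost data on live processes
```

In this sketch, a failure costs only the re-execution of the tasks whose outputs lived on the failed process, rather than a rollback of every process. The three recovery schemes named in the abstract differ in when and how such a recovery step is triggered, which the sketch does not model.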

Authors:
Ma, Wenjing; Krishnamoorthy, Sriram
Publication Date:
June 2012
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1239507
Report Number(s):
PNNL-SA-86484
KJ0402000
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: ICS 2012: Proceedings of the 26th ACM International Conference on Supercomputing, June 25-29, 2012, Venice, Italy, 79-90
Country of Publication:
United States
Language:
English
Subject:
fault tolerance, task parallelism, load balancing, work stealing

Citation Formats

MLA: Ma, Wenjing, and Krishnamoorthy, Sriram. Data-driven Fault Tolerance for Work Stealing Computations. United States: N. p., 2012. Web. doi:10.1145/2304576.2304589.
APA: Ma, Wenjing, & Krishnamoorthy, Sriram. (2012). Data-driven Fault Tolerance for Work Stealing Computations. United States. doi:10.1145/2304576.2304589.
Chicago: Ma, Wenjing, and Krishnamoorthy, Sriram. 2012. "Data-driven Fault Tolerance for Work Stealing Computations". United States. doi:10.1145/2304576.2304589.
@article{osti_1239507,
title = {Data-driven Fault Tolerance for Work Stealing Computations},
author = {Ma, Wenjing and Krishnamoorthy, Sriram},
doi = {10.1145/2304576.2304589},
booktitle = {ICS 2012: Proceedings of the 26th ACM International Conference on Supercomputing},
pages = {79--90},
place = {United States},
year = {2012},
month = {6}
}
