OSTI.GOV: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Fault-tolerant dynamic task graph scheduling

Abstract

In this paper, we present an approach to fault-tolerant execution of dynamic task graphs scheduled using work stealing. In particular, we focus on selective and localized recovery of tasks in the presence of soft faults. We elicit from the user the basic task graph structure in terms of successor and predecessor relationships. The work-stealing-based algorithm to schedule such a task graph is augmented to enable recovery when the data and metadata associated with a task get corrupted. We use this redundancy, and the knowledge of the task graph structure, to selectively recover from faults with low space and time overheads. We show that the fault-tolerant design retains the essential properties of the underlying work-stealing-based task scheduling algorithm, and that the fault-tolerant execution is asymptotically optimal when task re-execution is taken into account. Experimental evaluation demonstrates the low cost of recovery under various fault scenarios.
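
The idea of selective, localized recovery can be illustrated with a small sketch. This is not the paper's algorithm: it runs sequentially (no work stealing), it assumes corruption is detected by comparing a stored checksum against the data, and every name in it (`TaskGraph`, `result`, `corrupt`, `demo`) is hypothetical. It only shows how predecessor relationships let a scheduler re-execute the one task whose output was corrupted instead of restarting the whole graph.

```python
import hashlib
import pickle


def checksum(value):
    """Digest used to detect a corrupted result (an assumed detection mechanism)."""
    return hashlib.sha256(pickle.dumps(value)).hexdigest()


class TaskGraph:
    """Hypothetical task graph keyed by task name; not taken from the paper."""

    def __init__(self):
        self.preds = {}       # task name -> list of predecessor names
        self.funcs = {}       # task name -> function of predecessor outputs
        self.store = {}       # task name -> (output, checksum of output)
        self.executions = {}  # task name -> number of times the task ran

    def add_task(self, name, func, preds=()):
        self.preds[name] = list(preds)
        self.funcs[name] = func
        self.executions[name] = 0

    def _valid(self, name):
        entry = self.store.get(name)
        return entry is not None and checksum(entry[0]) == entry[1]

    def result(self, name):
        """Return a task's output, executing it only if missing or corrupted."""
        if not self._valid(name):
            inputs = [self.result(p) for p in self.preds[name]]
            value = self.funcs[name](*inputs)
            self.store[name] = (value, checksum(value))
            self.executions[name] += 1
        return self.store[name][0]

    def corrupt(self, name, bad_value):
        """Simulate a soft fault: overwrite the data but keep the old checksum."""
        _, digest = self.store[name]
        self.store[name] = (bad_value, digest)


def demo():
    g = TaskGraph()
    g.add_task("a", lambda: 2)
    g.add_task("b", lambda: 3)
    g.add_task("c", lambda a, b: a + b, preds=["a", "b"])
    g.add_task("d", lambda c: c * 10, preds=["c"])

    g.result("c")         # runs a, b, and c once each
    g.corrupt("c", 999)   # soft fault hits c's stored output
    value = g.result("d")  # d detects the bad input and re-runs only c
    return value, g.executions
```

In `demo`, task `c` is the only one that runs twice; `a` and `b` are reused because their stored outputs still match their checksums. A real design would also have to protect the checksums and graph metadata themselves, which this sketch does not attempt.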

Authors:
Kurt, Mehmet C.; Krishnamoorthy, Sriram; Agrawal, Kunal; Agrawal, Gagan
Publication Date:
2014-11-16
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1178510
Report Number(s):
PNNL-SA-103739
KJ0402000
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: International Conference for High Performance Computing, Networking, Storage and Analysis (SC14), November 16-21, 2014, New Orleans, Louisiana, pp. 719-730
Country of Publication:
United States
Language:
English
Subject:
task graph scheduling; fault tolerance

Citation Formats

Kurt, Mehmet C., Krishnamoorthy, Sriram, Agrawal, Kunal, and Agrawal, Gagan. Fault-tolerant dynamic task graph scheduling. United States: N. p., 2014. Web. doi:10.1109/SC.2014.64.
Kurt, Mehmet C., Krishnamoorthy, Sriram, Agrawal, Kunal, & Agrawal, Gagan. Fault-tolerant dynamic task graph scheduling. United States. https://doi.org/10.1109/SC.2014.64
Kurt, Mehmet C., Krishnamoorthy, Sriram, Agrawal, Kunal, and Agrawal, Gagan. 2014. "Fault-tolerant dynamic task graph scheduling". United States. https://doi.org/10.1109/SC.2014.64.
@article{osti_1178510,
title = {Fault-tolerant dynamic task graph scheduling},
author = {Kurt, Mehmet C. and Krishnamoorthy, Sriram and Agrawal, Kunal and Agrawal, Gagan},
abstractNote = {In this paper, we present an approach to fault tolerant execution of dynamic task graphs scheduled using work stealing. In particular, we focus on selective and localized recovery of tasks in the presence of soft faults. We elicit from the user the basic task graph structure in terms of successor and predecessor relationships. The work stealing-based algorithm to schedule such a task graph is augmented to enable recovery when the data and meta-data associated with a task get corrupted. We use this redundancy, and the knowledge of the task graph structure, to selectively recover from faults with low space and time overheads. We show that the fault tolerant design retains the essential properties of the underlying work stealing-based task scheduling algorithm, and that the fault tolerant execution is asymptotically optimal when task re-execution is taken into account. Experimental evaluation demonstrates the low cost of recovery under various fault scenarios.},
doi = {10.1109/SC.2014.64},
url = {https://www.osti.gov/biblio/1178510},
place = {United States},
year = {2014},
month = {11}
}

