OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Design for a Soft Error Resilient Dynamic Task-Based Runtime, In: 2015 IEEE International Parallel and Distributed Processing Symposium

Abstract

As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.

Authors:
Cao, Chongxiao; Herault, Thomas; Bosilca, George; Dongarra, Jack
Publication Date:
2015-05-01
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1567397
Resource Type:
Conference
Journal Name:
2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS)
Additional Journal Information:
Conference: International Parallel and Distributed Processing Symposium, Hyderabad, India, May 25-29, 2015
Country of Publication:
United States
Language:
English
Subject:
Computer Science; Engineering

Citation Formats

MLA:
Cao, Chongxiao, Herault, Thomas, Bosilca, George, and Dongarra, Jack. Design for a Soft Error Resilient Dynamic Task-Based Runtime, In: 2015 IEEE International Parallel and Distributed Processing Symposium. United States: N. p., 2015. Web. doi:10.1109/IPDPS.2015.81.
APA:
Cao, Chongxiao, Herault, Thomas, Bosilca, George, & Dongarra, Jack. Design for a Soft Error Resilient Dynamic Task-Based Runtime, In: 2015 IEEE International Parallel and Distributed Processing Symposium. United States. https://doi.org/10.1109/IPDPS.2015.81
Chicago:
Cao, Chongxiao, Herault, Thomas, Bosilca, George, and Dongarra, Jack. 2015. "Design for a Soft Error Resilient Dynamic Task-Based Runtime, In: 2015 IEEE International Parallel and Distributed Processing Symposium". United States. https://doi.org/10.1109/IPDPS.2015.81.
BibTeX:
@article{osti_1567397,
title = {Design for a Soft Error Resilient Dynamic Task-Based Runtime, In: 2015 IEEE International Parallel and Distributed Processing Symposium},
author = {Cao, Chongxiao and Herault, Thomas and Bosilca, George and Dongarra, Jack},
abstractNote = {As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.},
doi = {10.1109/IPDPS.2015.81},
url = {https://www.osti.gov/biblio/1567397},
journal = {2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS)},
place = {United States},
year = {2015},
month = {May}
}

