Design for a Soft Error Resilient Dynamic Task-Based Runtime, In: 2015 IEEE International Parallel and Distributed Processing Symposium
Conference · 2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS)
As the scale of modern computing systems grows, failures will happen more frequently. On the way to exascale, a generic, low-overhead resilience extension becomes a desirable feature of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime that together form a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG; the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution; the last takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms.
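The first two mechanisms can be illustrated with a minimal sketch. It is not taken from the paper and does not use the PaRSEC API: the function name `tasks_to_reexecute`, the predecessor map `preds`, and the set `available` (tasks whose outputs are assumed to be still intact, for example because they were checkpointed) are all hypothetical. The sketch walks backward from the task whose data was found corrupted and stops at any intact output, which gives one plausible reading of "minimum required sub-DAG".

```python
from typing import Dict, List, Set


def tasks_to_reexecute(failed: str,
                       preds: Dict[str, List[str]],
                       available: Set[str]) -> List[str]:
    """Return the tasks to re-run, in dependency order, to regenerate
    the output of `failed`.

    preds     -- hypothetical map: task name -> names of its predecessors
    available -- tasks whose outputs are assumed intact (e.g. checkpointed)
    """
    order: List[str] = []
    visited: Set[str] = set()

    def visit(task: str) -> None:
        if task in visited:
            return
        visited.add(task)
        for p in preds.get(task, []):
            # An intact predecessor output cuts the backward walk short.
            if p not in available:
                visit(p)
        # Append after all needed predecessors: topological order.
        order.append(task)

    visit(failed)
    return order


# Hypothetical diamond-shaped DAG: a -> b, a -> c, (b, c) -> d
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

# No checkpoints: everything upstream of d must be re-executed.
print(tasks_to_reexecute("d", preds, set()))        # ['a', 'b', 'c', 'd']

# Outputs of b and c checkpointed: only d is re-executed.
print(tasks_to_reexecute("d", preds, {"b", "c"}))   # ['d']
```

In this toy model, adding a checkpoint shrinks the set of tasks to replay, which is the trade-off the abstract describes between checkpoint cost and re-execution cost.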
- Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
- Sponsoring Organization: USDOE Office of Science (SC)
- OSTI ID: 1567397
- Conference Information: Journal Name: 2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS)
- Country of Publication: United States
- Language: English
Similar Records
Clover: Compiler directed lightweight soft error resilience
Journal Article · Thu Apr 30 20:00:00 EDT 2015 · SIGPLAN · OSTI ID:1261518
Automatic Halo Management for the Uintah GPU-Heterogeneous Asynchronous Many-Task Runtime
Journal Article · Thu Dec 06 23:00:00 EST 2018 · International Journal of Parallel Programming · OSTI ID:1567537
Machine Learning Based Online Performance Prediction for Runtime Parallelization and Task Scheduling
Conference · Thu Oct 09 00:00:00 EDT 2008 · OSTI ID:951680