Techniques for recovering from errors when executing software applications on parallel processors
Patent · OSTI ID:2541696
In various embodiments, a software program uses hardware features of a parallel processor to checkpoint a context associated with an execution of a software application on the parallel processor. The software program uses a preemption feature of the parallel processor to cause the parallel processor to stop executing instructions in accordance with the context. The software program then causes the parallel processor to collect state data associated with the context. After generating a checkpoint based on the state data, the software program causes the parallel processor to resume executing instructions in accordance with the context.
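The sequence in the abstract (preempt, collect state, generate a checkpoint, resume) can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: `Context`, `SimulatedParallelProcessor`, and `checkpoint` are hypothetical names invented for illustration, and nothing here reflects the patent's actual hardware features or any NVIDIA API.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical stand-in for an execution context on a parallel processor."""
    program_counter: int = 0
    registers: list = field(default_factory=lambda: [0, 0, 0, 0])
    running: bool = True


class SimulatedParallelProcessor:
    """Toy model only; every method name is an assumption, not a real API."""

    def __init__(self):
        self.context = Context()

    def step(self):
        # Execute one "instruction", but only while the context is not preempted.
        if not self.context.running:
            raise RuntimeError("context is preempted")
        self.context.registers[0] += 1
        self.context.program_counter += 1

    def preempt(self):
        # Stop executing instructions in accordance with the context.
        self.context.running = False

    def collect_state(self):
        # Gather the state data associated with the context.
        return {
            "program_counter": self.context.program_counter,
            "registers": self.context.registers,
        }

    def resume(self):
        # Resume executing instructions in accordance with the context.
        self.context.running = True


def checkpoint(processor):
    """Preempt, collect state, build a checkpoint, then resume."""
    processor.preempt()
    state = processor.collect_state()
    # Deep-copy so that later execution cannot mutate the checkpoint.
    snapshot = copy.deepcopy(state)
    processor.resume()
    return snapshot


if __name__ == "__main__":
    proc = SimulatedParallelProcessor()
    proc.step()
    proc.step()
    saved = checkpoint(proc)
    proc.step()  # execution continues after the checkpoint is taken
    print(saved, proc.context.program_counter)
```

In this toy model the checkpoint is taken only while the context is preempted, so the captured state cannot change mid-collection; the deep copy keeps the saved snapshot independent of execution that continues after the resume.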
- Research Organization: Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); NVIDIA Corporation, Santa Clara, CA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC52-07NA27344
- Assignee: NVIDIA Corporation (Santa Clara, CA)
- Patent Number(s): 11,874,742
- Application Number: 17/237,376
- OSTI ID: 2541696
- Country of Publication: United States
- Language: English
CheCUDA: A Checkpoint/Restart Tool for CUDA Applications · conference · December 2009
NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA · conference · May 2011
Similar Records
Control of multiple processors executing in parallel regions
Patent · Tue May 09 00:00:00 EDT 1989 · OSTI ID:5818574
Massively parallel processor
Book · Mon Dec 31 23:00:00 EST 1984 · OSTI ID:6935716
Optimistic execution and checkpoint comparison for error recovery in parallel and distributed systems
Technical Report · Fri May 08 00:00:00 EDT 1992 · OSTI ID:7026260