U.S. Department of Energy
Office of Scientific and Technical Information

Techniques for recovering from errors when executing software applications on parallel processors

Patent · OSTI ID: 2541696
In various embodiments, a software program uses hardware features of a parallel processor to checkpoint a context associated with an execution of a software application on the parallel processor. The software program uses a preemption feature of the parallel processor to cause the parallel processor to stop executing instructions in accordance with the context. The software program then causes the parallel processor to collect state data associated with the context. After generating a checkpoint based on the state data, the software program causes the parallel processor to resume executing instructions in accordance with the context.
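The sequence described in the abstract (preempt, collect state, generate a checkpoint, resume) can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: `SimContext`, `SimProcessor`, `checkpoint_context`, and every method name below are hypothetical stand-ins invented for illustration and do not correspond to any NVIDIA hardware interface or to the patented implementation. The `restore` step is likewise an assumption, added to show how a checkpoint might be used for the error recovery named in the title.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SimContext:
    """Hypothetical stand-in for the context of an application on a parallel processor."""
    program_counter: int = 0
    registers: list = field(default_factory=lambda: [0, 0, 0, 0])
    running: bool = True


class SimProcessor:
    """Toy model of a parallel processor; not a real GPU API."""

    def __init__(self):
        self.ctx = SimContext()

    def step(self):
        """Execute one simulated instruction if the context is not preempted."""
        if not self.ctx.running:
            return
        slot = self.ctx.program_counter % len(self.ctx.registers)
        self.ctx.registers[slot] += self.ctx.program_counter
        self.ctx.program_counter += 1

    def preempt(self):
        """Stop executing instructions for the context (models the preemption feature)."""
        self.ctx.running = False

    def collect_state(self):
        """Return the state data associated with the context."""
        return {
            "program_counter": self.ctx.program_counter,
            "registers": list(self.ctx.registers),
        }

    def resume(self):
        """Resume executing instructions for the context."""
        self.ctx.running = True

    def restore(self, checkpoint):
        """Assumed recovery step: overwrite the context state from a checkpoint."""
        self.ctx.program_counter = checkpoint["program_counter"]
        self.ctx.registers = list(checkpoint["registers"])


def checkpoint_context(proc):
    """Preempt, collect state, build a checkpoint, then resume, in that order."""
    proc.preempt()
    try:
        state = proc.collect_state()
        checkpoint = copy.deepcopy(state)  # the checkpoint is an independent copy
    finally:
        proc.resume()  # always let the context continue, even if collection fails
    return checkpoint


if __name__ == "__main__":
    proc = SimProcessor()
    for _ in range(5):
        proc.step()
    saved = checkpoint_context(proc)
    for _ in range(3):
        proc.step()
    proc.restore(saved)  # roll back to the checkpointed state
    assert proc.collect_state() == saved
```

In this toy model the checkpoint is a deep copy taken while the context is stopped, so later execution cannot alter it. Resuming inside `finally` reflects the abstract's final step, where execution continues in accordance with the same context after the checkpoint is generated.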
Research Organization: Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); NVIDIA Corporation, Santa Clara, CA (United States)
Sponsoring Organization: USDOE
DOE Contract Number: AC52-07NA27344
Assignee: NVIDIA Corporation (Santa Clara, CA)
Patent Number(s): 11,874,742
Application Number: 17/237,376
Country of Publication: United States
Language: English
