DOE Patents, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Error containment for enabling local checkpoint and recovery

Abstract

Various embodiments include a parallel processing computer system that detects memory errors as a memory client loads data from memory and disables the memory client from storing data to memory, thereby reducing the likelihood that the memory error propagates to other memory clients. The memory client initiates a stall sequence, while other memory clients continue to execute instructions and the memory continues to service memory load and store operations. When a memory error is detected, a specific bit pattern is stored in conjunction with the data associated with the memory error. When the data is copied from one memory to another memory, the specific bit pattern is also copied, in order to identify the data as having a memory error.
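The containment behavior described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware design: the names `Word`, `Memory`, `Client`, and `POISON`, the marker value, and the choice to drop blocked stores silently are all hypothetical and do not come from the record.

```python
from dataclasses import dataclass

# Hypothetical marker value; the record does not specify the bit pattern.
POISON = 0xDEAD


@dataclass
class Word:
    value: int
    tag: int = 0  # side-band bits stored in conjunction with the data

    @property
    def poisoned(self) -> bool:
        return self.tag == POISON


class Memory:
    def __init__(self, size: int) -> None:
        self.cells = [Word(0) for _ in range(size)]

    def inject_error(self, addr: int) -> None:
        """Model a detected memory error by tagging the affected word."""
        self.cells[addr].tag = POISON

    def copy_to(self, other: "Memory", src: int, dst: int) -> None:
        """Copy a word to another memory; the tag travels with the data."""
        word = self.cells[src]
        other.cells[dst] = Word(word.value, word.tag)


class Client:
    def __init__(self, name: str) -> None:
        self.name = name
        self.stores_enabled = True
        self.stalled = False

    def load(self, mem: Memory, addr: int):
        word = mem.cells[addr]
        if word.poisoned:
            # Containment: only this client loses store access and stalls.
            self.stores_enabled = False
            self.stalled = True
            return None
        return word.value

    def store(self, mem: Memory, addr: int, value: int) -> bool:
        if not self.stores_enabled:
            return False  # assumption: blocked stores are simply dropped
        # Assumption: a fresh store writes an untagged word.
        mem.cells[addr] = Word(value)
        return True


if __name__ == "__main__":
    mem_a, mem_b = Memory(4), Memory(4)
    healthy, affected = Client("c0"), Client("c1")
    healthy.store(mem_a, 1, 9)
    mem_a.inject_error(1)
    mem_a.copy_to(mem_b, 1, 2)               # the tag is copied with the data
    print(affected.load(mem_b, 2))           # load sees the tag, client stalls
    print(affected.store(mem_a, 3, 99))      # store from stalled client is blocked
    print(healthy.store(mem_a, 3, 5))        # other clients keep running
```

In this model the error marker lives next to the data rather than in a separate table, so a plain copy is enough to carry the error status into the second memory, and a client that later loads the copied word is contained in the same way.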

Inventors:
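Cherukuri, Naveen; Hukerikar, Saurabh; Racunas, Paul; Saxena, Nirmal Raj; Patrick, David Charles; Feng, Yiyang; Ghadge, Abhijeet; Heinrich, Steven James; Hendrickson, Adam; Hirota, Gentaro; Joginipally, Praveen; Kulkarni, Vaishali; Mills, Peter C.; Navada, Sandeep; Patel, Manan; Yin, Liang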
Issue Date:
Research Org.:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); Nvidia Corporation, Santa Clara, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
2222086
Patent Number(s):
11720440
Application Number:
17/373,678
Assignee:
Nvidia Corporation (Santa Clara, CA)
DOE Contract Number:  
AC52-07NA27344; B620719
Resource Type:
Patent
Resource Relation:
Patent File Date: 07/12/2021
Country of Publication:
United States
Language:
English

Citation Formats

Cherukuri, Naveen, Hukerikar, Saurabh, Racunas, Paul, Saxena, Nirmal Raj, Patrick, David Charles, Feng, Yiyang, Ghadge, Abhijeet, Heinrich, Steven James, Hendrickson, Adam, Hirota, Gentaro, Joginipally, Praveen, Kulkarni, Vaishali, Mills, Peter C., Navada, Sandeep, Patel, Manan, and Yin, Liang. Error containment for enabling local checkpoint and recovery. United States: N. p., 2023. Web.
Cherukuri, Naveen, Hukerikar, Saurabh, Racunas, Paul, Saxena, Nirmal Raj, Patrick, David Charles, Feng, Yiyang, Ghadge, Abhijeet, Heinrich, Steven James, Hendrickson, Adam, Hirota, Gentaro, Joginipally, Praveen, Kulkarni, Vaishali, Mills, Peter C., Navada, Sandeep, Patel, Manan, & Yin, Liang. Error containment for enabling local checkpoint and recovery. United States.
Cherukuri, Naveen, Hukerikar, Saurabh, Racunas, Paul, Saxena, Nirmal Raj, Patrick, David Charles, Feng, Yiyang, Ghadge, Abhijeet, Heinrich, Steven James, Hendrickson, Adam, Hirota, Gentaro, Joginipally, Praveen, Kulkarni, Vaishali, Mills, Peter C., Navada, Sandeep, Patel, Manan, and Yin, Liang. 2023. "Error containment for enabling local checkpoint and recovery". United States. https://www.osti.gov/servlets/purl/2222086.
@article{osti_2222086,
title = {Error containment for enabling local checkpoint and recovery},
author = {Cherukuri, Naveen and Hukerikar, Saurabh and Racunas, Paul and Saxena, Nirmal Raj and Patrick, David Charles and Feng, Yiyang and Ghadge, Abhijeet and Heinrich, Steven James and Hendrickson, Adam and Hirota, Gentaro and Joginipally, Praveen and Kulkarni, Vaishali and Mills, Peter C. and Navada, Sandeep and Patel, Manan and Yin, Liang},
abstractNote = {Various embodiments include a parallel processing computer system that detects memory errors as a memory client loads data from memory and disables the memory client from storing data to memory, thereby reducing the likelihood that the memory error propagates to other memory clients. The memory client initiates a stall sequence, while other memory clients continue to execute instructions and the memory continues to service memory load and store operations. When a memory error is detected, a specific bit pattern is stored in conjunction with the data associated with the memory error. When the data is copied from one memory to another memory, the specific bit pattern is also copied, in order to identify the data as having a memory error.},
url = {https://www.osti.gov/servlets/purl/2222086},
place = {United States},
year = {2023},
month = {8}
}

Works referenced in this record:

Self-healing computer system storage
patent, March 2003


System and method for scrubbing errors in very large memories
patent, January 2005


NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA
conference, May 2011

  • Nukada, Akira; Takizawa, Hiroyuki; Matsuoka, Satoshi
  • 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum
  • https://doi.org/10.1109/IPDPS.2011.131

Apparatus and method for memory asynchronous atomic read-correct-write operation
patent, March 2010