Error containment for enabling local checkpoint and recovery
Abstract
Various embodiments include a parallel processing computer system that detects memory errors as a memory client loads data from memory and disables the memory client from storing data to memory, thereby reducing the likelihood that the memory error propagates to other memory clients. The memory client initiates a stall sequence, while other memory clients continue to execute instructions and the memory continues to service memory load and store operations. When a memory error is detected, a specific bit pattern is stored in conjunction with the data associated with the memory error. When the data is copied from one memory to another memory, the specific bit pattern is also copied, in order to identify the data as having a memory error.
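The containment behavior described in the abstract can be illustrated with a small software model. This is only a minimal sketch under stated assumptions, not the patented hardware design: the names `Word`, `Memory`, `Client`, `mark_error`, and `copy_word` are hypothetical, and a single boolean `poison` flag stands in for the "specific bit pattern", whose actual encoding the record does not give.

```python
from dataclasses import dataclass


@dataclass
class Word:
    value: int
    poison: bool = False  # assumption: one flag models the "specific bit pattern"


class Memory:
    """Toy memory that keeps a poison marker alongside each data word."""

    def __init__(self, size):
        self.cells = [Word(0) for _ in range(size)]

    def load(self, addr):
        return self.cells[addr]

    def store(self, addr, value):
        # Assumption: writing fresh data replaces the word and clears the marker.
        self.cells[addr] = Word(value)

    def mark_error(self, addr):
        # Simulates detection of a memory error at this address.
        self.cells[addr].poison = True


def copy_word(src, src_addr, dst, dst_addr):
    """Copy data between memories, carrying the poison marker along."""
    word = src.load(src_addr)
    dst.cells[dst_addr] = Word(word.value, word.poison)


class Client:
    """Toy memory client that is contained once it loads poisoned data."""

    def __init__(self, memory):
        self.memory = memory
        self.stores_enabled = True
        self.stalled = False

    def load(self, addr):
        word = self.memory.load(addr)
        if word.poison:
            # Error seen on load: block further stores and begin stalling.
            self.stores_enabled = False
            self.stalled = True
            return None
        return word.value

    def store(self, addr, value):
        if not self.stores_enabled:
            return False  # store suppressed, so the error cannot spread
        self.memory.store(addr, value)
        return True
```

In this model, a client that loads a marked word loses the ability to store, while a second `Client` attached to the same `Memory` keeps loading and storing normally, and `copy_word` preserves the marker in the destination memory.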
- Research Org.: Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); Nvidia Corporation, Santa Clara, CA (United States)
- Sponsoring Org.: USDOE
- OSTI Identifier: 2222086
- Patent Number(s): 11720440
- Application Number: 17/373,678
- Assignee: Nvidia Corporation (Santa Clara, CA)
- DOE Contract Number: AC52-07NA27344; B620719
- Resource Type: Patent
- Resource Relation: Patent File Date: 07/12/2021
- Country of Publication: United States
- Language: English
Citation Formats
Cherukuri, Naveen, Hukerikar, Saurabh, Racunas, Paul, Saxena, Nirmal Raj, Patrick, David Charles, Feng, Yiyang, Ghadge, Abhijeet, Heinrich, Steven James, Hendrickson, Adam, Hirota, Gentaro, Joginipally, Praveen, Kulkarni, Vaishali, Mills, Peter C., Navada, Sandeep, Patel, Manan, and Yin, Liang. Error containment for enabling local checkpoint and recovery. United States: N. p., 2023. Web.
Cherukuri, Naveen, Hukerikar, Saurabh, Racunas, Paul, Saxena, Nirmal Raj, Patrick, David Charles, Feng, Yiyang, Ghadge, Abhijeet, Heinrich, Steven James, Hendrickson, Adam, Hirota, Gentaro, Joginipally, Praveen, Kulkarni, Vaishali, Mills, Peter C., Navada, Sandeep, Patel, Manan, & Yin, Liang. Error containment for enabling local checkpoint and recovery. United States.
Cherukuri, Naveen, Hukerikar, Saurabh, Racunas, Paul, Saxena, Nirmal Raj, Patrick, David Charles, Feng, Yiyang, Ghadge, Abhijeet, Heinrich, Steven James, Hendrickson, Adam, Hirota, Gentaro, Joginipally, Praveen, Kulkarni, Vaishali, Mills, Peter C., Navada, Sandeep, Patel, Manan, and Yin, Liang. 2023. "Error containment for enabling local checkpoint and recovery". United States. https://www.osti.gov/servlets/purl/2222086.
@article{osti_2222086,
title = {Error containment for enabling local checkpoint and recovery},
author = {Cherukuri, Naveen and Hukerikar, Saurabh and Racunas, Paul and Saxena, Nirmal Raj and Patrick, David Charles and Feng, Yiyang and Ghadge, Abhijeet and Heinrich, Steven James and Hendrickson, Adam and Hirota, Gentaro and Joginipally, Praveen and Kulkarni, Vaishali and Mills, Peter C. and Navada, Sandeep and Patel, Manan and Yin, Liang},
abstractNote = {Various embodiments include a parallel processing computer system that detects memory errors as a memory client loads data from memory and disables the memory client from storing data to memory, thereby reducing the likelihood that the memory error propagates to other memory clients. The memory client initiates a stall sequence, while other memory clients continue to execute instructions and the memory continues to service memory load and store operations. When a memory error is detected, a specific bit pattern is stored in conjunction with the data associated with the memory error. When the data is copied from one memory to another memory, the specific bit pattern is also copied, in order to identify the data as having a memory error.},
place = {United States},
year = {2023},
month = {8}
}
Works referenced in this record:
Self-healing computer system storage
patent, March 2003
- Frey, Alexander
- US Patent Document 6,530,036
System and method for scrubbing errors in very large memories
patent, January 2005
- Rodeheffer, Thomas L.; Oertli, Erwin
- US Patent Document 6,848,063
Fault-tolerant computer system with online recovery and reintegration of redundant components
patent, July 2001
- Jewett, Douglas E.; Bereiter, Tom; Vetter, Bryan
- US Patent Document 6,263,452
NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA
conference, May 2011
- Nukada, Akira; Takizawa, Hiroyuki; Matsuoka, Satoshi
- 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum
Apparatus and method for memory asynchronous atomic read-correct-write operation
patent, March 2010
- Resnick, David R.; Snyder, Van L.; Higgins, Michael Francis
- US Patent Document 7,676,728
No-execute processor feature global disabling prevention system and method
patent, May 2009
- Szor, Peter; Ferrie, Peter
- US Patent Document 7,540,026