OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Evaluating operating system vulnerability to memory errors.

Abstract

Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically represents a small footprint of a compute node's physical memory, recent studies show more memory errors in this region of memory than the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.

Authors:
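Ferreira, Kurt Brian; Bridges, Patrick G. [1]; Pedretti, Kevin Thomas Tauke; Mueller, Frank [2]; Fiala, David [2]; Brightwell, Ronald Brian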
  1. (University of New Mexico)
  2. (North Carolina State University)
Publication Date:
May 2012
Research Org.:
Sandia National Laboratories
Sponsoring Org.:
USDOE
OSTI Identifier:
1044952
Report Number(s):
SAND2012-4060
TRN: US201214%%1049
DOE Contract Number:  
AC04-94AL85000
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ALGORITHMS; KERNELS; RELIABILITY; SANDIA NATIONAL LABORATORIES; TARGETS; VULNERABILITY

Citation Formats

Ferreira, Kurt Brian, Bridges, Patrick G., Pedretti, Kevin Thomas Tauke, Mueller, Frank, Fiala, David, and Brightwell, Ronald Brian. Evaluating operating system vulnerability to memory errors. United States: N.p., 2012. Web. doi:10.2172/1044952.
Ferreira, Kurt Brian, Bridges, Patrick G., Pedretti, Kevin Thomas Tauke, Mueller, Frank, Fiala, David, & Brightwell, Ronald Brian. Evaluating operating system vulnerability to memory errors. United States. doi:10.2172/1044952.
Ferreira, Kurt Brian, Bridges, Patrick G., Pedretti, Kevin Thomas Tauke, Mueller, Frank, Fiala, David, and Brightwell, Ronald Brian. 2012. "Evaluating operating system vulnerability to memory errors." United States. doi:10.2172/1044952. https://www.osti.gov/servlets/purl/1044952.
@techreport{osti_1044952,
title = {Evaluating operating system vulnerability to memory errors},
author = {Ferreira, Kurt Brian and Bridges, Patrick G. and Pedretti, Kevin Thomas Tauke and Mueller, Frank and Fiala, David and Brightwell, Ronald Brian},
abstractNote = {Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically represents a small footprint of a compute node's physical memory, recent studies show more memory errors in this region of memory than the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.},
doi = {10.2172/1044952},
institution = {Sandia National Laboratories},
number = {SAND2012-4060},
place = {United States},
year = {2012},
month = {5}
}
