OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Improving Memory Error Handling Using Linux

Abstract

As supercomputers grow faster and more powerful, they will also contain more nodes. If nothing is done, the amount of memory in supercomputer clusters will soon grow so large that memory failures become unmanageable through manual replacement of memory DIMMs. "Improving Memory Error Handling Using Linux" is a process-oriented method for solving this problem: the Linux kernel is used to disable (offline) faulty memory pages containing bad addresses, preventing them from being handed to a process again. Offlining memory pages simplifies error handling and reduces both the hardware and the manpower costs of running Los Alamos National Laboratory (LANL) clusters. This process will be necessary for the development of exascale computers: without automated memory error handling, it will not be feasible to manually replace the number of DIMMs that will fail daily on a machine holding 32-128 petabytes of memory. Testing shows that offlining memory pages works and is relatively simple to use. As testing continues, the entire process will be automated within Zenoss, the high-performance computing (HPC) monitoring software used at LANL.
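The offlining mechanism the abstract describes is exposed by the Linux kernel through sysfs: writing the physical address of a faulty page to /sys/devices/system/memory/soft_offline_page (as root) asks the kernel to migrate the page's contents elsewhere and retire the page. A minimal sketch of the address arithmetic involved, assuming a 4 KiB page size and a hypothetical corrected-error report; this illustrates the kernel interface only and is not LANL's actual tooling:

```python
PAGE_SIZE = 4096  # typical x86-64 page size (assumed)

def page_base(phys_addr: int) -> int:
    """Round a faulting physical address down to its page boundary."""
    return phys_addr & ~(PAGE_SIZE - 1)

def offline_command(phys_addr: int) -> str:
    """Return the shell command that would soft-offline the containing page.

    Writing the page-aligned physical address to soft_offline_page (root
    only) asks the kernel to stop using that page without killing any
    process currently mapping it.
    """
    return (f"echo 0x{page_base(phys_addr):x} > "
            "/sys/devices/system/memory/soft_offline_page")

if __name__ == "__main__":
    # Example: a corrected-error report at an arbitrary physical address.
    addr = 0x12345678
    print(offline_command(addr))
```

A monitoring system such as Zenoss could emit a command like this once a page's corrected-error count crosses a threshold; the kernel's companion hard_offline_page file handles pages that have already produced uncorrectable errors.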

Authors:
 Carlton, Michael Andrew [1]; Blanchard, Sean P. [1]; Debardeleben, Nathan A. [1]
  1. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Publication Date:
2014-07-25
Research Org.:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1148313
Report Number(s):
LA-UR-14-25823
DOE Contract Number:
AC52-06NA25396
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; COMPUTER HARDWARE; COMPUTER SCIENCE

Citation Formats

Carlton, Michael Andrew, Blanchard, Sean P., and Debardeleben, Nathan A. Improving Memory Error Handling Using Linux. United States: N. p., 2014. Web. doi:10.2172/1148313.
Carlton, Michael Andrew, Blanchard, Sean P., & Debardeleben, Nathan A. Improving Memory Error Handling Using Linux. United States. doi:10.2172/1148313.
Carlton, Michael Andrew, Blanchard, Sean P., and Debardeleben, Nathan A. 2014. "Improving Memory Error Handling Using Linux". United States. doi:10.2172/1148313. https://www.osti.gov/servlets/purl/1148313.
@techreport{osti_1148313,
title = {Improving Memory Error Handling Using Linux},
author = {Carlton, Michael Andrew and Blanchard, Sean P. and Debardeleben, Nathan A.},
abstractNote = {As supercomputers grow faster and more powerful, they will also contain more nodes. If nothing is done, the amount of memory in supercomputer clusters will soon grow so large that memory failures become unmanageable through manual replacement of memory DIMMs. "Improving Memory Error Handling Using Linux" is a process-oriented method for solving this problem: the Linux kernel is used to disable (offline) faulty memory pages containing bad addresses, preventing them from being handed to a process again. Offlining memory pages simplifies error handling and reduces both the hardware and the manpower costs of running Los Alamos National Laboratory (LANL) clusters. This process will be necessary for the development of exascale computers: without automated memory error handling, it will not be feasible to manually replace the number of DIMMs that will fail daily on a machine holding 32-128 petabytes of memory. Testing shows that offlining memory pages works and is relatively simple to use. As testing continues, the entire process will be automated within Zenoss, the high-performance computing (HPC) monitoring software used at LANL.},
doi = {10.2172/1148313},
number = {LA-UR-14-25823},
institution = {Los Alamos National Lab. (LANL), Los Alamos, NM (United States)},
place = {United States},
year = {2014},
month = {7}
}

Similar Records:
  • We present experimental results for a coordinated scheduling implementation of the Linux operating system. Results were collected on an IBM Blue Gene/L machine at scales up to 16K nodes. Our results indicate coordinated scheduling was able to provide a dramatic improvement in scaling performance for two applications characterized as bulk synchronous parallel programs.
  • ERROR obtains a dump of a controllee or drop file or defines ORDERLIB- or system-detected errors. It is a controllee designed to be run within the ORDER system as part of the ORDER error procedure, or to be run as a specialized dump routine from a teletypewriter. As a stand-alone controllee, ERROR has great versatility. It has options to take full memory dumps, or to give a brief description of errors. ERROR is used to take memory dumps in either decimal or octal, to take partial memory dumps of either software- or hardware-detected errors. It can be run on CDC 7600 and 6600 computers, and uses about 15000 (octal) memory locations. The major change in this revision is the addition of symbolic dump capability. 1 table (RWR)
  • XERROR is a collection of portable FORTRAN routines which serves as a central facility for processing error messages associated with errors occurring in libraries of FORTRAN routines.
  • The MPI 2 spec contains error handling and notification mechanisms that have a number of limitations from the point of view of application fault tolerance: (1) The specification makes no demands on MPI to survive failures. Although MPI implementers are encouraged to 'circumscribe the impact of an error, so that normal processing can continue after an error handler was invoked', nothing more is specified in the standard. In particular, the defined MPI error classes are used only to clarify to the user the source of the error and do not describe the MPI functionality that is not available as a result of the error. (2) All errors must somehow be associated with some specific MPI call. As such, (A) It is difficult for MPI to notify users of failures in asynchronous calls, such as an MPI{_}Rsend call, which may return immediately after the message data is sent along the wire but before it is successfully delivered; (B) There is no provision for asynchronous error notification regarding errors that will affect future calls, such as notifying process p of the failure of process q before p tries to communicate with q. (3) There is no description of when error notification will happen relative to the occurrence of the error. In particular, the specification does not state whether an error that would cause MPI functions to return an error code under the MPI{_}ERRORS{_}RETURN error handler would cause a user-defined error handler to be called during the same MPI function or at some earlier or later point in time. (4) Although MPI makes it possible for libraries to define their own error classes and invoke application error handlers, it is not possible for the application to define new error notification patterns either within or across processes. This means that it is not possible for one application process to ask to be informed of errors on other processes or for the application to be informed of specific classes of errors.