skip to main content

SciTech ConnectSciTech Connect

Title: Improving Memory Error Handling Using Linux

As supercomputers continue to get faster and more powerful in the future, they will also have more nodes. If nothing is done, then the amount of memory in supercomputer clusters will soon grow large enough that memory failures will be unmanageable to deal with by manually replacing memory DIMMs. "Improving Memory Error Handling Using Linux" is a process oriented method to solve this problem by using the Linux kernel to disable (offline) faulty memory pages containing bad addresses, preventing them from being used again by a process. The process of offlining memory pages simplifies error handling and results in reducing both hardware and manpower costs required to run Los Alamos National Laboratory (LANL) clusters. This process will be necessary for the future of supercomputing to allow the development of exascale computers. It will not be feasible without memory error handling to manually replace the number of DIMMs that will fail daily on a machine consisting of 32-128 petabytes of memory. Testing reveals the process of offlining memory pages works and is relatively simple to use. As more and more testing is conducted, the entire process will be automated within the high-performance computing (HPC) monitoring software, Zenoss, at LANL.
Authors:
 [1] ;  [1] ;  [1]
  1. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Publication Date:
OSTI Identifier:
1148313
Report Number(s):
LA-UR-14-25823
DOE Contract Number:
AC52-06NA25396
Resource Type:
Technical Report
Research Org:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org:
USDOE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING COMPUTER HARDWARE; COMPUTER SCIENCE