
Title: Improving Memory Error Handling Using Linux

Technical Report · DOI: https://doi.org/10.2172/1148313 · OSTI ID: 1148313

As supercomputers grow faster and more powerful, they will also have more nodes. If nothing is done, the amount of memory in supercomputer clusters will soon grow so large that handling memory failures by manually replacing DIMMs becomes unmanageable. "Improving Memory Error Handling Using Linux" is a process-oriented method that addresses this problem by using the Linux kernel to disable (offline) faulty memory pages containing bad addresses, preventing any process from using them again. Offlining memory pages simplifies error handling and reduces both the hardware and manpower costs of running Los Alamos National Laboratory (LANL) clusters. This process will be necessary for the development of exascale computers: without memory error handling, it will not be feasible to manually replace the number of DIMMs that will fail daily on a machine with 32-128 petabytes of memory. Testing shows that offlining memory pages works and is relatively simple to use. As further testing is conducted, the entire process will be automated within Zenoss, the high-performance computing (HPC) monitoring software at LANL.
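
The abstract does not say which kernel interface the report uses to offline pages. As an illustration only, the sketch below assumes the standard Linux sysfs file /sys/devices/system/memory/soft_offline_page, which accepts a physical address and asks the kernel to migrate the page's contents and stop using that page. The function names, the 4 KiB page size, and the sysfs_path parameter (which lets the write be redirected to an ordinary file for testing) are assumptions of this sketch, not details from the report. On a real system the write requires root privileges and a kernel built with memory-failure support.

```python
# Minimal sketch: request that the kernel soft-offline the page holding a bad address.
# Assumption: the kernel exposes the soft_offline_page sysfs file (CONFIG_MEMORY_FAILURE).

PAGE_SIZE = 4096  # assumption: 4 KiB pages; query os.sysconf("SC_PAGE_SIZE") on a real node

SOFT_OFFLINE_PATH = "/sys/devices/system/memory/soft_offline_page"


def page_base(phys_addr: int, page_size: int = PAGE_SIZE) -> int:
    """Round a physical address down to the first byte of its page."""
    return phys_addr & ~(page_size - 1)


def offline_page(phys_addr: int, sysfs_path: str = SOFT_OFFLINE_PATH) -> str:
    """Write the page-aligned physical address to the offline interface.

    Returns the string that was written so callers can log it.
    """
    request = hex(page_base(phys_addr))
    with open(sysfs_path, "w") as f:
        f.write(request)
    return request
```

A monitoring tool could call offline_page(addr) for each faulty physical address reported by the node's error logs; how those addresses are collected and fed into Zenoss is outside what the abstract describes.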

Research Organization:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC52-06NA25396
OSTI ID:
1148313
Report Number(s):
LA-UR-14-25823
Country of Publication:
United States
Language:
English
