DOE PAGES

Title: A study of the viability of exploiting memory content similarity to improve resilience to memory errors

Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory. In this paper, we propose a novel runtime for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the viability of this approach by examining memory snapshots collected from eight high-performance computing (HPC) applications and two important HPC operating systems. Based on the characteristics of the similarity uncovered, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
Authors:
 [1] ;  [2] ;  [1] ;  [2] ;  [2]
  1. Univ. of New Mexico, Albuquerque, NM (United States)
  2. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
OSTI Identifier:
1111407
Report Number(s):
SAND--2013-8030J
Journal ID: ISSN 1094-3420; 476178
Grant/Contract Number:
AC04-94AL85000
Type:
Accepted Manuscript
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 29; Journal Issue: 1; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Research Org:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org:
USDOE National Nuclear Security Administration (NNSA)
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING