Title: A study of the viability of exploiting memory content similarity to improve resilience to memory errors

Journal Article · International Journal of High Performance Computing Applications
 [1];  [2];  [1];  [2];  [2]
  1. Univ. of New Mexico, Albuquerque, NM (United States)
  2. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory. In this paper, we propose a novel runtime for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the viability of this approach by examining memory snapshots collected from eight high-performance computing (HPC) applications and two important HPC operating systems. Based on the characteristics of the similarity uncovered, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1111407
Report Number(s):
SAND-2013-8030J; 476178
Journal Information:
International Journal of High Performance Computing Applications, Vol. 29, Issue 1; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 1 work
Citation information provided by Web of Science

Similar Records

CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows
Journal Article · May 31, 2020 · ACM Transactions on Parallel Computing · OSTI ID:1111407

Rolex: Resilience-oriented language extensions for extreme-scale systems
Journal Article · May 26, 2016 · Journal of Supercomputing · OSTI ID:1111407

PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems
Conference · December 1, 2020 · OSTI ID:1111407
