A study of the viability of exploiting memory content similarity to improve resilience to memory errors
Journal Article
· International Journal of High Performance Computing Applications
- Univ. of New Mexico, Albuquerque, NM (United States)
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory. In this paper, we propose a novel runtime for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the viability of this approach by examining memory snapshots collected from eight high-performance computing (HPC) applications and two important HPC operating systems. Based on the characteristics of the similarity uncovered, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
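As a rough illustration of what examining a memory snapshot for content similarity can involve, the sketch below groups byte-identical pages in a raw snapshot. It is not the runtime proposed in the paper, whose design is not described in this record, and the paper's notion of similarity may be broader than exact duplication. The 4096-byte page size and the function names `group_duplicate_pages` and `duplicate_fraction` are assumptions made for this example.

```python
import hashlib
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes; not taken from the paper


def group_duplicate_pages(snapshot: bytes, page_size: int = PAGE_SIZE) -> dict:
    """Map a content digest to the indices of pages that share that content.

    Only digests shared by two or more pages are returned. A trailing
    partial page, if any, is ignored.
    """
    groups = defaultdict(list)
    for offset in range(0, len(snapshot) - page_size + 1, page_size):
        page = snapshot[offset:offset + page_size]
        digest = hashlib.sha256(page).digest()
        groups[digest].append(offset // page_size)
    # Keep only content that appears more than once. A real system would
    # likely confirm matches with a byte-wise comparison before relying on them.
    return {d: idx for d, idx in groups.items() if len(idx) > 1}


def duplicate_fraction(snapshot: bytes, page_size: int = PAGE_SIZE) -> float:
    """Fraction of whole pages that have at least one identical copy elsewhere."""
    total_pages = len(snapshot) // page_size
    if total_pages == 0:
        return 0.0
    groups = group_duplicate_pages(snapshot, page_size)
    covered = sum(len(indices) for indices in groups.values())
    return covered / total_pages
```

A page that has an identical copy elsewhere in memory is, in principle, a page whose contents could be recovered from that copy after an uncorrectable error, which is why the duplicated fraction is a natural first quantity to measure in a viability study of this kind.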
- Research Organization:
- Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- Grant/Contract Number:
- AC04-94AL85000
- OSTI ID:
- 1111407
- Report Number(s):
- SAND--2013-8030J; 476178
- Journal Information:
- Journal Name: International Journal of High Performance Computing Applications; Journal Volume: 29; Journal Issue: 1; ISSN 1094-3420
- Publisher:
- SAGE
- Country of Publication:
- United States
- Language:
- English
Similar Records
CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows
Journal Article
·
Sat May 30 20:00:00 EDT 2020
· ACM Transactions on Parallel Computing
·
OSTI ID:1769940
PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems
Conference
·
Mon Nov 30 23:00:00 EST 2020
·
OSTI ID:1766396
Rolex: Resilience-oriented language extensions for extreme-scale systems
Journal Article
·
Wed May 25 20:00:00 EDT 2016
· Journal of Supercomputing
·
OSTI ID:1259429