
A study of the viability of exploiting memory content similarity to improve resilience to memory errors

Journal Article · International Journal of High Performance Computing Applications
 [1];  [2];  [1];  [2];  [2]
  1. Univ. of New Mexico, Albuquerque, NM (United States)
  2. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory. In this paper, we propose a novel runtime for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the viability of this approach by examining memory snapshots collected from eight high-performance computing (HPC) applications and two important HPC operating systems. Based on the characteristics of the similarity uncovered, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
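The idea of exploiting memory content similarity can be illustrated with a minimal sketch. This is not the runtime proposed in the paper, whose design is not described in this record. It assumes a snapshot is available as a byte string, that similarity means byte-identical pages of a fixed size (4096 bytes here, an assumption), and that pages are grouped by SHA-256 digest. The names `group_duplicate_pages`, `duplicate_fraction`, and `find_replacement` are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Optional

PAGE_SIZE = 4096  # assumed page size in bytes; not taken from the paper


def group_duplicate_pages(snapshot: bytes, page_size: int = PAGE_SIZE) -> Dict[bytes, List[int]]:
    """Map each distinct page digest to the indices of the pages that hold that content."""
    groups: Dict[bytes, List[int]] = defaultdict(list)
    for offset in range(0, len(snapshot), page_size):
        page = snapshot[offset:offset + page_size]
        groups[hashlib.sha256(page).digest()].append(offset // page_size)
    return dict(groups)


def duplicate_fraction(groups: Dict[bytes, List[int]]) -> float:
    """Fraction of pages that have at least one identical copy elsewhere in the snapshot."""
    total = sum(len(indices) for indices in groups.values())
    if total == 0:
        return 0.0
    covered = sum(len(indices) for indices in groups.values() if len(indices) > 1)
    return covered / total


def find_replacement(groups: Dict[bytes, List[int]], damaged_index: int) -> Optional[int]:
    """Return the index of another page known to hold the same content, or None."""
    for indices in groups.values():
        if damaged_index in indices:
            for index in indices:
                if index != damaged_index:
                    return index
    return None


# Hypothetical four-page snapshot: pages 0 and 2 are identical.
snapshot = b"A" * PAGE_SIZE + b"B" * PAGE_SIZE + b"A" * PAGE_SIZE + b"C" * PAGE_SIZE
groups = group_duplicate_pages(snapshot)
fraction = duplicate_fraction(groups)
replacement_for_0 = find_replacement(groups, 0)
replacement_for_1 = find_replacement(groups, 1)
```

In this sketch the grouping is computed before any error occurs, so a page later reported as damaged could in principle be restored from a surviving identical copy. Pages with no duplicate, such as page 1 above, gain nothing from this approach, which is why the share of duplicated pages matters when judging viability.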
Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1111407
Report Number(s):
SAND--2013-8030J; 476178
Journal Information:
International Journal of High Performance Computing Applications, Vol. 29, Issue 1; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
