Memory Errors in Modern Systems: The Good, The Bad, and The Ugly
- AMD, Inc., Boxborough, MA (United States)
- Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Advanced Micro Devices, Inc., Boxborough, MA (United States)
Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than current memory subsystems. These memory subsystems will need resilience techniques to tolerate such faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. It is therefore critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study yields several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Lastly, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern, and stronger DRAM resilience schemes will be needed to keep failure rates similar to those of today's systems.
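The methodological point about counting errors versus counting faults can be illustrated with a minimal sketch. It is not the authors' analysis code. The log entries, node names, and the rule that errors at the same location belong to one fault are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical corrected-error log: one (node, dram_location) pair per logged error.
# Synthetic values for illustration only; not data from the study.
error_log = (
    [("node-a", "bank3/row17")] * 1000  # one persistent fault, hit repeatedly
    + [("node-b", "bank0/row2")]        # three separate faults, one error each
    + [("node-b", "bank5/row9")]
    + [("node-b", "bank1/row40")]
)

def count_errors(log):
    """Errors per node: every logged event counts."""
    return Counter(node for node, _ in log)

def count_faults(log):
    """Faults per node, assuming errors at the same location share one fault."""
    return Counter(node for node, _ in set(log))

errors = count_errors(error_log)
faults = count_faults(error_log)
print("errors per node:", dict(errors))
print("faults per node:", dict(faults))
```

In this synthetic log, node-a appears far less reliable by error count, yet it has fewer distinct faults than node-b. This is the kind of reversal that can follow from counting errors alone.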
- Research Organization: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number: AC04-94AL85000
- OSTI ID: 1497665
- Report Number(s): SAND-2014-16515J; 672040
- Journal Information: ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 15; Conference: Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, Istanbul, Turkey, 14-18 Mar 2015
- Country of Publication: United States
- Language: English