Memory Errors in Modern Systems: The Good, The Bad, and The Ugly
Conference · ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
- AMD, Inc., Boxborough, MA (United States)
- Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Advanced Micro Devices, Inc., Boxborough, MA (United States)
Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Lastly, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.
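The methodological point in the abstract, that counting errors rather than faults can mislead, can be illustrated with a minimal sketch. It is not the authors' analysis pipeline. The log records, the field layout, and the rule that all errors at the same (node, DIMM, address) location stem from a single fault are simplifying assumptions made here for illustration only.

```python
from collections import Counter

# Hypothetical corrected-error log: one (node, dimm, address) tuple per logged error.
# These records are made up for illustration; they are not data from the study.
error_log = [
    ("node-a", 0, 0x1000),
    ("node-a", 0, 0x1000),
    ("node-a", 0, 0x1000),
    ("node-a", 0, 0x1000),
    ("node-b", 1, 0x2000),
    ("node-c", 0, 0x3000),
]


def count_errors(log):
    """Count every logged error event, grouped by node."""
    return Counter(node for node, _, _ in log)


def count_faults(log):
    """Count distinct fault locations per node.

    Assumption: repeated errors at the same (node, dimm, address)
    are treated as symptoms of one underlying fault.
    """
    distinct_locations = {(node, dimm, addr) for node, dimm, addr in log}
    return Counter(node for node, _, _ in distinct_locations)


errors = count_errors(error_log)
faults = count_faults(error_log)

for node in sorted(errors):
    print(f"{node}: {errors[node]} errors, {faults[node]} faults")
```

In this toy log, one node accounts for most of the errors, yet under the stated assumption every node has the same number of faults. An error count alone would make that node look less reliable than the others, even though the repeated errors may simply reflect one fault being accessed repeatedly.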
- Research Organization:
- Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
- DOE Contract Number:
- AC04-94AL85000
- OSTI ID:
- 1497665
- Report Number(s):
- SAND--2014-16515J; 672040; ISBN: 978-1-4503-2835-7
- Conference Information:
- Journal Name: ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems; Journal Volume: 15
- Country of Publication:
- United States
- Language:
- English
Similar Records
Havens: Explicit Reliable Memory Regions for HPC Applications
Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults. In: SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Conference · Thu Dec 31 23:00:00 EST 2015 · OSTI ID:1330545
Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults. In: SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Conference · Mon Dec 31 23:00:00 EST 2012 · OSTI ID:1567632