OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

Conference · ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
 [1];  [2];  [2];  [3];  [3];  [4];  [5]
  1. AMD, Inc., Boxborough, MA (United States)
  2. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
  3. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  4. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  5. Advanced Micro Devices, Inc., Boxborough, MA (United States)

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Lastly, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.
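
The distinction the abstract draws between counting errors and counting faults can be illustrated with a minimal sketch. The log format, node names, and the rule that one (node, DIMM) pair corresponds to one fault are all assumptions made for illustration; the record above does not describe how the study actually identifies faults.

```python
from collections import Counter

# Hypothetical corrected-error log. Each entry is one logged error event.
# Assumption: a (node, dimm) pair stands in for a single underlying fault.
events = [
    ("node-a", "dimm0"),
    ("node-a", "dimm0"),
    ("node-a", "dimm0"),
    ("node-a", "dimm0"),
    ("node-b", "dimm3"),
    ("node-c", "dimm1"),
]


def count_errors(log):
    """Count every logged event once, grouped by node."""
    return Counter(node for node, _ in log)


def count_faults(log):
    """Count each distinct (node, dimm) pair once, however often it was logged."""
    return Counter(node for node, _ in set(log))


errors = count_errors(events)
faults = count_faults(events)

for node in sorted(errors):
    print(node, "errors:", errors[node], "faults:", faults[node])
```

In this toy log, node-a accounts for most of the errors yet has the same number of faults as the other two nodes, because one faulty device was simply read repeatedly. A per-node comparison based on error counts would therefore rank node-a as far less reliable than a comparison based on fault counts would.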

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1497665
Report Number(s):
SAND-2014-16515J; 672040
Journal Information:
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 15; Conference: Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, Istanbul, Turkey, 14-18 Mar 2015
Country of Publication:
United States
Language:
English
