Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

Sridharan, Vilas; DeBardeleben, Nathan; Blanchard, Sean; Ferreira, Kurt B.; Stearley, Jon; Shalf, John; Gurumurthi, Sudhanva

doi:10.1145/2694344.2694348

Conference · March 14, 2015 · ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

DOI: https://doi.org/10.1145/2694344.2694348 · OSTI ID: 1497665

Sridharan, Vilas ^[1]; DeBardeleben, Nathan ^[2]; Blanchard, Sean ^[2]; Ferreira, Kurt B. ^[3]; Stearley, Jon ^[3]; Shalf, John ^[4]; Gurumurthi, Sudhanva ^[5]

[1] AMD, Inc., Boxborough, MA (United States)
[2] Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
[3] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
[4] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
[5] Advanced Micro Devices, Inc., Boxborough, MA (United States)

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Lastly, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.
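The distinction between errors and faults in the abstract can be illustrated with a small sketch. A fault is an underlying hardware defect, while an error is one observed manifestation of it, so a single fault can be logged many times. The log below, its field layout, and the rule that each distinct (node, DIMM, address) triple counts as one fault are all assumptions made for illustration; they are not the paper's data or its fault-identification method.

```python
from collections import Counter

# Hypothetical corrected-error log: (node, dimm, physical_address).
# Every value here is invented for illustration.
ERROR_LOG = [
    ("node-a", "dimm0", 0x1000),
    ("node-a", "dimm0", 0x1000),
    ("node-a", "dimm0", 0x1000),
    ("node-a", "dimm0", 0x1000),
    ("node-b", "dimm1", 0x2000),
    ("node-b", "dimm2", 0x3000),
]


def count_errors(log):
    """Count logged error events per node."""
    return Counter(node for node, _dimm, _addr in log)


def count_faults(log):
    """Count distinct faulty locations per node.

    Assumption: each distinct (node, dimm, address) triple is one fault.
    Real fault classification is more involved than this.
    """
    distinct_locations = set(log)
    return Counter(node for node, _dimm, _addr in distinct_locations)


errors = count_errors(ERROR_LOG)
faults = count_faults(ERROR_LOG)
for node in sorted(errors):
    print(f"{node}: {errors[node]} errors, {faults[node]} faults")
```

In this toy log, node-a reports more errors than node-b (4 versus 2), yet node-b has more distinct faults (2 versus 1). Ranking nodes by error count alone would therefore point at the wrong node.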

Research Organization: Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)

DOE Contract Number: AC04-94AL85000

OSTI ID: 1497665

Report Number(s): SAND--2014-16515J; 672040; ISBN: 978-1-4503-2835-7

Conference Information: Journal Name: ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems; Journal Volume: 15

Country of Publication: United States

Language: English
