Title: Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications

Conference

Resilience is one of the key challenges in maintaining high efficiency of future extreme-scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute-node hours across five different systems over a period of eight years. We confirm previous findings that remain valid, report new findings, and discuss their implications.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1423066
Resource Relation:
Conference: 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017 - Denver, Colorado, United States of America - 11/12/2017-11/17/2017
Country of Publication:
United States
Language:
English
