Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications
- Intel Corporation
- Northeastern University, Boston
- ORNL
Resilience is one of the key challenges in maintaining high efficiency on future extreme-scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings that remain valid, report new findings, and discuss their implications.
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1423066
- Resource Relation:
- Conference: 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017 - Denver, Colorado, United States of America - November 12-17, 2017
- Country of Publication:
- United States
- Language:
- English