Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications

Gupta, Saurabh; Patel, Tirthak; Engelmann, Christian; Tiwari, Devesh

doi:10.1145/3126908.3126937


Conference · November 1, 2017

DOI: https://doi.org/10.1145/3126908.3126937 · OSTI ID: 1423066

Gupta, Saurabh ^[1]; Patel, Tirthak ^[2]; Engelmann, Christian ^[3]; Tiwari, Devesh ^[2]

^[1] Intel Corporation
^[2] Northeastern University, Boston
^[3] ORNL

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss their implications.


Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1423066

Resource Relation: Conference: 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, Denver, Colorado, United States of America, November 12-17, 2017

Country of Publication: United States

Language: English


