OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications

Abstract

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss their implications.

Authors:
Gupta, Saurabh [1]; Patel, Tirthak [2]; Engelmann, Christian [3]; Tiwari, Devesh [2]
  1. Intel Corporation
  2. Northeastern University, Boston
  3. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1423066
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017 - Denver, Colorado, United States of America - 11/12/2017 to 11/17/2017
Country of Publication:
United States
Language:
English

Citation Formats

Gupta, Saurabh, Patel, Tirthak, Engelmann, Christian, and Tiwari, Devesh. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications. United States: N. p., 2017. Web. doi:10.1145/3126908.3126937.
Gupta, Saurabh, Patel, Tirthak, Engelmann, Christian, & Tiwari, Devesh. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications. United States. doi:10.1145/3126908.3126937.
Gupta, Saurabh, Patel, Tirthak, Engelmann, Christian, and Tiwari, Devesh. Wed . "Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications". United States. doi:10.1145/3126908.3126937. https://www.osti.gov/servlets/purl/1423066.
@article{osti_1423066,
title = {Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications},
author = {Gupta, Saurabh and Patel, Tirthak and Engelmann, Christian and Tiwari, Devesh},
abstractNote = {Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss their implications.},
doi = {10.1145/3126908.3126937},
journal = {},
place = {United States},
year = {2017},
month = {11}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
