skip to main content

SciTech ConnectSciTech Connect

Title: A Report on Simulation-Driven Reliability and Failure Analysis of Large-Scale Storage Systems

High-performance computing (HPC) storage systems provide data availability and reliability using various hardware and software fault tolerance techniques. Usually, reliability and availability are calculated at the subsystem or component level using limited metrics such as, mean time to failure (MTTF) or mean time to data loss (MTTDL). This often means settling on simple and disconnected failure models (such as exponential failure rate) to achieve tractable and close-formed solutions. However, such models have been shown to be insufficient in assessing end-to-end storage system reliability and availability. We propose a generic simulation framework aimed at analyzing the reliability and availability of storage systems at scale, and investigating what-if scenarios. The framework is designed for an end-to-end storage system, accommodating the various components and subsystems, their interconnections, failure patterns and propagation, and performs dependency analysis to capture a wide-range of failure cases. We evaluate the framework against a large-scale storage system that is in production and analyze its failure projections toward and beyond the end of lifecycle. We also examine the potential operational impact by studying how different types of components affect the overall system reliability and availability, and present the preliminary results
 [1] ;  [1] ;  [1] ;  [1] ;  [2]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  2. Univ. of Tennessee, Knoxville, TN (United States)
Publication Date:
OSTI Identifier:
Report Number(s):
KJ0502000; ERKJZN1
DOE Contract Number:
Resource Type:
Technical Report
Research Org:
Oak Ridge National Laboratory. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org:
USDOE Office of Science (SC)
Country of Publication:
United States