Title: Reducing Waste in Extreme Scale Systems through Introspective Analysis

Resilience is an important challenge for extreme-scale supercomputers. Today, failures in supercomputers are commonly assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. Our study of the failure logs of multiple supercomputers shows that, during these periods, the failure rate reaches up to three times the average. We design a monitoring system that listens to hardware events and forwards important events to the runtime in order to detect these regime changes. We implement a runtime capable of receiving these notifications and adapting dynamically. In addition, we build an analytical model to predict the gains that such a dynamic approach could achieve. We demonstrate that, in some systems, our approach can reduce wasted time by over 30%.
 [1] ;  [2] ;  [1] ;  [3] ;  [3] ;  [3] ;  [1] ;  [1]
  1. Argonne National Laboratory (ANL)
  2. University of Illinois at Urbana-Champaign, National Center for Supercomputing Applications
  3. Oak Ridge National Laboratory (ORNL)
Publication Date:
OSTI Identifier:
DOE Contract Number:
Resource Type:
Resource Relation:
Conference: 30th IEEE International Parallel & Distributed Processing Symposium, Chicago, IL, USA, May 23-27, 2016
Research Org:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org:
USDOE Office of Science (SC)
Country of Publication:
United States