OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, In: 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

Abstract

As we approach exascale, scientific simulations are expected to experience more interruptions due to increased system failures. Designing better HPC resilience techniques requires understanding the key characteristics of system failures on these systems. While temporal properties of system failures on HPC systems have been well investigated, there is limited understanding of the spatial characteristics of system failures and their impact on resilience mechanisms. Therefore, we examine the spatial characteristics and behavior of system failures. We investigate the interaction between spatial and temporal characteristics of failures and its implications for system operations and resilience mechanisms on large-scale HPC systems. We show that system failures have "spatial locality" at different granularities in the system, study the impact of different failure types, and investigate the correlation among different failure types. Finally, we propose a novel scheme that exploits the spatial locality in failures to improve application and system performance. Our evaluation shows that the proposed scheme significantly improves system performance in a dynamic, production-level HPC system.

Authors:
Gupta, Saurabh; Tiwari, Devesh; Jantzi, Christopher; Rogers, James; Maxwell, Don
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1567393
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Journal Name:
2015 45TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS
Additional Journal Information:
Conference: International Conference on Dependable Systems and Networks, Rio de Janeiro, Brazil, June 22-25, 2015
Country of Publication:
United States
Language:
English
Subject:
Computer Science; Engineering

Citation Formats

Gupta, Saurabh, Tiwari, Devesh, Jantzi, Christopher, Rogers, James, and Maxwell, Don. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, In: 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. United States: N. p., 2015. Web. doi:10.1109/DSN.2015.52.
Gupta, Saurabh, Tiwari, Devesh, Jantzi, Christopher, Rogers, James, & Maxwell, Don. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, In: 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. United States. doi:10.1109/DSN.2015.52.
Gupta, Saurabh, Tiwari, Devesh, Jantzi, Christopher, Rogers, James, and Maxwell, Don. Mon . "Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, In: 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks". United States. doi:10.1109/DSN.2015.52.
@article{osti_1567393,
title = {Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, In: 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks},
author = {Gupta, Saurabh and Tiwari, Devesh and Jantzi, Christopher and Rogers, James and Maxwell, Don},
abstractNote = {As we approach exascale, scientific simulations are expected to experience more interruptions due to increased system failures. Designing better HPC resilience techniques requires understanding the key characteristics of system failures on these systems. While temporal properties of system failures on HPC systems have been well investigated, there is limited understanding of the spatial characteristics of system failures and their impact on resilience mechanisms. Therefore, we examine the spatial characteristics and behavior of system failures. We investigate the interaction between spatial and temporal characteristics of failures and its implications for system operations and resilience mechanisms on large-scale HPC systems. We show that system failures have "spatial locality" at different granularities in the system, study the impact of different failure types, and investigate the correlation among different failure types. Finally, we propose a novel scheme that exploits the spatial locality in failures to improve application and system performance. Our evaluation shows that the proposed scheme significantly improves system performance in a dynamic, production-level HPC system.},
doi = {10.1109/DSN.2015.52},
journal = {2015 45TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS},
number = {},
volume = {},
place = {United States},
year = {2015},
month = {6}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
