Title: Scalable Energy Efficiency with Resilience for High Performance Computing Systems: A Quantitative Methodology

Energy efficiency and resilience are two crucial challenges that HPC systems must address to reach exascale. While each has been studied extensively on its own, little has been done to understand the interplay between them in HPC systems. Decreasing the supply voltage associated with a given operating frequency for processors and other CMOS-based components can significantly reduce power consumption. However, this often raises system failure rates and consequently increases application execution time. In this work, we present an energy-saving undervolting approach that leverages mainstream resilience techniques to tolerate the increased failures caused by undervolting.
Authors:
 [1] ;  [1] ;  [2]
  1. Univ. of California, Riverside, CA (United States)
  2. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Publication Date:
OSTI Identifier:
1253867
Report Number(s):
PNNL-SA--113322
Journal ID: ISSN 1544-3566; KJ0402000
DOE Contract Number:
AC05-76RL01830
Resource Type:
Journal Article
Resource Relation:
Journal Name: ACM Transactions on Architecture and Code Optimization; Journal Volume: 12; Journal Issue: 4
Publisher:
Association for Computing Machinery
Research Org:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Org:
USDOE
Country of Publication:
United States
Language:
English
Subject:
32 ENERGY CONSERVATION, CONSUMPTION, AND UTILIZATION; 97 MATHEMATICS AND COMPUTING; energy; resilience; failures; iso-energy-efficiency model; HPC