U.S. Department of Energy
Office of Scientific and Technical Information

COOLR: A New System for Dynamic Thermal-Aware Computing

Technical Report
DOI: https://doi.org/10.2172/1488442 · OSTI ID: 1488442

Extreme-scale systems must walk a fine line between the amount of cooling they receive and the thermally induced performance and reliability degradation they can sustain. System managers are strongly motivated to reduce cooling-energy cost, the largest line item in their total operating cost. On the other hand, pushing nodes to peak performance under tightened cooling regimes places the burden of protecting the hardware on dynamic thermal management (DTM). DTM schemes throttle performance to relieve heat accumulation within nodes when cooling cannot mitigate the problem. These interventions prevent fatal failures but inevitably introduce performance degradation and variation, which can have drastic consequences for the performance of extreme-scale systems. Furthermore, the distribution of heat across system nodes depends on more than the amount of workload. Even if a strictly equal share of computational load is assigned to every node, the physical attributes and topological features of each system inherently cause uneven accumulation of heat. This unevenness can cause one subcomponent to trigger DTM prematurely and penalize the overall system. Our goal in this project was to create a holistic thermal-aware view of the system, capturing both inherent physical attributes and dynamic system state. We developed COOLR, a dynamic system that evaluates the interplay between the management of computation, data, power dissipation, and thermal state. COOLR will thereby achieve higher overall performance at the same energy and cooling cost. First, we performed systematic power and thermal modeling of high-performance computing architectures. Analytical and empirical techniques were blended to generate a lightweight thermal model of the target systems. Second, we developed a thermal-aware OS and runtime system. Policies for managing resources and process schedules have been designed to co-optimize the thermal state and performance of the system. The main outcome of our thermal-aware dynamic computing system is the scheduling and allocation of processes and memory accesses in a way that leads to an appropriately “skewed” distribution of activity across the system. This arrangement reduced the occurrence of hotspots and the number of DTM interventions, leading to improvements in performance. The overall thermal characterization and modeling methodology (thermal instrumentation techniques, model building, and systematic model reduction) resulting from this project can be applied to systems beyond our immediate scope and will be applicable to future scaling. The thermal-aware dynamic computing paradigm will impact the efficiency of extreme-scale systems. The gained “thermal slack” can be given back to further tighten the cooling budget, benefiting the management cost of next-generation extreme-scale systems.

Research Organization:
Northwestern Univ., Evanston, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
SC0012531
OSTI ID:
1488442
Report Number(s):
DOE-12531
Resource Relation:
Related Information:
1. Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components, K Zhang, A Guliani, S Ogrenci-Memik, G Memik, K Yoshii, R Sankaran, IEEE Transactions on Parallel and Distributed Systems 29 (2), 405-419, 2018
2. Evaluating irregular memory access on OpenCL FPGA platforms: A case study with XSBench, Y Luo, X Wen, K Yoshii, S Ogrenci-Memik, G Memik, H Finkel, F Cappello, International Symposium on Field Programmable Logic and Applications (FPL), 2017
3. Addressing Thermal and Performance Variability Issues in Dynamic Processors, K Yoshii, P Llopis, K Zhang, Y Luo, S Ogrenci-Memik, 3/1/2017, Report ANL/MCS-TM-368
4. Minimizing thermal variation across system components, K Zhang, S Ogrenci-Memik, G Memik, K Yoshii, R Sankaran, P Beckman, IEEE Parallel and Distributed Processing Symposium (IPDPS), 2015
Country of Publication:
United States
Language:
English

