COOLR: A New System for Dynamic Thermal-Aware Computing
- Northwestern Univ., Evanston, IL (United States)
Extreme-scale systems need to walk a fine line between the amount of cooling they receive and the thermally induced performance and reliability degradation they can sustain. System managers are strongly motivated to reduce cooling-energy cost, the largest line item in their total operating cost. On the other hand, pushing the nodes to peak performance in tightened cooling regimes places the burden on dynamic thermal management (DTM) to protect the hardware. DTM schemes throttle performance to relieve heat accumulation within nodes when cooling cannot mitigate the problem. These interventions prevent fatal failures but introduce inevitable performance degradations and variations. Such variations can have drastic consequences on the performance of extreme-scale systems. Furthermore, the distribution of heat across system nodes depends not only on the workload. Even if a strictly equal share of computational load is assigned to all nodes, physical attributes and topological features of each system inherently cause uneven accumulation of heat. These can cause one subcomponent to trigger DTM prematurely and penalize the overall system. Our goal in this project is to create a holistic thermal-aware view of the system, capturing both inherent physical attributes and dynamic system state. We developed COOLR, a dynamic system with the ability to evaluate the interplay between management of computation, data, power dissipation, and thermal state. Thereby, COOLR will achieve higher overall performance at the same energy and cooling cost. First, we performed systematic power and thermal modeling of high-performance computing architectures. Analytical and empirical techniques were blended to generate a lightweight thermal model of target systems. Second, we developed a thermal-aware OS and runtime system. Policies for managing resources and schedules for processes have been designed to co-optimize the thermal state and performance of the system. The main outcome of our proposed thermal-aware dynamic computing system is the scheduling and allocation of processes and memory accesses in a way that leads to an appropriately “skewed” distribution of activity across the system. This arrangement ultimately reduced the occurrence of hotspots and of DTM interventions, leading to improvements in performance. The overall thermal characterization and modeling methodology (thermal instrumentation techniques, model building, and systematic model reduction) resulting from this project can be applied to systems beyond our immediate scope and will be applicable to future scaling. The thermal-aware dynamic computing paradigm will impact the efficiency of extreme-scale systems. The gained “thermal slack” can be given back to further tighten the cooling budget, benefiting the management cost of next-generation extreme-scale systems.
- Research Organization:
- Northwestern Univ., Evanston, IL (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- SC0012531
- OSTI ID:
- 1488442
- Report Number(s):
- DOE-12531
- Resource Relation:
- Related Information: 1. Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components, K Zhang, A Guliani, S Ogrenci-Memik, G Memik, K Yoshii, R Sankaran, IEEE Transactions on Parallel and Distributed Systems 29 (2), 405-419, 2018. 2. Evaluating irregular memory access on OpenCL FPGA platforms: A case study with XSBench, Y Luo, X Wen, K Yoshii, S Ogrenci-Memik, G Memik, H Finkel, F Cappello, International Symposium on Field Programmable Logic and Applications (FPL), 2017. 3. Addressing Thermal and Performance Variability Issues in Dynamic Processors, K Yoshii, P Llopis, K Zhang, Y Luo, S Ogrenci-Memik, 3/1/2017, Report ANL/MCS-TM-368. 4. Minimizing thermal variation across system components, K Zhang, S Ogrenci-Memik, G Memik, K Yoshii, R Sankaran, P Beckman, IEEE Parallel and Distributed Processing Symposium (IPDPS), 2015.
- Country of Publication:
- United States
- Language:
- English