Title: A Strawman for an HPC PowerStack

Technical Report · DOI: https://doi.org/10.2172/1466153 · OSTI ID: 1466153
 [1];  [1];  [1];  [2];  [3];  [4];  [4];  [4];  [2];  [5];  [5]
  1. Intel Corporation, Santa Clara, CA (United States)
  2. Univ. of Tokyo (Japan)
  3. Ludwig Maximilian Univ., Munich (Germany)
  4. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  5. Technische Univ. München (Germany)

The landscape of High-Performance Computing (HPC) is changing as we enter the exascale era, and power and energy management are key design points for the next generation of supercomputers. Efficiently utilizing procured power and optimizing the performance of scientific applications under power and energy constraints are challenging for several reasons, including dynamic phase behavior, processor manufacturing variability, and increasing heterogeneity of node-level components. Extending the scope from the node and application levels up to the system level introduces further challenges on top of power-unaware job scheduling, which is known to be NP-hard on its own. While there exist several individual efforts across the community to research automatic techniques for managing power and energy better, the majority of these techniques have been specialized to meet the needs of a specific HPC center or specific optimization goals and provide little support for connecting them to each other. Some projects, most notably the PowerAPI efforts [1], discuss interfaces that form a good starting point for full stack integration, but these interfaces still need to be hooked up to the wide range of software components offered by academic partners, developers, or vendors. Furthermore, a recent survey conducted by the EE HPC WG concluded that the majority of such techniques have lacked the application-awareness required to achieve the best system performance and throughput. Other observations were that each technique tended to target management of power and energy for a different subset of the site or system hardware and that the different techniques tended to perform management at different (and often conflicting) granularities. Unfortunately, the existing techniques have not been designed to coexist on one site and cooperate on management in an integrated manner [2].
The lack of application-awareness, the lack of coordinated management across different granularities, and the lack of widely accepted interfaces, with the consequent limited connectivity between modules, can result in substantially underutilized Watts and FLOPS. To address these gaps, the HPC community needs a holistic stack for power and energy management, but none currently exists. In our view, a holistic stack is one that is extensible enough to support the present and anticipated future needs of different HPC centers, one that achieves the best system performance and throughput through application-awareness, one that is designed to coordinate management at different granularities, and one that enables the seamless integration of software components from different developers and vendors. In this seminar, our goal is to bring together experts from academia, research laboratories, and industry in order to take stock of existing approaches, design points, and power management concepts and requirements at the various participants' institutions, and to design concepts for a holistic power and energy management stack, which we refer to as the HPC PowerStack. Further, we hope to take steps toward defining the interfaces necessary for providing a complete prototype shared among many groups. The intention is to align development and research efforts across the community so that we may share development resources, avoid duplicating effort, agree on common interfaces, and reap the rewards together as a community. The intended final outcome of this collaboration is a holistic, flexible, and extensible concept of a software stack ecosystem that allows us to combine product-grade open-source software components to enable runtime optimization of system power, energy, and performance.

Research Organization:
Intel Corporation (United States); Lawrence Livermore National Lab. (LLNL) (United States); Technische Universität München (Germany); Univ. of Tokyo (Japan)
Sponsoring Organization:
Intel Corporation (United States); Lawrence Livermore National Lab. (LLNL) (United States); Technische Universität München (Germany); University of Tokyo (Japan); USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
AC52-07NA27344
OSTI ID:
1466153
Report Number(s):
LLNL-TR-756268; 943759
Country of Publication:
United States
Language:
English