
A Case For Intra-rack Resource Disaggregation in HPC

Journal Article · ACM Transactions on Architecture and Code Optimization
DOI: https://doi.org/10.1145/3514245 · OSTI ID: 1878112
The expected halt of traditional technology scaling is motivating increased heterogeneity in high-performance computing (HPC) systems, with the emergence of numerous specialized accelerators. As heterogeneity increases, so does the risk of underutilizing expensive hardware resources if we preserve today’s rigid node configuration and reservation strategies. This has sparked interest in resource disaggregation to enable finer-grain allocation of hardware resources to applications. However, there is currently no data-driven study of what range of disaggregation is appropriate in HPC. To that end, we perform a detailed analysis of key metrics sampled in NERSC’s Cori, a production HPC system that executes a diverse open-science HPC workload. In addition, we profile a variety of deep-learning applications to represent an emerging workload. We show that for a rack (cabinet) configuration and applications similar to Cori, a central processing unit with intra-rack disaggregation has a 99.5% probability of finding all the resources it requires inside its rack. In addition, ideal intra-rack resource disaggregation in Cori could reduce memory and NIC resources by 5.36% to 69.01% and still satisfy the worst-case average rack utilization.
Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Advanced Research Projects Agency - Energy (ARPA-E); USDOE Office of Science (SC)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1878112
Journal Information:
ACM Transactions on Architecture and Code Optimization, Vol. 19, Issue 2; ISSN 1544-3566
Publisher:
Association for Computing Machinery (ACM)
Country of Publication:
United States
Language:
English
