A Case For Intra-rack Resource Disaggregation in HPC
Journal Article · ACM Transactions on Architecture and Code Optimization
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- NVIDIA, Santa Clara, CA (United States)
- Columbia University, New York, NY (United States)
The expected halt of traditional technology scaling is motivating increased heterogeneity in high-performance computing (HPC) systems with the emergence of numerous specialized accelerators. As heterogeneity increases, so does the risk of underutilizing expensive hardware resources if we preserve today’s rigid node configuration and reservation strategies. This has sparked interest in resource disaggregation to enable finer-grain allocation of hardware resources to applications. However, there is currently no data-driven study of what range of disaggregation is appropriate in HPC. To that end, we perform a detailed analysis of key metrics sampled in NERSC’s Cori, a production HPC system that executes a diverse open-science HPC workload. In addition, we profile a variety of deep-learning applications to represent an emerging workload. We show that for a rack (cabinet) configuration and applications similar to Cori, a central processing unit with intra-rack disaggregation has a 99.5% probability of finding all the resources it requires inside its rack. In addition, ideal intra-rack resource disaggregation in Cori could reduce memory and NIC resources by 5.36% to 69.01% and still satisfy the worst-case average rack utilization.
- Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization: USDOE Advanced Research Projects Agency - Energy (ARPA-E); USDOE Office of Science (SC)
- Grant/Contract Number: AC02-05CH11231
- OSTI ID: 1878112
- Journal Information: ACM Transactions on Architecture and Code Optimization; Vol. 19, Issue 2; ISSN 1544-3566
- Publisher: Association for Computing Machinery (ACM)
- Country of Publication: United States
- Language: English