Strategies to Deploy and Scale Deep Learning on the Summit Supercomputer
- ORNL
The rapid growth and wide applicability of Deep Learning (DL) frameworks pose challenges to computing centers, which need to deploy and support the software, and to domain scientists, who have to keep up with the system environment and scale up scientific exploration through DL. We offer recommendations for deploying and scaling DL frameworks on the Summit supercomputer, currently atop the Top500 list, at the Oak Ridge National Laboratory Leadership Computing Facility (OLCF). We discuss DL software deployment in the form of containers and compare the performance of native-built frameworks with that of containerized deployment. Software containers show no noticeable negative performance impact, exhibit faster Python loading times, and promise easier maintenance. To explore strategies for scaling up DL model training campaigns, we assess DL compute kernel performance, discuss and recommend I/O data formats and staging, and identify communication needs for scalable message exchange for DL runs at scale. As a best practice, we recommend that users take a step-wise tuning approach, beginning with algorithmic kernel choice, then node I/O configuration, and then communications tuning. We present baseline examples, including a scaling efficiency of 87% for a DL run of ResNet50 on 1024 nodes (6144 V100 GPUs).
- Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 1606652
- Country of Publication: United States
- Language: English
Similar Records
Scaling the Summit: Deploying the World's Fastest Supercomputer
Conference · Sat Jun 01 00:00:00 EDT 2019 · OSTI ID: 1561654
Anderson Acceleration for Distributed Training of Deep Learning Models
Conference · Mon Feb 28 23:00:00 EST 2022 · OSTI ID: 1866678
Comparative evaluation of deep learning workloads for leadership-class systems
Conference · Fri Oct 01 00:00:00 EDT 2021 · OSTI ID: 1838972