U.S. Department of Energy
Office of Scientific and Technical Information

Elastic distributed training with fast convergence and efficient resource utilization

Conference ·
Distributed learning is now routinely conducted on cloud platforms as well as dedicated clusters. Training with elastic resources brings new challenges and design choices. Prior studies focus on runtime performance and assume static algorithmic behavior. In this work, by analyzing the impact of resource scaling on convergence, we introduce schedules for synchronous stochastic gradient descent that proactively adapt the number of learners to reduce training time and improve convergence. Our approach no longer assumes a constant number of processors throughout training. In our experiments, distributed stochastic gradient descent with dynamic schedules and reduction momentum achieves better convergence and significant speedups over prior static schedules. Many distributed training jobs running in the cloud may benefit from our approach.
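The abstract does not give the schedule itself, but the core idea, synchronous data-parallel SGD in which the number of learners follows a schedule rather than staying fixed, can be sketched as follows. This is a minimal simulation under assumed details: the `learner_schedule` function (a linear ramp-up) and the toy least-squares objective are hypothetical illustrations, not the paper's actual schedules or workloads.

```python
import numpy as np

def learner_schedule(epoch, total_epochs, min_learners=2, max_learners=8):
    """Hypothetical dynamic schedule: grow the learner pool as training
    progresses, increasing the effective degree of parallelism."""
    frac = epoch / max(1, total_epochs - 1)
    return int(round(min_learners + frac * (max_learners - min_learners)))

def elastic_sync_sgd(grad_fn, w0, data, total_epochs=50, lr=0.1, seed=0):
    """Synchronous data-parallel SGD where the number of learners per
    epoch is set by a schedule instead of being constant."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for epoch in range(total_epochs):
        n = learner_schedule(epoch, total_epochs)
        # shard the (shuffled) data across the current set of learners
        shards = np.array_split(rng.permutation(data), n)
        # each learner computes a local gradient; a synchronous
        # all-reduce (here: a simple mean) combines them
        g = np.mean([grad_fn(w, s) for s in shards], axis=0)
        w = w - lr * g
    return w

# toy least-squares objective: minimize the mean of (w - x)^2 over samples x,
# whose optimum is the sample mean
grad = lambda w, s: 2 * np.mean(w - s)
data = np.linspace(0.0, 2.0, 64)  # sample mean is 1.0
w_final = elastic_sync_sgd(grad, 0.0, data)
```

In a real elastic deployment the schedule would trigger adding or removing worker processes (and resharding data) rather than resizing an in-process list, but the convergence-relevant quantity, how many gradients are averaged per synchronous step, changes in the same way.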
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1843691
Country of Publication:
United States
Language:
English

Similar Records

Deep Generative Models that Solve PDEs: Distributed Computing for Training Large Data-Free Models
Conference · 2020 · Workshop on Machine Learning in HPC Environments (Online) · OSTI ID:1648524

Probability Convergence in a Multithreaded Counting Application
Conference · 2007 · OSTI ID:910005

Stochastic Spectral Descent for Discrete Graphical Models
Journal Article · 2015 · IEEE Journal of Selected Topics in Signal Processing · OSTI ID:1367144

Related Subjects