U.S. Department of Energy
Office of Scientific and Technical Information

Exploring flexible communications for streamlining DNN ensemble training pipelines

Conference
OSTI ID:1509557
Training a Deep Neural Network (DNN) ensemble in parallel on a cluster of nodes is common practice, either to combine multiple models into one with higher prediction accuracy or to quickly tune the parameters of a training model. Existing ensemble training pipelines perform many redundant operations, resulting in unnecessary CPU usage or even poor pipeline performance. Removing these redundancies requires pipelines with more communication flexibility than existing DNN frameworks provide. This project investigates a series of designs that improve pipeline flexibility and adaptivity while also increasing performance. We implement our designs using TensorFlow with Horovod and test them with several large DNNs on a large-scale GPU cluster, the Titan supercomputer at Oak Ridge National Laboratory. Our results show that the new flexible communication schemes reduce the CPU time spent during training by 2-11X. Furthermore, our implementation achieves up to 10X speedups when CPU core limits are imposed. Our best pipeline also reduces the average power draw of the ensemble training process by 5-16% compared to the baseline.
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1509557
Country of Publication:
United States
Language:
English

Similar Records

Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines
Technical Report · Wed Mar 28 00:00:00 EDT 2018 · OSTI ID:1435221

FLEET: Flexible Efficient Ensemble Training for Heterogeneous Deep Neural Networks
Conference · Sat Feb 29 23:00:00 EST 2020 · OSTI ID:1761655

Adaptive Neuron Apoptosis for Accelerating Deep Learning on Large Scale Systems
Conference · Sun Feb 05 23:00:00 EST 2017 · OSTI ID:1440683
