Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines

Pittman, Randall; Shen, Xipeng; Patton, Robert M.; Lim, Seung-Hwan

doi:10.2172/1435221

Technical Report · Wed Mar 28 00:00:00 EDT 2018

DOI: https://doi.org/10.2172/1435221 · OSTI ID: 1435221

Pittman, Randall [1]; Shen, Xipeng [1]; Patton, Robert M. [2]; Lim, Seung-Hwan [2]

[1] North Carolina State Univ., Raleigh, NC (United States)
[2] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Computer Science and Mathematics Division

Training a Deep Neural Network (DNN) ensemble in parallel on a cluster of nodes is a common way to train multiple models and combine them into a model with higher prediction accuracy. Existing ensemble training pipelines can perform a great deal of redundant operations, resulting in unnecessary CPU usage or even poor pipeline performance. Removing these redundancies requires pipelines with more communication flexibility than existing DNN frameworks provide. This project investigates a series of designs that improve pipeline flexibility and adaptivity while also increasing performance. We implement our designs using TensorFlow with Horovod and test them with several large DNNs on a large-scale GPU cluster, the Titan supercomputer at Oak Ridge National Laboratory. Our results show that the CPU time spent during training is reduced by 2-11X. Furthermore, our implementation achieves speedups of up to 10X when CPU core limits are imposed. Our best pipeline also reduces the average power draw of the ensemble training process by 5-16% compared to the baseline.

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1435221

Report Number(s): ORNL/TM-2018/817

Country of Publication: United States

Language: English

Similar Records

Exploring flexible communications for streamlining DNN ensemble training pipelines

Conference · Thu Nov 01 00:00:00 EDT 2018 · OSTI ID:1435221

Pittman, Randall; Guan, Hui; Shen, Xipeng; +2 more

FLEET: Flexible Efficient Ensemble Training for Heterogeneous Deep Neural Networks

Conference · Sun Mar 01 00:00:00 EST 2020 · OSTI ID:1435221

Guan, Hui; Kishor Mokadam, Laxmikant; Shen, Xipeng; +2 more

Quantum Monte Carlo Endstation for Petascale Computing

Technical Report · Wed Mar 02 00:00:00 EST 2011 · OSTI ID:1435221

Ceperley, David

Related Subjects

97 MATHEMATICS AND COMPUTING
