OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines

Abstract

Parallel training of a Deep Neural Network (DNN) ensemble on a cluster of nodes is a common practice for training multiple models in order to construct a model with higher prediction accuracy. Existing ensemble training pipelines can perform a great deal of redundant work, resulting in unnecessary CPU usage or even poor pipeline performance. Removing these redundancies requires pipelines with more communication flexibility than existing DNN frameworks provide. This project investigates a series of designs that improve pipeline flexibility and adaptivity while also increasing performance. We implement our designs using TensorFlow with Horovod and test them on several large DNNs on a large-scale GPU cluster, the Titan supercomputer at Oak Ridge National Laboratory. Our results show that the CPU time spent during training is reduced by 2-11X. Furthermore, our implementation can achieve up to 10X speedups when CPU core limits are imposed. Our best pipeline also reduces the average power draw of the ensemble training process by 5-16% compared to the baseline.
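As an illustration of the software stack named in the abstract, the sketch below shows the standard pattern for training a single ensemble member with TensorFlow and Horovod: each worker process is pinned to one GPU, gradients are averaged across workers by Horovod's distributed optimizer, and rank 0 broadcasts the initial weights. The model, dataset, and hyperparameters are placeholders chosen for illustration; this is not the report's actual pipeline design.

    # Minimal sketch: data-parallel training of one ensemble member with
    # TensorFlow + Horovod. Launch with, e.g., `horovodrun -np 4 python train.py`.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each Horovod process to a single GPU on its node.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank() % len(gpus)], "GPU")

    # Placeholder model standing in for one ensemble member.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # Wrap the optimizer so gradients are allreduce-averaged across workers;
    # scale the learning rate with the number of workers, as is conventional.
    opt = hvd.DistributedOptimizer(
        tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size()))
    model.compile(optimizer=opt,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Placeholder dataset.
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x / 255.0

    model.fit(
        x, y,
        batch_size=64,
        epochs=1,
        # Broadcast initial weights from rank 0 so all workers start identically.
        callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
        verbose=1 if hvd.rank() == 0 else 0,
    )

In an ensemble run, each model would typically be trained by its own group of such workers; the report studies how making the communication between pipeline stages more flexible can avoid redundant work across those groups.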

Authors:
 Pittman, Randall [1]; Shen, Xipeng [1]; Patton, Robert M. [2]; Lim, Seung-Hwan [2]
  1. North Carolina State Univ., Raleigh, NC (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Computer Science and Mathematics Division
Publication Date:
2018-03-28
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1435221
Report Number(s):
ORNL/TM-2018/817
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Pittman, Randall, Shen, Xipeng, Patton, Robert M., and Lim, Seung-Hwan. Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines. United States: N. p., 2018. Web. doi:10.2172/1435221.
Pittman, Randall, Shen, Xipeng, Patton, Robert M., & Lim, Seung-Hwan. Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines. United States. doi:10.2172/1435221.
Pittman, Randall, Shen, Xipeng, Patton, Robert M., and Lim, Seung-Hwan. 2018. "Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines". United States. doi:10.2172/1435221. https://www.osti.gov/servlets/purl/1435221.
@article{osti_1435221,
title = {Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines},
author = {Pittman, Randall and Shen, Xipeng and Patton, Robert M. and Lim, Seung-Hwan},
abstractNote = {Parallel training of a Deep Neural Network (DNN) ensemble on a cluster of nodes is a common practice for training multiple models in order to construct a model with higher prediction accuracy. Existing ensemble training pipelines can perform a great deal of redundant work, resulting in unnecessary CPU usage or even poor pipeline performance. Removing these redundancies requires pipelines with more communication flexibility than existing DNN frameworks provide. This project investigates a series of designs that improve pipeline flexibility and adaptivity while also increasing performance. We implement our designs using TensorFlow with Horovod and test them on several large DNNs on a large-scale GPU cluster, the Titan supercomputer at Oak Ridge National Laboratory. Our results show that the CPU time spent during training is reduced by 2-11X. Furthermore, our implementation can achieve up to 10X speedups when CPU core limits are imposed. Our best pipeline also reduces the average power draw of the ensemble training process by 5-16% compared to the baseline.},
doi = {10.2172/1435221},
place = {United States},
year = {2018},
month = {3}
}
