OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: SMALE: Enhancing Scalability of Machine Learning Algorithms on Extreme Scale Computing Platforms

Abstract

Deployment and execution of machine learning tasks on extreme-scale computing platforms face several significant technical challenges. 1) High computing cost incurred by dense networks: the computing workload of deep networks with densely connected topologies grows rapidly with network size, imposing a computing model that does not scale on extreme-scale platforms. 2) Non-optimized workload distribution: many advanced deep learning techniques, e.g., sparsification and irregular network topologies, produce highly unbalanced workload distributions on extreme-scale computing platforms; computation efficiency is greatly hindered by the resulting data and computation redundancies and by the long tails caused by nodes with heavy workloads. 3) Constraints on data movement and the I/O bottleneck: inter-node data movement on extreme-scale computing platforms carries high energy and latency costs and is subject to I/O bandwidth constraints. 4) Generalization of algorithm realization and acceleration across computing platforms: the large variety of machine learning algorithms and of extreme-scale computing platform architectures makes it very challenging to derive a generalized method for algorithm realization and acceleration, which is nonetheless what domain scientists and interested users require. We call these challenges Smale's Problems in Machine Learning and Understanding for High-Performance Computing Scientific Discovery. The objective of our three-year research project is to develop a holistic set of innovations at the structure, assembly, and acceleration layers of machine learning algorithms to address these challenges in algorithm deployment and execution. Three tasks are performed: (1) at the algorithm structure level, we investigate techniques that structurally sparsify the topology of deep networks to reduce computing workload, and we study clustering and pruning techniques that optimize workload distribution over extreme-scale computing platforms; (2) at the algorithm assembly level, we derive a unified learning framework with unsupervised transfer learning and dynamic growing capabilities, and we explore novel training methods to improve the training efficiency of the proposed framework; (3) at the algorithm acceleration level, we develop a series of techniques that accelerate sparse matrix operations, which are among the core computations in deep learning, and that optimize memory access on the target platforms. Our proposed techniques attack the fundamental problems of machine learning algorithms running on extreme-scale computing platforms by vertically integrating solutions at three closely entangled layers, paving a long-term scaling path for machine learning applications in the DOE context. The three tasks corresponding to these research directions are carried out over the three-year project period with our collaborators at ORNL. The outcome of the project is anticipated to form a holistic solution set of novel algorithms and network topologies, efficient training techniques, and fast acceleration methods that promote the computing scalability of machine learning applications of particular interest to DOE.
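
For readers unfamiliar with the structural sparsification mentioned in the first task, the sketch below illustrates the general kind of technique: a group-Lasso penalty applied to whole convolution filters, so that entire filters rather than scattered weights are driven toward zero and can be pruned, shrinking the network topology and its computing workload. This is a minimal, hypothetical PyTorch example written under our own assumptions (the helper name filter_group_lasso, the layer shapes, and the penalty weight lam are illustrative), not code released by the SMALE project.

```python
# Minimal sketch (illustrative only): structured sparsification of a
# convolutional layer via a group-Lasso penalty over whole output filters.
import torch
import torch.nn as nn

def filter_group_lasso(conv: nn.Conv2d) -> torch.Tensor:
    # weight shape: (out_channels, in_channels, kH, kW);
    # one group per output filter -> sum of per-filter L2 norms (an L2,1 norm)
    return conv.weight.flatten(1).norm(dim=1).sum()

# Hypothetical usage inside a training step
conv = nn.Conv2d(64, 128, kernel_size=3)
x = torch.randn(8, 64, 32, 32)
task_loss = conv(x).pow(2).mean()   # stand-in for the real data loss
lam = 1e-4                          # regularization strength (assumed)
loss = task_loss + lam * filter_group_lasso(conv)
loss.backward()

# After training, filters whose norm falls below a threshold can be removed,
# reducing both computation and the unbalanced workload of dense layers.
```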

Authors:
Chen, Yiran [1]; Li, Hai [1]
  1. Duke Univ., Durham, NC (United States)
Publication Date:
February 2022
Research Org.:
Yiran Chen/Duke University
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Contributing Org.:
Duke University
OSTI Identifier:
1846568
Report Number(s):
DOE-Duke-18064-1
DOE Contract Number:  
SC0018064
Resource Type:
Technical Report
Resource Relation:
Related Information:
1. F. Chen, L. Song, and Y. Chen, "ReGAN: A Pipelined ReRAM-Based Accelerator for Generative Adversarial Networks," Asia and South Pacific Design Automation Conference (ASP-DAC), Jan. 2018, pp. 178-183. DOI: 10.1109/ASPDAC.2018.8297302.
2. L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating Graph Processing Using ReRAM," International Symposium on High-Performance Computer Architecture (HPCA), Feb. 2018, pp. 531-543. DOI: 10.1109/HPCA.2018.00052.
3. B. Li, L. Song, F. Chen, X. Qian, Y. Chen, and H. Li, "ReRAM-based Accelerator for Deep Learning," Design, Automation & Test in Europe (DATE), Mar. 2018, pp. 815-820. DOI: 10.23919/DATE.2018.8342118.
4. C. Min, J. Mao, H. Li, and Y. Chen, "NeuralHMC: An Efficient HMC-based Accelerator for Deep Neural Networks," Asia and South Pacific Design Automation Conference (ASP-DAC), Jan. 2019, pp. 394-399. DOI: 10.1145/3287624.3287642.
5. F. Chen, L. Song, H. Li, and Y. Chen, "ZARA: A Novel Zero-free Dataflow Accelerator for Generative Adversarial Networks in 3D ReRAM," Design Automation Conference (DAC), Jun. 2019, Article no. 133. DOI: 10.1145/3316781.3317936.
6. J. Mao, Q. Yang, A. Li, H. Li, and Y. Chen, "MobiEye: An Efficient Cloud-based Video Detection System for Real-time Mobile Applications," Design Automation Conference (DAC), Jun. 2019, Article no. 102. DOI: 10.1145/3316781.3317865.
7. L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array," International Symposium on High-Performance Computer Architecture (HPCA), Feb. 2019, pp. 56-68. DOI: 10.1109/HPCA.2019.00027.
8. W. Wen, F. Yan, Y. Chen, and H. Li, "AutoGrow: Automatic Layer Growing in Deep Convolutional Networks," ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Aug. 2020, pp. 833-841. DOI: 10.1145/3394486.3403126.
9. X. Liu, M. Mao, X. Bi, H. Li, and Y. Chen, "Exploring Applications of STT-RAM in GPU Architectures," IEEE Transactions on Circuits and Systems I (TCAS-I), vol. 68, no. 1, Jan. 2021, pp. 238-249. DOI: 10.1109/TCSI.2020.3031895.
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; machine learning; efficiency; extreme-scale

Citation Formats

Chen, Yiran, and Li, Hai. SMALE: Enhancing Scalability of Machine Learning Algorithms on Extreme Scale Computing Platforms. United States: N. p., 2022. Web. doi:10.2172/1846568.
Chen, Yiran, & Li, Hai. SMALE: Enhancing Scalability of Machine Learning Algorithms on Extreme Scale Computing Platforms. United States. https://doi.org/10.2172/1846568
Chen, Yiran, and Li, Hai. 2022. "SMALE: Enhancing Scalability of Machine Learning Algorithms on Extreme Scale Computing Platforms". United States. https://doi.org/10.2172/1846568. https://www.osti.gov/servlets/purl/1846568.
@article{osti_1846568,
title = {SMALE: Enhancing Scalability of Machine Learning Algorithms on Extreme Scale Computing Platforms},
author = {Chen, Yiran and Li, Hai},
abstractNote = {Deployment and execution of machine learning tasks on extreme-scale computing platforms face several significant technical challenges. 1) High computing cost incurred by dense networks: the computing workload of deep networks with densely connected topologies grows rapidly with network size, imposing a computing model that does not scale on extreme-scale platforms. 2) Non-optimized workload distribution: many advanced deep learning techniques, e.g., sparsification and irregular network topologies, produce highly unbalanced workload distributions on extreme-scale computing platforms; computation efficiency is greatly hindered by the resulting data and computation redundancies and by the long tails caused by nodes with heavy workloads. 3) Constraints on data movement and the I/O bottleneck: inter-node data movement on extreme-scale computing platforms carries high energy and latency costs and is subject to I/O bandwidth constraints. 4) Generalization of algorithm realization and acceleration across computing platforms: the large variety of machine learning algorithms and of extreme-scale computing platform architectures makes it very challenging to derive a generalized method for algorithm realization and acceleration, which is nonetheless what domain scientists and interested users require. We call these challenges Smale's Problems in Machine Learning and Understanding for High-Performance Computing Scientific Discovery. The objective of our three-year research project is to develop a holistic set of innovations at the structure, assembly, and acceleration layers of machine learning algorithms to address these challenges in algorithm deployment and execution. Three tasks are performed: (1) at the algorithm structure level, we investigate techniques that structurally sparsify the topology of deep networks to reduce computing workload, and we study clustering and pruning techniques that optimize workload distribution over extreme-scale computing platforms; (2) at the algorithm assembly level, we derive a unified learning framework with unsupervised transfer learning and dynamic growing capabilities, and we explore novel training methods to improve the training efficiency of the proposed framework; (3) at the algorithm acceleration level, we develop a series of techniques that accelerate sparse matrix operations, which are among the core computations in deep learning, and that optimize memory access on the target platforms. Our proposed techniques attack the fundamental problems of machine learning algorithms running on extreme-scale computing platforms by vertically integrating solutions at three closely entangled layers, paving a long-term scaling path for machine learning applications in the DOE context. The three tasks corresponding to these research directions are carried out over the three-year project period with our collaborators at ORNL. The outcome of the project is anticipated to form a holistic solution set of novel algorithms and network topologies, efficient training techniques, and fast acceleration methods that promote the computing scalability of machine learning applications of particular interest to DOE.},
doi = {10.2172/1846568},
url = {https://www.osti.gov/biblio/1846568},
place = {United States},
year = {2022},
month = {2}
}

Works referenced in this record:

AutoGrow: Automatic Layer Growing in Deep Convolutional Networks
conference, August 2020

  • Wen, Wei; Yan, Feng; Chen, Yiran
  • KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
  • https://doi.org/10.1145/3394486.3403126

NeuralHMC: an efficient HMC-based accelerator for deep neural networks
conference, January 2019

  • Min, Chuhan; Mao, Jiachen; Li, Hai
  • ASPDAC '19: 24th Asia and South Pacific Design Automation Conference, Proceedings of the 24th Asia and South Pacific Design Automation Conference
  • https://doi.org/10.1145/3287624.3287642

Exploring Applications of STT-RAM in GPU Architectures
journal, January 2021

  • Liu, X.; Mao, M.; Bi, X.
  • IEEE Transactions on Circuits and Systems I (TCAS-I), vol. 68, no. 1
  • https://doi.org/10.1109/TCSI.2020.3031895

ReGAN: A pipelined ReRAM-based accelerator for generative adversarial networks
conference, January 2018

  • Chen, Fan; Song, Linghao; Chen, Yiran
  • 23rd Asia and South Pacific Design Automation Conference (ASP-DAC)
  • https://doi.org/10.1109/ASPDAC.2018.8297302

ZARA: A Novel Zero-free Dataflow Accelerator for Generative Adversarial Networks in 3D ReRAM
conference, June 2019

  • Chen, Fan; Song, Linghao; Li, Hai Helen
  • DAC '19: The 56th Annual Design Automation Conference 2019, Proceedings of the 56th Annual Design Automation Conference 2019
  • https://doi.org/10.1145/3316781.3317936

HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
conference, February 2019

  • Song, Linghao; Mao, Jiachen; Zhuo, Y.
  • International Symposium on High-Performance Computer Architecture (HPCA)
  • https://doi.org/10.1109/HPCA.2019.00027

GraphR: Accelerating Graph Processing Using ReRAM
conference, February 2018

  • Song, Linghao; Zhuo, Y.; Qian, X.
  • International Symposium on High-Performance Computer Architecture (HPCA)
  • https://doi.org/10.1109/HPCA.2018.00052

MobiEye: An Efficient Cloud-based Video Detection System for Real-time Mobile Applications
conference, June 2019

  • Mao, Jiachen; Yang, Qing; Li, Ang
  • DAC '19: The 56th Annual Design Automation Conference 2019, Proceedings of the 56th Annual Design Automation Conference 2019
  • https://doi.org/10.1145/3316781.3317865

ReRAM-based accelerator for deep learning
conference, March 2018

  • Li, B.; Song, Linghao; Chen, Fan
  • Design, Automation & Test in Europe (DATE)
  • https://doi.org/10.23919/DATE.2018.8342118