
Title: Large Scale Caching and Streaming of Training Data for Online Deep Learning

Conference

Training deep neural network models on large datasets remains a difficult problem, despite progress towards scalable techniques. In particular, there is a mismatch between the random but predetermined order in which AI flows select training samples and the streaming I/O patterns for which traditional HPC data storage (e.g., parallel file systems) is designed. In addition, as more data are obtained, it is feasible neither to simply train learning models incrementally, because of catastrophic forgetting (i.e., bias towards new samples), nor to retrain frequently from scratch, because of prohibitive time and/or resource constraints. In this paper, we study data management techniques that combine caching and streaming with rehearsal support in order to enable efficient access to training samples in both offline training and continual learning. We revisit state-of-the-art streaming approaches based on data pipelines that transparently handle prefetching, caching, shuffling, and data augmentation, and discuss the challenges and opportunities that arise when combining these methods with data-parallel training techniques. We also report on preliminary experiments that evaluate the I/O overheads involved in accessing the training samples from a parallel file system (PFS) under several concurrency scenarios, highlighting the impact of the PFS on the design of the data pipelines.
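
The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it (`epoch_order`, `PrefetchingCache`, `RehearsalBuffer`, `load_fn`) is hypothetical. It assumes that the "random but predetermined" access order comes from a seeded shuffle, that prefetching means loading a fixed lookahead window into a bounded cache, and that rehearsal keeps a small uniform sample of past data. Reservoir sampling is used for the rehearsal buffer here purely for illustration; the abstract does not specify a selection policy. Loads are performed synchronously for simplicity, whereas a real pipeline would likely issue them in the background.

```python
import random
from collections import OrderedDict


def epoch_order(num_samples, seed, epoch):
    """Sample order for one epoch: random, but fully determined by (seed, epoch)."""
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


class PrefetchingCache:
    """Bounded FIFO cache that loads upcoming samples ahead of the reader.

    Assumption: loads are synchronous here; this only shows how a known
    access order lets the cache be filled before each sample is requested.
    """

    def __init__(self, load_fn, capacity, lookahead):
        # The whole lookahead window must fit, or the current sample
        # could be evicted before it is consumed.
        assert capacity >= lookahead + 1
        self.load_fn = load_fn      # hypothetical reader, e.g. one PFS read
        self.capacity = capacity
        self.lookahead = lookahead
        self.cache = OrderedDict()
        self.loads = 0              # number of calls made to load_fn

    def _fetch(self, idx):
        if idx not in self.cache:
            self.cache[idx] = self.load_fn(idx)
            self.loads += 1
            while len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict the oldest entry

    def iterate(self, order):
        for pos, idx in enumerate(order):
            # The order is known in advance, so the window can be prefetched.
            for nxt in order[pos:pos + self.lookahead + 1]:
                self._fetch(nxt)
            yield idx, self.cache[idx]


class RehearsalBuffer:
    """Fixed-size uniform sample of past data (reservoir sampling, assumed)."""

    def __init__(self, capacity, seed):
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.items = []
        self.seen = 0

    def add(self, sample):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))
```

A training loop under these assumptions would call `epoch_order` once per epoch, stream samples through `PrefetchingCache.iterate`, add each new sample to the `RehearsalBuffer`, and mix a few rehearsed samples into every batch to counteract the bias towards new data.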

Research Organization:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
Resource Relation:
Conference: 12th Workshop on AI and Scientific Computing at Scale Using Flexible Computing Infrastructures, 07/01/22 - 07/01/22, Minneapolis, MN, US
Country of Publication:
United States
