Large Scale Caching and Streaming of Training Data for Online Deep Learning
Training deep neural network models on large data remains a difficult problem, despite progress towards scalable techniques. In particular, there is a mismatch between the random but predetermined order in which AI workflows select training samples and the streaming I/O patterns for which traditional HPC data storage (e.g., parallel file systems) is designed. In addition, as more data are obtained, it is feasible neither to simply train learning models incrementally, due to catastrophic forgetting (i.e., bias towards new samples), nor to frequently retrain from scratch, due to prohibitive time and/or resource constraints. In this paper, we study data management techniques that combine caching and streaming with rehearsal support in order to enable efficient access to training samples in both offline training and continual learning. We revisit state-of-the-art streaming approaches based on data pipelines that transparently handle prefetching, caching, shuffling, and data augmentation, and we discuss the challenges and opportunities that arise when combining these methods with data-parallel training techniques. We also report on preliminary experiments that evaluate the I/O overheads of accessing training samples from a parallel file system (PFS) under several concurrency scenarios, highlighting the impact of the PFS on the design of the data pipelines.
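As a concrete illustration of the kind of pipeline the abstract describes, the sketch below uses TensorFlow's tf.data API (one of the streaming approaches the paper revisits) to build an input pipeline that reads shards from a PFS, caches decoded samples, shuffles, augments, and prefetches, and mixes in a rehearsal stream of previously seen samples. This is a minimal sketch, not the paper's implementation: the file patterns, the raw-JPEG record format, the parse/augment functions, and the 90/10 rehearsal mix are all illustrative assumptions (and `tf.data.Dataset.sample_from_datasets` lives under `tf.data.experimental` in older TensorFlow releases).

```python
# Minimal sketch (not the paper's implementation) of a tf.data input
# pipeline with prefetching, caching, shuffling, augmentation, and a
# rehearsal stream. File patterns, record format, and the 90/10 mix
# are illustrative assumptions.
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def parse(record):
    # Assumes each record holds a raw JPEG payload; a real pipeline
    # would parse a tf.train.Example with the dataset's schema.
    image = tf.io.decode_jpeg(record, channels=3)
    return tf.image.resize(image, [224, 224])

def augment(image):
    # Random augmentation is applied after caching so it stays
    # fresh across epochs.
    return tf.image.random_flip_left_right(image)

def make_stream(file_pattern):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    # Read several shards concurrently to hide per-file PFS latency.
    return files.interleave(tf.data.TFRecordDataset,
                            cycle_length=8,
                            num_parallel_calls=AUTOTUNE)

new_data = make_stream("/pfs/train/new-*.tfrecord")      # fresh samples
rehearsal = make_stream("/pfs/train/replay-*.tfrecord")  # retained samples

# Mix a small fraction of previously seen samples into the new stream
# (rehearsal) to mitigate catastrophic forgetting.
mixed = tf.data.Dataset.sample_from_datasets(
    [new_data, rehearsal], weights=[0.9, 0.1])

pipeline = (mixed
            .map(parse, num_parallel_calls=AUTOTUNE)
            .cache()                      # avoid re-decoding from the PFS
            .shuffle(buffer_size=10_000)  # randomize sample order
            .map(augment, num_parallel_calls=AUTOTUNE)
            .batch(256)
            .prefetch(AUTOTUNE))          # overlap I/O with training compute
```

Placing `cache()` after decoding but before the random augmentation avoids repeated reads and decodes from the PFS while still re-randomizing augmentation each epoch; `prefetch` overlaps storage I/O with training compute, which is precisely the I/O/compute interaction the paper's experiments probe.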
- Research Organization: Argonne National Laboratory (ANL)
- Sponsoring Organization: USDOE Office of Science, Office of Advanced Scientific Computing Research (ASCR)
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1887185
- Country of Publication: United States
- Language: English