U.S. Department of Energy
Office of Scientific and Technical Information

I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey

Journal Article · ACM Computing Surveys
DOI: https://doi.org/10.1145/3722215 · OSTI ID: 2544251
 [1];  [2];  [3]
  1. Louisiana State Univ., Baton Rouge, LA (United States); The Ohio State Univ., Columbus, OH (United States)
  2. Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
  3. The Ohio State Univ., Columbus, OH (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Growing interest in Artificial Intelligence (AI) has resulted in a surge in demand for faster methods of Machine Learning (ML) model training and inference. This demand for speed has prompted the use of high-performance computing (HPC) systems that excel in managing distributed workloads. Because data is the main fuel for AI applications, the performance of the storage and I/O subsystem of HPC systems is critical. In the past, HPC applications accessed large portions of data written by simulations or experiments or ingested data for visualizations or analysis tasks. ML workloads, by contrast, perform small reads spread across a large number of random files. This shift in I/O access patterns poses several challenges to modern parallel storage systems. In this paper, we survey I/O in ML applications on HPC systems, targeting literature within a 6-year time window from 2019 to 2024. We define the scope of the survey, provide an overview of the common phases of ML, review available profilers and benchmarks, examine the I/O patterns encountered during offline data preparation, training, and inference, and explore I/O optimizations utilized in modern ML frameworks and proposed in recent literature. Lastly, we seek to expose research gaps that could spawn further R&D.
Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
2544251
Journal Information:
Journal Name: ACM Computing Surveys; Journal Volume: 57; Journal Issue: 10; ISSN 0360-0300
Publisher:
Association for Computing Machinery (ACM)
Country of Publication:
United States
Language:
English
