I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey

Lewis, Noah; Bez, Jean Luca; Byna, Surendra

doi:10.1145/3722215

Journal Article · March 7, 2025 · ACM Computing Surveys

DOI: https://doi.org/10.1145/3722215 · OSTI ID: 2544251

Lewis, Noah [1]; Bez, Jean Luca [2]; Byna, Surendra [3]

[1] Louisiana State Univ., Baton Rouge, LA (United States); The Ohio State Univ., Columbus, OH (United States)
[2] Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
[3] The Ohio State Univ., Columbus, OH (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Growing interest in Artificial Intelligence (AI) has resulted in a surge in demand for faster methods of Machine Learning (ML) model training and inference. This demand for speed has prompted the use of high-performance computing (HPC) systems that excel in managing distributed workloads. Because data is the main fuel for AI applications, the performance of the storage and I/O subsystem of HPC systems is critical. In the past, HPC applications accessed large portions of data written by simulations or experiments, or ingested data for visualization or analysis tasks. In contrast, ML workloads perform small reads spread across a large number of files accessed in random order. This shift in I/O access patterns poses several challenges to modern parallel storage systems. In this paper, we survey I/O in ML applications on HPC systems, targeting literature within a 6-year time window from 2019 to 2024. We define the scope of the survey, provide an overview of the common phases of ML, review available profilers and benchmarks, examine the I/O patterns encountered during offline data preparation, training, and inference, and explore I/O optimizations utilized in modern ML frameworks and proposed in recent literature. Lastly, we seek to expose research gaps that could spawn further R&D.

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Grant/Contract Number: AC02-05CH11231

Journal Information: ACM Computing Surveys, Vol. 57, Issue 10; ISSN 0360-0300

Publisher: Association for Computing Machinery (ACM)

Country of Publication: United States

Language: English

Similar Records

Access Patterns and Performance Behaviors of Multi-layer Supercomputer I/O Subsystems under Production Load

Conference · June 1, 2022 · OSTI ID: 1885368

Access Patterns and Performance Behaviors of Multi-layer Supercomputer I/O Subsystems under Production Load

Conference · June 27, 2022 · OSTI ID: 1959026

Parallel I/O Evaluation Techniques and Emerging HPC Workloads: A Perspective

Conference · October 1, 2021 · OSTI ID: 1973311

Related Subjects

97 MATHEMATICS AND COMPUTING
HPC I/O
I/O access pattern
machine learning
storage
