DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Design and implementation of I/O performance prediction scheme on HPC systems through large-scale log analysis

Journal Article · Journal of Big Data

Abstract: Large-scale high-performance computing (HPC) systems typically consist of many thousands of CPUs and storage units used by hundreds to thousands of users simultaneously. Applications from these users have diverse characteristics, such as varying computation, communication, memory, and I/O intensity. A good understanding of the performance characteristics of each user application is important for job scheduling and resource provisioning. Among these characteristics, I/O performance is becoming increasingly important as data sizes grow rapidly and large-scale applications, such as simulation and model training, are widely adopted. However, predicting I/O performance is difficult because I/O systems are shared among all users and involve many layers of the software and hardware stack, including the application, network interconnect, operating system, file system, and storage devices. Furthermore, updates to these layers and changes in system management policy can significantly alter the I/O behavior of applications and of the entire system. To improve the prediction of I/O performance on HPC systems, we propose integrating information from several different system logs and developing a regression-based approach to predict the I/O performance. Our proposed scheme dynamically selects the most relevant features from the log entries using various feature selection algorithms and scoring functions, and automatically selects the regression algorithm with the best accuracy for the prediction task. Evaluation on real logs from the Cori supercomputer system at NERSC shows that the proposed scheme can predict write performance with up to 90% prediction accuracy and read performance with up to 99% prediction accuracy.
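The two-stage idea in the abstract (score and select features, then pick the best-performing regressor) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration and is not taken from the article: the synthetic data stands in for merged log records, absolute Pearson correlation stands in for the scoring functions, the candidates are a mean baseline and two k-nearest-neighbor variants, and models are compared by mean absolute error on a hold-out split. All function names are hypothetical.

```python
import math
import random


def pearson_score(xs, ys):
    """Absolute Pearson correlation between one feature column and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def select_features(X, y, k, score=pearson_score):
    """Return the (sorted) indices of the k highest-scoring feature columns."""
    scores = [score([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def fit_mean(X, y):
    """Baseline candidate: always predict the training mean."""
    mean = sum(y) / len(y)
    return lambda row: mean


def fit_knn(X, y, k=5, weighted=False):
    """k-nearest-neighbor regressor, optionally weighting neighbors by 1/distance."""
    def predict(row):
        nearest = sorted((math.dist(row, r), t) for r, t in zip(X, y))[:k]
        if not weighted:
            return sum(t for _, t in nearest) / len(nearest)
        weights = [1.0 / (d + 1e-9) for d, _ in nearest]
        return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)
    return predict


def mean_abs_error(model, X, y):
    return sum(abs(model(r) - t) for r, t in zip(X, y)) / len(y)


def select_and_fit(X_train, y_train, X_val, y_val, k_features, candidates):
    """Select features on the training split, then keep the candidate with the
    lowest validation error. Returns (selected columns, best name, all errors)."""
    cols = select_features(X_train, y_train, k_features)
    project = lambda X: [[row[j] for j in cols] for row in X]
    Xt, Xv = project(X_train), project(X_val)
    results = {}
    for name, fit in candidates.items():
        results[name] = mean_abs_error(fit(Xt, y_train), Xv, y_val)
    best = min(results, key=results.get)
    return cols, best, results


# Synthetic stand-in for log-derived records: only features 0 and 2 matter.
rng = random.Random(0)
X = [[rng.random() for _ in range(5)] for _ in range(300)]
y = [3.0 * r[0] + 2.0 * r[2] + rng.gauss(0, 0.05) for r in X]

candidates = {
    "mean baseline": fit_mean,
    "k-NN (uniform)": lambda X, y: fit_knn(X, y, k=5),
    "k-NN (distance-weighted)": lambda X, y: fit_knn(X, y, k=5, weighted=True),
}

cols, best, results = select_and_fit(X[:200], y[:200], X[200:], y[200:], 2, candidates)
print("selected feature columns:", cols)
print("validation MAE per candidate:", results)
print("best candidate:", best)
```

Selecting features on the training split only, and comparing candidates on data they were not fitted to, keeps the comparison from favoring whichever model memorizes the training set best. A realistic version would likely swap in a library's scoring functions and regressors and use cross-validation rather than a single hold-out split.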

Sponsoring Organization:
USDOE
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1974087
Alternate ID(s):
OSTI ID: 2229336
Journal Information:
Journal Name: Journal of Big Data; Journal Volume: 10; Journal Issue: 1; ISSN 2196-1115
Publisher:
Springer Science + Business Media
Country of Publication:
Germany
Language:
English
