U.S. Department of Energy
Office of Scientific and Technical Information

AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis

Conference · Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing

Manually diagnosing the I/O performance bottleneck of a single application (hereinafter referred to as the "job level") is a tedious and error-prone procedure that requires domain scientists to have deep knowledge of complex storage systems. However, existing automatic methods for I/O performance bottleneck diagnosis have one major issue: they analyze at the granularity of the platform or group level, so their diagnosis results cannot be applied to an individual application. To address this issue, we designed and developed a method named "Artificial Intelligence for I/O" (AIIO), which uses AI and its interpretation technology to diagnose I/O performance bottlenecks at the job level automatically. By considering the sparsity of I/O log files, employing multiple AI models for performance prediction, merging diagnosis results across multiple models, and generalizing its performance prediction and diagnosis functions, AIIO can accurately and robustly identify the bottleneck of even an unseen application. Experimental results show that real and unseen applications can use the diagnosis results from AIIO to improve their I/O performance by up to 146 times.
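
As a rough illustration of what "merging diagnosis results across multiple models" could look like, the sketch below combines per-feature contributions from several performance-prediction models into one job-level diagnosis. It is not the AIIO implementation: the merge rule (weighting each model by the inverse of its prediction error on the job), the function names, and the feature names `small_writes`, `unaligned_accesses`, and `seq_writes` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical per-model output for one job: the predicted performance plus
# per-feature contributions (negative = the feature lowers the prediction).
ModelOutput = Tuple[float, Dict[str, float]]


def merge_diagnoses(outputs: List[ModelOutput], observed: float,
                    eps: float = 1e-9) -> Dict[str, float]:
    """Weighted average of per-feature contributions across models.

    Assumed merge rule: each model is weighted by the inverse of its
    absolute prediction error on this job, so models that predict the
    job well dominate the merged diagnosis.
    """
    weights = [1.0 / (abs(pred - observed) + eps) for pred, _ in outputs]
    total = sum(weights)
    merged: Dict[str, float] = {}
    for w, (_, contribs) in zip(weights, outputs):
        for feature, value in contribs.items():
            merged[feature] = merged.get(feature, 0.0) + (w / total) * value
    return merged


def top_bottlenecks(merged: Dict[str, float], k: int = 2) -> List[str]:
    """Return up to k features with the most negative merged contribution."""
    negative = [(value, feature) for feature, value in merged.items() if value < 0]
    return [feature for _, feature in sorted(negative)[:k]]


# Made-up outputs of two models for a single job (illustrative numbers only).
example_outputs: List[ModelOutput] = [
    (95.0, {"small_writes": -40.0, "unaligned_accesses": -10.0, "seq_writes": 5.0}),
    (120.0, {"small_writes": -20.0, "unaligned_accesses": -30.0, "seq_writes": 2.0}),
]
example_merged = merge_diagnoses(example_outputs, observed=100.0)
example_top = top_bottlenecks(example_merged)
```

In this toy run the first model is closer to the observed performance, so its view that `small_writes` is the dominant negative factor carries more weight in the merged result. Other merge rules, such as taking only the best-predicting model or a plain average, would fit the same interface.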

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
AC02-05CH11231
OSTI ID:
2228856
Journal Information:
Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, Conference: HPDC '23: 32. International Symposium on High-Performance Parallel and Distributed Computing, Orlando, FL (United States), 16-23 Jun 2023
Country of Publication:
United States
Language:
English
