AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- The Ohio State University & Lawrence Berkeley National Laboratory, Columbus, OH, USA
Manually diagnosing the I/O performance bottleneck for a single application (hereinafter referred to as the "job level") is a tedious and error-prone procedure that requires domain scientists to have deep knowledge of complex storage systems. Existing automatic methods for I/O performance bottleneck diagnosis, however, share one major issue: they analyze at the granularity of the platform or a group of jobs, so their diagnosis results cannot be applied to an individual application. To address this issue, we designed and developed a method named "Artificial Intelligence for I/O" (AIIO), which uses AI and its interpretation technology to diagnose I/O performance bottlenecks at the job level automatically. By considering the sparsity of I/O log files, employing multiple AI models for performance prediction, merging diagnosis results across multiple models, and generalizing its performance prediction and diagnosis functions, AIIO can accurately and robustly identify the bottleneck of even an unseen application. Experimental results show that real and unseen applications can use the diagnosis results from AIIO to improve their I/O performance by up to 146 times.
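The idea of merging diagnosis results across multiple models can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the feature names, the weighted-average merge rule, and the convention that a negative score means "this feature lowers predicted I/O performance" are all assumptions made for illustration. Features absent from a model's output are treated as zero, which is one simple way to accommodate sparse log data.

```python
def merge_attributions(per_model, weights=None):
    """Weighted average of per-feature contribution scores across models.

    per_model: {model_name: {feature_name: score}} (hypothetical layout).
    A feature missing from a model's scores counts as 0.0 (sparse input).
    """
    names = list(per_model)
    if weights is None:
        weights = {m: 1.0 for m in names}  # assumption: equal trust in every model
    total = sum(weights[m] for m in names)
    features = sorted({f for scores in per_model.values() for f in scores})
    return {
        f: sum(weights[m] * per_model[m].get(f, 0.0) for m in names) / total
        for f in features
    }


def rank_bottlenecks(merged, top_k=3):
    """Return up to top_k features with negative merged score, worst first."""
    negative = [(f, s) for f, s in merged.items() if s < 0]
    return sorted(negative, key=lambda fs: fs[1])[:top_k]


# Hypothetical attribution scores for one job from two prediction models.
scores = {
    "model_a": {"small_writes": -0.5, "seeks": -0.25, "large_reads": 0.5},
    "model_b": {"small_writes": -1.0, "seeks": 0.25, "large_reads": 0.25},
}
merged = merge_attributions(scores)
bottlenecks = rank_bottlenecks(merged)
print(bottlenecks)
```

In this toy input the two models disagree on the sign of `seeks`, so its merged score cancels out and only `small_writes` is reported, which is the kind of robustness that combining models is meant to provide.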
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- DOE Contract Number:
- AC02-05CH11231
- OSTI ID:
- 2228856
- Journal Information:
- Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, Conference: HPDC '23: 32. International Symposium on High-Performance Parallel and Distributed Computing, Orlando, FL (United States), 16-23 Jun 2023
- Country of Publication:
- United States
- Language:
- English