Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems
Parallel I/O hardware and software infrastructure is a key contributor to performance variability for applications running on large-scale HPC systems. This variability confounds efforts to predict application performance for characterization, modeling, optimization, and job scheduling. We propose a modeling approach that improves predictive ability by explicitly treating the variability and by leveraging the sensitivity of performance to application parameters to group applications with similar characteristics. We develop a Gaussian process-based machine learning algorithm to model I/O performance and its variability as a function of application and file system characteristics. We demonstrate the effectiveness of the proposed approach using data collected from the Edison system at the National Energy Research Scientific Computing Center. The results show that the proposed sensitivity-based models are better at prediction when compared with application-partitioned or unpartitioned models. We highlight modeling techniques that are robust to the outliers that can occur in production parallel file systems. Using the developed metrics and modeling approach, we provide insights into the file system metrics that have a significant impact on I/O performance.
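The key property of the Gaussian process approach described above is that it yields both a predicted value and an uncertainty estimate, which is what lets the model treat performance variability explicitly. The following is a minimal, self-contained sketch of GP regression with an RBF kernel on a toy 1-D input; the function names and the single-feature setup are illustrative assumptions, not the paper's actual feature set or implementation.

```python
import math

def rbf(x, y, ls=1.0):
    # Squared-exponential (RBF) kernel: similarity decays with distance.
    return math.exp(-0.5 * ((x - y) / ls) ** 2)

def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_predict(xs, ys, x_star, noise=1e-6, ls=1.0):
    # Predictive mean and variance at x_star given training data (xs, ys).
    n = len(xs)
    K = [[rbf(xs[i], xs[j], ls) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, ys)                       # K^-1 y
    k_star = [rbf(x, x_star, ls) for x in xs]  # covariances to the query point
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)                       # K^-1 k_star
    var = rbf(x_star, x_star, ls) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var
```

Near observed configurations the predictive variance is small; far from the data it reverts toward the prior variance, flagging low confidence. In the I/O-modeling setting, that variance is what makes the variability itself a modeled quantity rather than noise to be averaged away.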
- Research Organization: Argonne National Lab. (ANL), Argonne, IL (United States)
- Sponsoring Organization: USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR)
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1491024
- Resource Relation: Conference: 2018 ISC High Performance, 06/24/18 - 06/28/18, Frankfurt, DE
- Country of Publication: United States
- Language: English