Machine Learning Assisted HPC Workload Trace Generation for Leadership Scale Storage Systems
- ORNL
Monitoring and analyzing a wide range of I/O activities in an HPC cluster is important for maintaining mission-critical performance in a large-scale, multi-user, parallel storage system. Center-wide I/O traces can provide both high-level information and fine-grained activity per application or per user running in the system. Studying such large-scale traces can provide helpful insights into the system. The traces can be used to develop predictive methods for making decisions, adjusting scheduling policies, or guiding the design of next-generation systems. However, sharing real-world I/O traces to expedite such research efforts raises two concerns: (i) sharing the traces is expensive because of their large size, and (ii) the traces raise privacy concerns. We address these issues by building an end-to-end machine learning (ML) workflow that can generate I/O traces for large-scale HPC applications. We leverage ML-based feature selection and generative models for I/O trace generation. The generative models are trained on I/O traces collected by the Darshan I/O characterization tool over a period of one year. We present a two-step generation process consisting of two deep-learning models, called the feature generator and the trace generator. The two-step combination of generative models provides robustness by reducing the bias of the model and accounting for the stochastic nature of the I/O traces across different runs of an application. We evaluate the performance of the generative models and show that the two-step model can generate time-series I/O traces with less than 20% root mean square error.
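The abstract reports trace quality as a percentage root mean square error but does not say how the error is normalized. The following is a minimal sketch of one common convention, assuming the RMSE between a real and a generated time series is divided by the value range of the real series. The function name `percent_rmse` and the sample values are hypothetical and are not taken from the paper.

```python
import math


def percent_rmse(real, generated):
    """RMSE between two equal-length series, as a percentage of the
    real series' value range (range normalization is an assumption)."""
    if len(real) != len(generated) or not real:
        raise ValueError("series must be non-empty and of equal length")
    # Mean of squared point-wise differences.
    mse = sum((r - g) ** 2 for r, g in zip(real, generated)) / len(real)
    value_range = max(real) - min(real)
    if value_range == 0:
        raise ValueError("real series is constant; range normalization is undefined")
    return 100.0 * math.sqrt(mse) / value_range


# Illustrative values only, e.g. bytes written per time bin.
real = [0.0, 10.0, 20.0, 30.0]
generated = [1.0, 9.0, 21.0, 29.0]
score = percent_rmse(real, generated)
```

Other normalizations, such as dividing by the mean of the real series, would give a different percentage for the same pair of traces, so the convention should be stated alongside any reported figure.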
- Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization: USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 1883852
- Country of Publication: United States
- Language: English
Similar Records
Characterizing Machine Learning I/O Workloads on Leadership Scale HPC Systems
Conference · Mon Nov 01 00:00:00 EDT 2021 · OSTI ID:1885376
DXT: Darshan eXtended Tracing
Conference · Tue Jan 08 23:00:00 EST 2019 · OSTI ID:1490709
Performance Evaluation of Darshan 3.0.0 on the Cray XC30
Technical Report · Fri Apr 01 00:00:00 EDT 2016 · OSTI ID:1250469