U.S. Department of Energy
Office of Scientific and Technical Information

Machine Learning Assisted HPC Workload Trace Generation for Leadership Scale Storage Systems

Conference
Monitoring and analyzing a wide range of I/O activities in an HPC cluster is important for maintaining mission-critical performance in a large-scale, multi-user, parallel storage system. Center-wide I/O traces can provide both high-level information and fine-grained activity per application or per user running in the system. Studying such large-scale traces can provide helpful insights into the system, and can be used to develop predictive methods for making decisions, adjusting scheduling policies, or guiding the design of next-generation systems. However, sharing real-world I/O traces to expedite such research efforts raises two concerns: i) sharing the traces is expensive because of their large size, and ii) the traces raise privacy concerns. We address these issues by building an end-to-end machine learning (ML) workflow that can generate I/O traces for large-scale HPC applications. We leverage ML-based feature selection and generative models for I/O trace generation. The generative models are trained on I/O traces collected by the Darshan I/O characterization tool over a period of one year. We present a two-step generation process consisting of two deep-learning models, called the feature generator and the trace generator. The combination of the two generative models provides robustness by reducing the bias of the model and accounting for the stochastic nature of the I/O traces across different runs of an application. We evaluate the performance of the generative models and show that the two-step model can generate time-series I/O traces with less than 20% root mean square error.
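The abstract reports the error of generated time-series traces as a percentage root mean square error but does not say how that percentage is normalized. As a rough illustration only, the sketch below assumes the RMSE is divided by the value range of the real trace; the function name `percent_rmse` and the sample values are hypothetical and are not taken from the paper.

```python
import math


def percent_rmse(real, generated):
    """RMSE between two equal-length traces, as a percentage of the real trace's range.

    Assumption: range normalization. The source does not state which
    normalization (range, mean, or other) was used.
    """
    if not real or len(real) != len(generated):
        raise ValueError("traces must be non-empty and of equal length")
    # Mean of squared point-wise differences.
    mse = sum((r - g) ** 2 for r, g in zip(real, generated)) / len(real)
    span = max(real) - min(real)
    if span == 0:
        raise ValueError("real trace is constant; range normalization is undefined")
    return 100.0 * math.sqrt(mse) / span


# Hypothetical per-interval I/O volumes (arbitrary units), not real Darshan data.
real = [0.0, 10.0, 20.0, 10.0]
generated = [2.0, 8.0, 18.0, 12.0]
score = percent_rmse(real, generated)
print(f"{score:.1f}% RMSE")
```

With these made-up values every point is off by 2 units against a range of 20, so the score comes out to 10%. A different normalization, such as dividing by the mean of the real trace, would give a different percentage for the same traces.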
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1883852
Country of Publication:
United States
Language:
English


