U.S. Department of Energy
Office of Scientific and Technical Information

Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems

Conference

Parallel I/O hardware and software infrastructure is a key contributor to performance variability for applications running on large-scale HPC systems. This variability confounds efforts to predict application performance for characterization, modeling, optimization, and job scheduling. We propose a modeling approach that improves predictive ability by explicitly treating the variability and by using the sensitivity of performance to application parameters to group applications with similar characteristics. We develop a Gaussian process-based machine learning algorithm to model I/O performance and its variability as a function of application and file system characteristics. We demonstrate the effectiveness of the proposed approach using data collected from the Edison system at the National Energy Research Scientific Computing Center. The results show that the proposed sensitivity-based models predict better than application-partitioned or unpartitioned models. We highlight modeling techniques that are robust to the outliers that can occur in production parallel file systems. Using the developed metrics and modeling approach, we provide insights into the file system metrics that have a significant impact on I/O performance.
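
To make the idea of modeling both I/O performance and its variability with a Gaussian process more concrete, the following is a minimal sketch in pure Python. It is not the authors' algorithm: the squared-exponential kernel, the fixed hyperparameters, the two features (a scaled process count and a scaled stripe count), and the synthetic data are all assumptions chosen for illustration. The sketch returns a predictive mean and a predictive variance, where the variance includes an assumed observation-noise term that stands in for run-to-run variability.

```python
import math

def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two feature vectors (assumed kernel)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return signal_var * math.exp(-0.5 * sq_dist / length_scale ** 2)

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def forward_solve(lower, rhs):
    """Solve lower * x = rhs for x."""
    out = [0.0] * len(rhs)
    for i in range(len(rhs)):
        s = sum(lower[i][k] * out[k] for k in range(i))
        out[i] = (rhs[i] - s) / lower[i][i]
    return out

def backward_solve(lower, rhs):
    """Solve transpose(lower) * x = rhs for x."""
    n = len(rhs)
    out = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lower[k][i] * out[k] for k in range(i + 1, n))
        out[i] = (rhs[i] - s) / lower[i][i]
    return out

def gp_fit(X, y, noise_var=0.05, length_scale=1.0):
    """Fit a zero-mean GP; noise_var models run-to-run variability (assumed constant)."""
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], length_scale) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = backward_solve(L, forward_solve(L, y))
    return {"X": X, "L": L, "alpha": alpha,
            "noise_var": noise_var, "length_scale": length_scale}

def gp_predict(model, x_new):
    """Return (predictive mean, predictive variance including observation noise)."""
    k_star = [rbf_kernel(x, x_new, model["length_scale"]) for x in model["X"]]
    mean = sum(k * a for k, a in zip(k_star, model["alpha"]))
    v = forward_solve(model["L"], k_star)
    prior_var = rbf_kernel(x_new, x_new, model["length_scale"])
    var = prior_var - sum(t * t for t in v) + model["noise_var"]
    return mean, max(var, 0.0)

# Synthetic, hypothetical data: [scaled process count, scaled stripe count]
# mapped to centered, normalized I/O bandwidth.
X_train = [[0.0, 0.0], [0.5, 0.2], [1.0, 0.4], [1.5, 0.6], [2.0, 1.0]]
y_train = [-0.5, -0.2, 0.1, 0.3, 0.3]

model = gp_fit(X_train, y_train)
mean, var = gp_predict(model, [1.2, 0.5])
print(f"predicted bandwidth (centered): {mean:.3f}, std: {math.sqrt(var):.3f}")
```

In this sketch the predictive variance grows toward the prior variance plus the noise term as a query point moves away from the training data, which is one simple way a Gaussian process expresses uncertainty alongside a performance estimate. A production model would typically fit the kernel hyperparameters to data and might let the noise term depend on the inputs.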

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC02-06CH11357
OSTI ID:
1491024
Resource Relation:
Conference: 2018 ISC High Performance, 06/24/18 - 06/28/18, Frankfurt, DE
Country of Publication:
United States
Language:
English
