Title: Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems

Conference

Parallel I/O hardware and software infrastructure is a key contributor to performance variability for applications running on large-scale HPC systems. This variability confounds efforts to predict application performance for characterization, modeling, optimization, and job scheduling. We propose a modeling approach that improves predictive ability by explicitly treating the variability and by using the sensitivity of performance to application parameters to group applications with similar characteristics. We develop a Gaussian process-based machine learning algorithm to model I/O performance and its variability as a function of application and file system characteristics. We demonstrate the effectiveness of the proposed approach using data collected from the Edison system at the National Energy Research Scientific Computing Center. The results show that the proposed sensitivity-based models predict better than application-partitioned or unpartitioned models. We highlight modeling techniques that are robust to the outliers that can occur in production parallel file systems. Using the developed metrics and modeling approach, we provide insights into the file system metrics that have a significant impact on I/O performance.

Research Organization:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC02-06CH11357
OSTI ID:
1491024
Resource Relation:
Conference: 2018 ISC High Performance, 06/24/18 - 06/28/18, Frankfurt, DE
Country of Publication:
United States
Language:
English
