Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems
Parallel I/O hardware and software infrastructure is a key contributor to performance variability for applications running on large-scale HPC systems. This variability confounds efforts to predict application performance for characterization, modeling, optimization, and job scheduling. We propose a modeling approach that improves predictive ability by explicitly treating the variability and by leveraging the sensitivity of performance to application parameters to group applications with similar characteristics. We develop a Gaussian process-based machine learning algorithm to model I/O performance and its variability as a function of application and file system characteristics. We demonstrate the effectiveness of the proposed approach using data collected from the Edison system at the National Energy Research Scientific Computing Center. The results show that the proposed sensitivity-based models are better at prediction when compared with application-partitioned or unpartitioned models. We highlight modeling techniques that are robust to the outliers that can occur in production parallel file systems. Using the developed metrics and modeling approach, we provide insights into the file system metrics that have a significant impact on I/O performance.
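The key property of the Gaussian process approach described above is that it yields both a predicted value and an uncertainty estimate, which is what lets the model treat performance variability explicitly. The following is a minimal, self-contained sketch of GP regression with an RBF kernel on a toy 1-D input; the function names and the single-feature setup are illustrative assumptions, not the paper's actual feature set or implementation.

```python
import math

def rbf(x, y, ls=1.0):
    # Squared-exponential (RBF) kernel: similarity decays with distance.
    return math.exp(-0.5 * ((x - y) / ls) ** 2)

def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_predict(xs, ys, x_star, noise=1e-6, ls=1.0):
    # Predictive mean and variance at x_star given training data (xs, ys).
    n = len(xs)
    K = [[rbf(xs[i], xs[j], ls) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, ys)                       # K^-1 y
    k_star = [rbf(x, x_star, ls) for x in xs]  # covariances to the query point
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)                       # K^-1 k_star
    var = rbf(x_star, x_star, ls) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var
```

Near observed configurations the predictive variance is small; far from the data it reverts toward the prior variance, flagging low confidence. In the I/O-modeling setting, that variance is what makes the variability itself a modeled quantity rather than noise to be averaged away.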
- Research Organization: Argonne National Lab. (ANL), Argonne, IL (United States)
- Sponsoring Organization: USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR)
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1491024
- Resource Relation: Conference: 2018 ISC High Performance, 06/24/18 - 06/28/18, Frankfurt, DE
- Country of Publication: United States
- Language: English