Title: Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems

Abstract

Parallel I/O hardware and software infrastructure is a key contributor to performance variability for applications running on large-scale HPC systems. This variability confounds efforts to predict application performance for characterization, modeling, optimization, and job scheduling. We propose a modeling approach that improves predictive ability by explicitly treating the variability and by leveraging the sensitivity of performance to application parameters to group applications with similar characteristics. We develop a Gaussian process-based machine learning algorithm to model I/O performance and its variability as a function of application and file system characteristics. We demonstrate the effectiveness of the proposed approach using data collected from the Edison system at the National Energy Research Scientific Computing Center. The results show that the proposed sensitivity-based models predict better than application-partitioned or unpartitioned models. We highlight modeling techniques that are robust to the outliers that can occur in production parallel file systems. Using the developed metrics and modeling approach, we provide insights into the file system metrics that have a significant impact on I/O performance.
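
The idea of modeling both I/O performance and its variability with a Gaussian process can be illustrated with a minimal sketch. This is not the authors' implementation: it is a plain zero-mean Gaussian process regression with a squared-exponential kernel and Gaussian noise (not the robust variant listed under the record's subject terms), written in pure Python. The function names, the hyperparameters `length`, `signal`, and `noise`, the feature meanings, and the toy numbers are all assumptions made for illustration.

```python
import math

def rbf(x, y, length=1.0, signal=1.0):
    """Squared-exponential kernel between two feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return signal * math.exp(-0.5 * sq / length ** 2)

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def solve_lower(low, b):
    """Solve low @ z = b by forward substitution."""
    z = [0.0] * len(b)
    for i in range(len(b)):
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    return z

def solve_upper_t(low, z):
    """Solve low.T @ x = z by back substitution."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def gp_predict(X, y, x_new, noise=0.1, length=1.0, signal=1.0):
    """Return the predictive mean and variance at x_new.

    `noise` is the observation-noise variance; the returned variance
    includes it, so it describes the spread of a new noisy observation.
    """
    n = len(X)
    K = [[rbf(X[i], X[j], length, signal) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    low = cholesky(K)
    alpha = solve_upper_t(low, solve_lower(low, y))
    k_star = [rbf(x, x_new, length, signal) for x in X]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(low, k_star)
    var = signal + noise - sum(t * t for t in v)
    return mean, var

# Synthetic toy data (made up for illustration, not from the paper).
# Hypothetical features: [scaled transfer size, scaled file system load].
X_train = [[0.0, 0.2], [1.0, 0.5], [2.0, 0.9]]
y_train = [-0.8, 0.1, 0.7]  # hypothetical centered throughput values
mean, var = gp_predict(X_train, y_train, [1.5, 0.7])
print("predicted mean:", mean, "predictive variance:", var)
```

The predictive variance grows as the query point moves away from the training runs, which is the property that lets a model of this kind report uncertainty alongside a performance estimate.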

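Authors:
Madireddy, Sandeep; Balaprakash, Prasanna; Carns, Philip; Latham, Robert; Ross, Robert; Snyder, Shane; Wild, Stefan M.
Publication Date:
2018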
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science - Office of Advanced Scientific Computing Research
OSTI Identifier:
1491024
DOE Contract Number:  
AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Conference: 2018 ISC High Performance, 06/24/18 - 06/28/18, Frankfurt, DE
Country of Publication:
United States
Language:
English
Subject:
I/O performance variability; machine learning; parallel file systems; robust Gaussian process regression

Citation Formats

Madireddy, Sandeep, Balaprakash, Prasanna, Carns, Philip, Latham, Robert, Ross, Robert, Snyder, Shane, and Wild, Stefan M. Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems. United States: N. p., 2018. Web. doi:10.1007/978-3-319-92040-5_10.
Madireddy, Sandeep, Balaprakash, Prasanna, Carns, Philip, Latham, Robert, Ross, Robert, Snyder, Shane, & Wild, Stefan M. Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems. United States. doi:10.1007/978-3-319-92040-5_10.
Madireddy, Sandeep, Balaprakash, Prasanna, Carns, Philip, Latham, Robert, Ross, Robert, Snyder, Shane, and Wild, Stefan M. 2018. "Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems". United States. doi:10.1007/978-3-319-92040-5_10.
@article{osti_1491024,
title = {Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems},
author = {Madireddy, Sandeep and Balaprakash, Prasanna and Carns, Philip and Latham, Robert and Ross, Robert and Snyder, Shane and Wild, Stefan M.},
abstractNote = {Parallel I/O hardware and software infrastructure is a key contributor to performance variability for applications running on large-scale HPC systems. This variability confounds efforts to predict application performance for characterization, modeling, optimization, and job scheduling. We propose a modeling approach that improves predictive ability by explicitly treating the variability and by leveraging the sensitivity of application parameters on performance to group applications with similar characteristics. We develop a Gaussian process-based machine learning algorithm to model I/O performance and its variability as a function of application and file system characteristics. We demonstrate the effectiveness of the proposed approach using data collected from the Edison system at the National Energy Research Scientific Computing Center. The results show that the proposed sensitivity-based models are better at prediction when compared with application-partitioned or unpartitioned models. We highlight modeling techniques that are robust to the outliers that can occur in production parallel file systems. Using the developed metrics and modeling approach, we provide insights into the file system metrics that have a significant impact on I/O performance.},
doi = {10.1007/978-3-319-92040-5_10},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2018},
month = {1}
}
