
Title: Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems

Abstract

In high-performance computing (HPC), I/O performance prediction offers the potential to improve the efficiency of scientific computing. In particular, accurate prediction can make runtime estimates more precise, guide users toward optimal checkpoint strategies, and better inform facility provisioning and scheduling policies. HPC I/O performance is notoriously difficult to predict and model, however, in large part because of inherent variability and a lack of transparency in the behaviors of constituent storage system components. In this work we seek to advance the state of the art in HPC I/O performance prediction by (1) modeling the mean performance to address high variability, (2) deriving model features from write patterns, system architecture, and system configurations, and (3) employing a Lasso regression model to improve model accuracy. We demonstrate the efficacy of our approach by applying it to a crucial subset of common HPC I/O motifs, namely, file-per-process checkpoint write workloads. We conduct experiments on two distinct production HPC platforms, Titan at the Oak Ridge Leadership Computing Facility and Cetus at the Argonne Leadership Computing Facility, to train and evaluate our models. We find that we can attain ≤ 30% relative error for 92.79% and 99.64% of the samples in our test set on these platforms, respectively.
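
The accuracy figure quoted in the abstract (the share of test samples predicted within 30% relative error of the mean performance) can be illustrated with a short sketch. This is not the authors' code: the relative-error definition used here (absolute difference divided by the observed mean), the function names, and all data values are assumptions made for illustration only.

```python
from statistics import mean


def relative_error(observed, predicted):
    """Absolute relative error of one prediction (assumed definition).

    `observed` must be nonzero.
    """
    return abs(predicted - observed) / abs(observed)


def fraction_within(observed, predicted, threshold=0.30):
    """Share of samples whose relative error is at most `threshold`."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        relative_error(o, p) <= threshold for o, p in zip(observed, predicted)
    )
    return hits / len(observed)


# Hypothetical repeated write-bandwidth measurements (GB/s) per workload.
# Averaging the repeats gives the mean performance that serves as the target.
runs = {
    "w1": [10.0, 12.0, 8.0],
    "w2": [20.0, 22.0, 18.0],
    "w3": [5.0, 5.0, 5.0],
    "w4": [40.0, 44.0, 36.0],
}
mean_observed = [mean(values) for values in runs.values()]

# Hypothetical model predictions for the same four workloads.
predicted = [11.0, 27.0, 5.5, 39.0]

share = fraction_within(mean_observed, predicted)
print(f"{share:.2%} of samples within 30% relative error")
```

In this made-up data set, three of the four predictions fall within the threshold; the second workload misses it with a relative error of 35%.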

Authors:
Xie, Bing [1]; Tan, Zilong [2]; Carns, Philip [3]; Chase, Jeffrey [4]; Harms, Kevin [3]; Lofstead, Gerald [5]; Oral, Sarp [1]; Vazhkudai, Sudharshan [1]; Wang, Feiyi [1]
  1. ORNL
  2. Carnegie Mellon University (CMU)
  3. Argonne National Laboratory (ANL)
  4. Duke University
  5. Sandia National Laboratories (SNL)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1606822
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 4th International Parallel Data Systems Workshop (PDSW 2019) - Denver, Colorado, United States of America - 11/18/2019
Country of Publication:
United States
Language:
English

Citation Formats

Xie, Bing, Tan, Zilong, Carns, Philip, Chase, Jeffrey, Harms, Kevin, Lofstead, Gerald, Oral, Sarp, Vazhkudai, Sudharshan, and Wang, Feiyi. Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems. United States: N. p., 2019. Web. doi:10.1109/PDSW49588.2019.00008.
Xie, Bing, Tan, Zilong, Carns, Philip, Chase, Jeffrey, Harms, Kevin, Lofstead, Gerald, Oral, Sarp, Vazhkudai, Sudharshan, & Wang, Feiyi. Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems. United States. doi:10.1109/PDSW49588.2019.00008.
Xie, Bing, Tan, Zilong, Carns, Philip, Chase, Jeffrey, Harms, Kevin, Lofstead, Gerald, Oral, Sarp, Vazhkudai, Sudharshan, and Wang, Feiyi. Fri . "Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems". United States. doi:10.1109/PDSW49588.2019.00008. https://www.osti.gov/servlets/purl/1606822.
@inproceedings{osti_1606822,
title = {Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems},
author = {Xie, Bing and Tan, Zilong and Carns, Philip and Chase, Jeffrey and Harms, Kevin and Lofstead, Gerald and Oral, Sarp and Vazhkudai, Sudharshan and Wang, Feiyi},
abstractNote = {In high-performance computing (HPC), I/O performance prediction offers the potential to improve the efficiency of scientific computing. In particular, accurate prediction can make runtime estimates more precise, guide users toward optimal checkpoint strategies, and better inform facility provisioning and scheduling policies. HPC I/O performance is notoriously difficult to predict and model, however, in large part because of inherent variability and a lack of transparency in the behaviors of constituent storage system components. In this work we seek to advance the state of the art in HPC I/O performance prediction by (1) modeling the mean performance to address high variability, (2) deriving model features from write patterns, system architecture, and system configurations, and (3) employing a Lasso regression model to improve model accuracy. We demonstrate the efficacy of our approach by applying it to a crucial subset of common HPC I/O motifs, namely, file-per-process checkpoint write workloads. We conduct experiments on two distinct production HPC platforms, Titan at the Oak Ridge Leadership Computing Facility and Cetus at the Argonne Leadership Computing Facility, to train and evaluate our models. We find that we can attain ≤ 30% relative error for 92.79% and 99.64% of the samples in our test set on these platforms, respectively.},
doi = {10.1109/PDSW49588.2019.00008},
booktitle = {4th International Parallel Data Systems Workshop (PDSW 2019)},
place = {United States},
year = {2019},
month = {11}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
