Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems

Xie, Bing; Tan, Zilong; Carns, Philip; Chase, Jeffrey; Harms, Kevin; Lofstead, Gerald; Oral, Sarp; Vazhkudai, Sudharshan; Wang, Feiyi

doi:10.1109/PDSW49588.2019.00008

Conference · Fri Nov 01 04:00:00 EDT 2019

DOI: https://doi.org/10.1109/PDSW49588.2019.00008 · OSTI ID: 1606822

Xie, Bing ^[1]; Tan, Zilong ^[2]; Carns, Philip ^[3]; Chase, Jeffrey ^[4]; Harms, Kevin ^[3]; Lofstead, Gerald ^[5]; Oral, Sarp ^[1]; Vazhkudai, Sudharshan ^[1]; Wang, Feiyi ^[1]

[1] Oak Ridge National Laboratory (ORNL)
[2] Carnegie Mellon University (CMU)
[3] Argonne National Laboratory (ANL)
[4] Duke University
[5] Sandia National Laboratories (SNL)

In high-performance computing (HPC), I/O performance prediction offers the potential to improve the efficiency of scientific computing. In particular, accurate prediction can make runtime estimates more precise, guide users toward optimal checkpoint strategies, and better inform facility provisioning and scheduling policies. HPC I/O performance is notoriously difficult to predict and model, however, in large part because of inherent variability and a lack of transparency in the behaviors of constituent storage system components. In this work we seek to advance the state of the art in HPC I/O performance prediction by (1) modeling the mean performance to address high variability, (2) deriving model features from write patterns, system architecture, and system configurations, and (3) employing a Lasso regression model to improve model accuracy. We demonstrate the efficacy of our approach by applying it to a crucial subset of common HPC I/O motifs, namely, file-per-process checkpoint write workloads. To train and evaluate our models, we conduct experiments on two distinct production HPC platforms: Titan at the Oak Ridge Leadership Computing Facility and Cetus at the Argonne Leadership Computing Facility. We find that we can attain ≤ 30% relative error for 92.79% of the samples in our test set on Titan and 99.64% on Cetus.
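The abstract names two ingredients that a small example can make concrete: a Lasso regression fitted to performance samples, and the fraction of test samples predicted within 30% relative error. The sketch below is illustrative only and is not the authors' implementation. The data are synthetic, the four standardized features and their coefficients are hypothetical stand-ins for the write-pattern and system features the record mentions, and the penalty strength `lam` is an arbitrary choice. Lasso is fitted here by plain coordinate descent with soft-thresholding.

```python
import random


def soft_threshold(rho, lam):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def predict(b0, w, x):
    """Linear prediction: intercept plus weighted sum of features."""
    return b0 + sum(wk * xk for wk, xk in zip(w, x))


def fit_lasso(X, y, lam, n_iters=100):
    """Minimise (1/2n)*||y - b0 - Xw||^2 + lam*||w||_1 by coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b0 = sum(y) / n
    for _ in range(n_iters):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for xi, yi in zip(X, y):
                partial = predict(b0, w, xi) - w[j] * xi[j]
                rho += xi[j] * (yi - partial)
                z += xi[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
        # The intercept is not penalised: set it to the mean residual.
        b0 = sum(yi - predict(0.0, w, xi) for xi, yi in zip(X, y)) / n
    return b0, w


def make_samples(n, rng):
    """Synthetic data (assumption): 4 standardized features, 2 of them irrelevant."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(4)]
        # Hypothetical "mean write time": depends only on features 0 and 1.
        target = 20.0 + 3.0 * x[0] + 2.0 * x[1] + rng.gauss(0, 0.2)
        X.append(x)
        y.append(target)
    return X, y


rng = random.Random(0)
X_train, y_train = make_samples(200, rng)
X_test, y_test = make_samples(100, rng)

b0, w = fit_lasso(X_train, y_train, lam=0.2)

# Relative error per test sample, and the share within the 30% threshold.
rel_errors = [abs(predict(b0, w, x) - t) / abs(t) for x, t in zip(X_test, y_test)]
frac_within = sum(e <= 0.30 for e in rel_errors) / len(rel_errors)

print("intercept:", round(b0, 3))
print("weights:", [round(v, 3) for v in w])
print("fraction of test samples with <= 30% relative error:", frac_within)
```

The L1 penalty tends to drive the weights of uninformative features to zero, which is one common reason to choose Lasso when many candidate features are derived from workload and system descriptions. The accuracy figures quoted in the abstract come from the authors' production measurements and cannot be reproduced with this toy data.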

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1606822

Country of Publication: United States

Language: English

Similar Records

Understanding I/O workload characteristics of a Peta-scale storage system

Journal Article · Mon Nov 10 19:00:00 EST 2014 · Journal of Supercomputing · OSTI ID:1185800

Workload Characterization of a Leadership Class Storage Cluster

Conference · Thu Dec 31 23:00:00 EST 2009 · OSTI ID:993463

UnifyFS: A User-level Shared File System for Unified Access to Distributed Local Storage

Conference · Mon May 01 00:00:00 EDT 2023 · OSTI ID:1995690
