U.S. Department of Energy
Office of Scientific and Technical Information

Characterizing Machine Learning I/O Workloads on Leadership Scale HPC Systems

Conference

High performance computing (HPC) is no longer limited to traditional workloads such as simulation and modeling. As machine learning (ML) and deep learning (DL) technologies grow in popularity, an increasing number of HPC users are incorporating ML methods into their workflows and scientific discovery processes across a wide spectrum of science domains, such as biology, earth science, and physics. This gives rise to a more diverse set of I/O patterns than the traditional checkpoint/restart-based HPC I/O behavior. The I/O characteristics of such ML workloads have not been studied extensively on large-scale leadership HPC systems. This paper aims to fill that gap with an in-depth analysis of the I/O behavior of ML workloads using Darshan, an I/O characterization tool designed for lightweight tracing and profiling. We study the Darshan logs of more than 23,000 HPC ML I/O jobs collected over one year on Summit, the second-fastest supercomputer in the world. The paper provides a systematic I/O characterization of ML I/O jobs running on a leadership-scale supercomputer to understand how I/O behavior differs across science domains and workload scales, and analyzes how ML I/O workloads use the parallel file system and burst buffer.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1885376
Country of Publication:
United States
Language:
English

