User-level File Systems Specialized for HPC Workloads Year-End Report
- Florida State Univ., Tallahassee, FL (United States)
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
High-performance computing (HPC) clusters attract Deep Learning (DL) training users because of the clusters' powerful computation capabilities. While many existing efforts enable deep neural networks to leverage the powerful CPU and GPU processors of leadership HPC systems, large-scale deep learning with larger datasets requires efficient I/O support from the underlying file and storage systems. Because some current and upcoming HPC clusters have large on-node memory or are equipped with NVMe SSDs on compute nodes, more distributed DL training jobs are expected to leverage this node-local storage to hold datasets for efficient access. Our project goal is to design a specialized DL-oriented file system that improves dataset loading performance for any DL training application. Over the past year, the team at Florida State University has performed research activities in three areas: 1) completing a specialized memory-based I/O framework (DeepIO) for improving dataset loading performance of DL applications; 2) proposing a more generalized file system (DLFS) for DL applications on node-local SSDs; and 3) completing the BeeGFS performance evaluation project for Deep Neural Networks.
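The core idea above, staging dataset samples from a shared parallel file system into node-local memory or NVMe SSD so that later epochs read locally, can be sketched as follows. This is a minimal illustrative sketch only; the class and method names are hypothetical and do not reflect the actual DeepIO or DLFS interfaces, and an in-memory dict stands in for node-local storage.

```python
import os
import tempfile

class NodeLocalCache:
    """Hypothetical sketch of node-local dataset staging for DL training.

    On first access, a sample is read from the shared (parallel) file
    system path and cached; subsequent epochs are served from the
    node-local copy. An in-memory dict stands in for on-node memory
    or an NVMe SSD. This is NOT the DeepIO/DLFS API.
    """

    def __init__(self, source_dir):
        self.source_dir = source_dir  # shared file system location
        self._cache = {}              # stands in for node-local storage

    def get(self, name):
        # Cold access: read through from the shared file system and stage.
        if name not in self._cache:
            with open(os.path.join(self.source_dir, name), "rb") as f:
                self._cache[name] = f.read()
        # Warm access (later epochs): served from node-local storage.
        return self._cache[name]

if __name__ == "__main__":
    # Simulate a shared dataset directory with one sample file.
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "sample0.bin"), "wb") as f:
            f.write(b"\x01\x02\x03")
        cache = NodeLocalCache(d)
        first = cache.get("sample0.bin")   # cold read, stages locally
        second = cache.get("sample0.bin")  # warm read, node-local
        assert first == second == b"\x01\x02\x03"
```

The benefit in practice comes from repeated epochs: after the first pass over the dataset, every read hits node-local storage instead of the contended parallel file system.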
- Research Organization:
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- DOE Contract Number:
- AC52-07NA27344
- OSTI ID:
- 1544466
- Report Number(s):
- LLNL-SR-764802; 954764
- Country of Publication:
- United States
- Language:
- English