OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Design and Implementation of Burst Buffer Over-Subscription Scheme for HPC Storage Systems

Journal Article · IEEE Access
[1]; [2]; [2]; [1]; [3]
  1. Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
  2. Lawrence Berkeley National Laboratory, Computational Research Division, Berkeley, CA, USA
  3. Department of Game Design and Development, Sangmyung University, Seoul, South Korea

Burst Buffers are widely used in supercomputing centers to bridge the performance gap between computational power and high-performance I/O systems. The primary role of a Burst Buffer is to temporarily absorb bursty I/O and reduce heavy access to the Parallel File System (PFS). However, job resource managers on High-Performance Computing (HPC) systems typically use a dedicated Burst Buffer allocation approach, which leaves the Burst Buffer resource severely underutilized. To improve the efficiency of this expensive resource, we analyze the I/O patterns on Burst Buffer in depth. We propose a Burst Buffer over-subscription (BBOS) allocation method, which improves Burst Buffer utilization by allowing each job to access the Burst Buffer only during its I/O phases, so that jobs can overlap one another. Furthermore, we develop a new I/O congestion-aware scheduler and a transparent data management system between Burst Buffer and PFS. Our approach also reduces the memory overhead and improves the data persistence of the data management system by adopting persistent memory. With the proposed approach, not only is Burst Buffer utilization improved, but HPC applications also achieve high I/O performance by exploiting the powerful Burst Buffer hardware. Experimental results show that BBOS can improve Burst Buffer utilization by up to 120% while guaranteeing more stable and higher checkpoint performance, even under high I/O loads, than other state-of-the-art schedulers. In addition, our approach can improve the hit ratio of restart requests by up to 96.4% and provides up to 210% higher restart throughput on Burst Buffer.

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Research Foundation of Korea (NRF); Korean Government (MSIT)
Grant/Contract Number:
AC02-05CH11231; NRF-2016M3C4A7952587; 4199990214639; NRF-2021R1F1A1063438
OSTI ID:
1908919
Alternate ID(s):
OSTI ID: 1908920; OSTI ID: 1983925
Journal Information:
IEEE Access, Vol. 11; ISSN 2169-3536
Publisher:
Institute of Electrical and Electronics Engineers
Country of Publication:
United States
Language:
English

Similar Records

BBOS: Efficient HPC Storage Management via Burst Buffer Over-Subscription
Conference · May 1, 2020 · OSTI ID:1908919

An empirical study of I/O separation for burst buffers in HPC systems
Journal Article · November 1, 2020 · Journal of Parallel and Distributed Computing · OSTI ID:1908919

SPARC: Demonstrate burst-buffer-based checkpoint/restart on ATS-1.
Technical Report · January 1, 2018 · OSTI ID:1908919