BBOS: Efficient HPC Storage Management via Burst Buffer Over-Subscription

Sung, H; Bang, J; Kim, C; Kim, HS; Sim, A; Lockwood, GK; Eom, H

doi:10.1109/CCGrid49817.2020.00-79

Conference · Fri May 01 04:00:00 EDT 2020

DOI: https://doi.org/10.1109/CCGrid49817.2020.00-79 · OSTI ID: 1827659

Dedicated burst buffer (BB) allocation is generally preferred because it avoids accesses to the parallel file system (PFS), even though it leaves the BB severely underutilized. Recently, new all-flash HPC storage systems that integrate the BB and the PFS have been proposed, which speed up access to the PFS. For this reason, we adopt a BB over-subscription allocation method that improves BB utilization by allowing HPC applications to use the BB only during their I/O phases. Unfortunately, BB over-subscription aggravates I/O interference and the overhead of demoting data from the BB to the PFS, resulting in degraded performance. To minimize this degradation, we develop an I/O scheduler that prevents I/O congestion and a new transparent data management system based on the checkpoint/restart characteristics of HPC applications. The proposed approach improves BB utilization while maintaining high application performance. In our experiments, BB utilization improves by at least 2.2x, and checkpoint performance is higher and more stable than with other approaches. In addition, we achieve a hit ratio of up to 96.4% for restart requests on the BB and up to 3.1x higher restart performance than other approaches.

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)

DOE Contract Number: AC02-05CH11231

OSTI ID: 1827659

Country of Publication: United States

Language: English

Similar Records

McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

Journal Article · Mon Dec 31 19:00:00 EST 2012 · Scientific Programming · OSTI ID:1197891

Orchestrating Fault Prediction with Live Migration and Checkpointing

Conference · Mon Jun 01 00:00:00 EDT 2020 · OSTI ID:1648858
