BBOS: Efficient HPC Storage Management via Burst Buffer Over-Subscription
To avoid accessing the parallel file system (PFS), dedicated burst buffer (BB) allocation is usually preferred, even though it leaves the BB severely underutilized. Recently proposed all-flash HPC storage systems integrate the BB and the PFS, which speeds up access to the PFS. We therefore adopt a BB over-subscription allocation method that lets HPC applications use the BB only during their I/O phases, improving BB utilization. Unfortunately, BB over-subscription aggravates I/O interference and the overhead of demoting data from the BB to the PFS, which degrades performance. To minimize this degradation, we develop an I/O scheduler that prevents I/O congestion and a new transparent data management system based on the checkpoint/restart characteristics of HPC applications. The proposed approach improves BB utilization while preserving high application performance. In our experiments, BB utilization improves by at least 2.2x, and checkpoint performance is higher and more stable than with other approaches. In addition, we achieve up to a 96.4% hit ratio for restart requests on the BB and up to 3.1x higher restart performance than the other approaches.
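The idea of holding BB space only during an I/O phase, with a scheduler that defers requests rather than letting them congest the buffer, can be illustrated with a minimal Python sketch. This is not the BBOS implementation: the class `OverSubscribedBB`, the `Request` type, the FIFO admission policy, and the gigabyte sizes are all hypothetical choices made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    job: str       # hypothetical job identifier
    size_gb: int   # BB space needed for this I/O phase


class OverSubscribedBB:
    """Toy burst buffer whose space is held only during a job's I/O phase."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.in_use = {}        # job -> GB currently held
        self.waiting = deque()  # deferred requests, FIFO order (an assumption)

    def free_gb(self):
        return self.capacity_gb - sum(self.in_use.values())

    def begin_io(self, req):
        """Admit the request if it fits and nobody is queued; otherwise defer it."""
        if not self.waiting and req.size_gb <= self.free_gb():
            self.in_use[req.job] = req.size_gb
            return True
        self.waiting.append(req)
        return False

    def end_io(self, job):
        """Release the job's space, then admit deferred requests that now fit."""
        self.in_use.pop(job, None)
        admitted = []
        while self.waiting and self.waiting[0].size_gb <= self.free_gb():
            req = self.waiting.popleft()
            self.in_use[req.job] = req.size_gb
            admitted.append(req.job)
        return admitted
```

With a 100 GB buffer, a 60 GB request from job A is admitted, a second 60 GB request from job B is deferred, and B is admitted as soon as A calls `end_io`. A real scheduler would also need to handle demotion to the PFS and requests larger than the whole buffer, which this sketch leaves out.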
- Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
- DOE Contract Number: AC02-05CH11231
- OSTI ID: 1827659
- Country of Publication: United States
- Language: English
Similar Records
McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
Journal Article · Mon Dec 31 19:00:00 EST 2012 · Scientific Programming · OSTI ID:1197891
Orchestrating Fault Prediction with Live Migration and Checkpointing
Conference · Mon Jun 01 00:00:00 EDT 2020 · OSTI ID:1648858