Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems

Wan, Lipeng; Cao, Qing; Wang, Feiyi; Oral, Sarp

doi:10.1016/j.jpdc.2016.10.002

Journal Article · Wed Feb 01 00:00:00 EST 2017 · Journal of Parallel and Distributed Computing

DOI: https://doi.org/10.1016/j.jpdc.2016.10.002 · OSTI ID: 1565566

Non-volatile devices, such as SSDs, will be an integral part of the deepening storage hierarchy on large-scale HPC systems. These devices can reside on the compute nodes as part of a distributed burst buffer service, or they can be external. Wherever they sit in the hierarchy, one critical design issue is SSD endurance under write-heavy workloads, such as checkpoint I/O for scientific applications. For these environments, it is widely assumed that checkpoint operations can occur once every 60 min and that each checkpoint step can write out as much as half of the system memory. Unfortunately, for large-scale HPC applications, the burst buffer SSDs can wear out much more quickly, given the large amount of data written at every checkpoint step. One possible solution is to control the amount of data written by reducing the checkpoint frequency. However, a direct effect of reduced checkpoint frequency is a longer window of vulnerability to system failures, and therefore potentially wasted computation time, especially for large-scale compute jobs.
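
The trade-off described in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's placement algorithm; it only estimates how long an SSD's rated write endurance would last under periodic checkpointing. The function name and every numeric value (memory size, endurance rating) are hypothetical, and write amplification is ignored.

```python
def ssd_lifetime_years(memory_bytes, checkpoint_fraction,
                       interval_minutes, rated_endurance_bytes):
    """Estimate years until the rated write endurance is consumed.

    Assumes checkpoints run continuously at a fixed interval and that
    each one writes checkpoint_fraction of memory to a single SSD.
    Write amplification is ignored (a simplifying assumption).
    """
    bytes_per_checkpoint = memory_bytes * checkpoint_fraction
    checkpoints_per_year = 365 * 24 * 60 / interval_minutes
    bytes_per_year = bytes_per_checkpoint * checkpoints_per_year
    return rated_endurance_bytes / bytes_per_year


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    memory = 100e9       # 100 GB of node memory
    endurance = 1e15     # 1 PB of rated total bytes written
    for interval in (60, 120):
        years = ssd_lifetime_years(memory, 0.5, interval, endurance)
        print(f"checkpoint every {interval} min: {years:.2f} years")
```

Under these assumptions, doubling the checkpoint interval doubles the estimated SSD lifetime, but it also doubles the amount of computation that can be lost to a failure between checkpoints, which is the tension the abstract points out.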

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); UT-Battelle LLC/ORNL, Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1565566

Journal Information: Journal of Parallel and Distributed Computing, Vol. 100, Issue C; ISSN 0743-7315

Publisher: Elsevier

Country of Publication: United States

Language: English

Similar Records

Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems

Journal Article · Fri Oct 14 00:00:00 EDT 2016 · Journal of Parallel and Distributed Computing · OSTI ID:1565566

Wan, Lipeng; Cao, Qing; Wang, Feiyi; +1 more

SPARC: Demonstrate burst-buffer-based checkpoint/restart on ATS-1.

Technical Report · Mon Jan 01 00:00:00 EST 2018 · OSTI ID:1565566

Oldfield, Ron A.; Ulmer, Craig D.; Widener, Patrick; +1 more

An empirical study of I/O separation for burst buffers in HPC systems

Journal Article · Sun Nov 01 00:00:00 EDT 2020 · Journal of Parallel and Distributed Computing · OSTI ID:1565566

Koo, Donghun; Lee, Jaehwan; Liu, Jialin; +7 more

Related Subjects

Computer Science
