DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Checkpointing Strategies for Shared High-Performance Computing Platforms

Abstract

Input/output (I/O) from various sources often contends for scarce bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments. However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale. We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Altogether, these results provide critical analysis and direct guidance on how to design efficient, CR-ready, large-scale platforms without a large investment in the I/O subsystem.
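
For readers unfamiliar with the Young/Daly interval mentioned in the abstract, the following is a minimal sketch of the classical first-order formula for a single, undisturbed application. It is not the contention-aware model of the paper. The function names and the parameters `checkpoint_cost` (time to write one checkpoint) and `mtbf` (mean time between failures) are illustrative assumptions, and the waste expression is the usual first-order approximation, valid only when the checkpoint cost is small compared to the MTBF.

```python
import math


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order Young/Daly checkpointing period: sqrt(2 * C * mu).

    checkpoint_cost (C) and mtbf (mu) must use the same time unit.
    Hypothetical helper for illustration only.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time wasted for a given period T.

    Uses the first-order estimate C / T + T / (2 * mu): the first term is
    checkpoint overhead, the second is expected re-execution after a failure.
    Recovery and downtime costs are ignored in this sketch.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    # Assumed example values: 2 time units per checkpoint, MTBF of 100 units.
    c, mu = 2.0, 100.0
    t_opt = young_daly_period(c, mu)
    print(f"period = {t_opt:.2f}, waste = {first_order_waste(t_opt, c, mu):.3f}")
```

Under these assumptions the waste is minimized at the Young/Daly period; the abstract's point is that this per-application optimum no longer suffices once several applications compete for the same I/O bandwidth.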

Authors:
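Herault, Thomas [1]; Robert, Yves [2]; Bouteiller, Aurelien [1]; Arnold, Dorian [3]; Ferreira, Kurt Brian [4]; George, George [1]; Dongarra, Jack [5]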
  1. Univ. of Tennessee, Knoxville, TN (United States)
  2. ENS Lyon (France); Univ. of Tennessee, Knoxville, TN (United States)
  3. Emory Univ., Atlanta, GA (United States)
  4. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  5. Univ. of Manchester (United Kingdom); Univ. of Tennessee, Knoxville, TN (United States)
Publication Date:
2019-01
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1492861
Report Number(s):
SAND-2018-12751J
Journal ID: ISSN 2185-2839; 669702
Grant/Contract Number:  
AC04-94AL85000
Resource Type:
Accepted Manuscript
Journal Name:
International Journal of Networking and Computing
Additional Journal Information:
Journal Volume: 9; Journal Issue: 1; Journal ID: ISSN 2185-2839
Publisher:
IJNC Editorial Committee
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Herault, Thomas, Robert, Yves, Bouteiller, Aurelien, Arnold, Dorian, Ferreira, Kurt Brian, George, George, and Dongarra, Jack. Checkpointing Strategies for Shared High-Performance Computing Platforms. United States: N. p., 2019. Web. doi:10.15803/ijnc.9.1_28.
Herault, Thomas, Robert, Yves, Bouteiller, Aurelien, Arnold, Dorian, Ferreira, Kurt Brian, George, George, & Dongarra, Jack. Checkpointing Strategies for Shared High-Performance Computing Platforms. United States. https://doi.org/10.15803/ijnc.9.1_28
Herault, Thomas, Robert, Yves, Bouteiller, Aurelien, Arnold, Dorian, Ferreira, Kurt Brian, George, George, and Dongarra, Jack. 2019. "Checkpointing Strategies for Shared High-Performance Computing Platforms". United States. https://doi.org/10.15803/ijnc.9.1_28. https://www.osti.gov/servlets/purl/1492861.
@article{osti_1492861,
title = {Checkpointing Strategies for Shared High-Performance Computing Platforms},
author = {Herault, Thomas and Robert, Yves and Bouteiller, Aurelien and Arnold, Dorian and Ferreira, Kurt Brian and George, George and Dongarra, Jack},
abstractNote = {Input/output (I/O) from various sources often contends for scarce bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments. However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale. We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Altogether, these results provide critical analysis and direct guidance on how to design efficient, CR-ready, large-scale platforms without a large investment in the I/O subsystem.},
doi = {10.15803/ijnc.9.1_28},
journal = {International Journal of Networking and Computing},
number = 1,
volume = 9,
place = {United States},
year = {2019},
month = {1}
}