Exploring the feasibility of lossy compression for PDE simulations
Abstract
Checkpoint restart plays an important role in high-performance computing (HPC) applications, allowing simulation runtime to extend beyond a single job allocation and facilitating recovery from hardware failure. Yet, as machines grow in size and in complexity, traditional approaches to checkpoint restart are becoming prohibitive. Current methods store a subset of the application's state and exploit the memory hierarchy in the machine. However, as the energy cost of data movement continues to dominate, further reductions in checkpoint size are needed. Lossy compression, which can significantly reduce checkpoint sizes, offers a potential to reduce computational cost in checkpoint restart. This article investigates the use of numerical properties of partial differential equation (PDE) simulations, such as bounds on the truncation error, to evaluate the feasibility of using lossy compression in checkpointing PDE simulations. Restart from a checkpoint with lossy compression is considered for a fail-stop error in two time-dependent HPC application codes: PlasComCM and Nek5000. Here, the results show that error in application variables due to a restart from a lossy compressed checkpoint can be masked by the numerical error in the discretization, leading to increased efficiency in checkpoint restart without influencing overall accuracy in the simulation.
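The masking argument in the abstract can be illustrated with a small sketch. This is not the authors' experimental setup and uses neither PlasComCM nor Nek5000: it assumes a 1D heat equation solved with an explicit finite-difference scheme, and the hypothetical helper `lossy_roundtrip` stands in for an error-bounded lossy compressor by applying uniform quantization with a pointwise absolute tolerance. All function names and parameter values are illustrative assumptions.

```python
import numpy as np

def step(u, r):
    """One explicit (FTCS) step of u_t = u_xx; boundary values are left unchanged."""
    new = u.copy()
    new[1:-1] = u[1:-1] + r * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    return new

def lossy_roundtrip(u, abs_tol):
    """Hypothetical stand-in for compress + decompress with a pointwise
    absolute error bound: uniform quantization with bin width 2 * abs_tol."""
    width = 2.0 * abs_tol
    return np.round(u / width) * width

def run(n=65, steps=400, ckpt_step=200, abs_tol=None, r=0.25):
    """Return the max-norm error against the exact solution at the final time.

    If abs_tol is given, the state at ckpt_step is replaced by its lossy
    round trip, mimicking a fail-stop error followed by a restart from a
    lossy compressed checkpoint.
    """
    x = np.linspace(0.0, 1.0, n)
    dx = x[1] - x[0]
    dt = r * dx * dx  # r <= 0.5 keeps the explicit scheme stable
    u = np.sin(np.pi * x)
    for k in range(steps):
        if abs_tol is not None and k == ckpt_step:
            u = lossy_roundtrip(u, abs_tol)
        u = step(u, r)
    exact = np.exp(-np.pi**2 * dt * steps) * np.sin(np.pi * x)
    return np.max(np.abs(u - exact))

baseline = run()             # discretization error only
lossy = run(abs_tol=1e-7)    # tolerance well below the discretization error
heavy = run(abs_tol=1e-2)    # tolerance well above the discretization error

print(f"no restart:            {baseline:.3e}")
print(f"restart, tol = 1e-7:   {lossy:.3e}")
print(f"restart, tol = 1e-2:   {heavy:.3e}")
```

Because this scheme is stable in the max norm for `r <= 0.5`, the perturbation introduced at restart cannot grow beyond `abs_tol`, so the final error of the restarted run stays within `abs_tol` of the uninterrupted run. When the tolerance is chosen below the discretization error, the restart is effectively invisible in the final accuracy; a much looser tolerance can leave a perturbation larger than the discretization error itself.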
- Authors:
- Calhoun, Jon; Cappello, Franck; Olson, Luke N.; Snir, Marc; Gropp, William D.
- Holcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, USA
- Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL, USA
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Publication Date:
- 2018-03-12
- Research Org.:
- Argonne National Laboratory (ANL), Argonne, IL (United States)
- Sponsoring Org.:
- Air Force Research Laboratory (AFRL), Air Force Office of Scientific Research (AFOSR); National Science Foundation (NSF); USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1425688
- Alternate Identifier(s):
- OSTI ID: 1510066
- Grant/Contract Number:
- NA0002374; AC02-06CH11357
- Resource Type:
- Published Article
- Journal Name:
- International Journal of High Performance Computing Applications
- Additional Journal Information:
- Journal Name: International Journal of High Performance Computing Applications; Journal Volume: 33; Journal Issue: 2; Journal ID: ISSN 1094-3420
- Publisher:
- SAGE Publications
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; lossy compression; checkpoint restart; compression; error propagation; error tolerance selection; exascale; fault tolerance
Citation Formats
Calhoun, Jon, Cappello, Franck, Olson, Luke N., Snir, Marc, and Gropp, William D. Exploring the feasibility of lossy compression for PDE simulations. United States: N. p., 2018. Web. doi:10.1177/1094342018762036.
Calhoun, Jon, Cappello, Franck, Olson, Luke N., Snir, Marc, & Gropp, William D. Exploring the feasibility of lossy compression for PDE simulations. United States. https://doi.org/10.1177/1094342018762036
Calhoun, Jon, Cappello, Franck, Olson, Luke N., Snir, Marc, and Gropp, William D. 2018. "Exploring the feasibility of lossy compression for PDE simulations". United States. https://doi.org/10.1177/1094342018762036.
@article{osti_1425688,
title = {Exploring the feasibility of lossy compression for PDE simulations},
author = {Calhoun, Jon and Cappello, Franck and Olson, Luke N. and Snir, Marc and Gropp, William D.},
abstractNote = {Checkpoint restart plays an important role in high-performance computing (HPC) applications, allowing simulation runtime to extend beyond a single job allocation and facilitating recovery from hardware failure. Yet, as machines grow in size and in complexity, traditional approaches to checkpoint restart are becoming prohibitive. Current methods store a subset of the application's state and exploit the memory hierarchy in the machine. However, as the energy cost of data movement continues to dominate, further reductions in checkpoint size are needed. Lossy compression, which can significantly reduce checkpoint sizes, offers a potential to reduce computational cost in checkpoint restart. This article investigates the use of numerical properties of partial differential equation (PDE) simulations, such as bounds on the truncation error, to evaluate the feasibility of using lossy compression in checkpointing PDE simulations. Restart from a checkpoint with lossy compression is considered for a fail-stop error in two time-dependent HPC application codes: PlasComCM and Nek5000. Here, the results show that error in application variables due to a restart from a lossy compressed checkpoint can be masked by the numerical error in the discretization, leading to increased efficiency in checkpoint restart without influencing overall accuracy in the simulation.},
doi = {10.1177/1094342018762036},
journal = {International Journal of High Performance Computing Applications},
number = 2,
volume = 33,
place = {United States},
year = {2018},
month = {mar}
}