Title: Exploring the feasibility of lossy compression for PDE simulations

Abstract

Checkpoint restart plays an important role in high-performance computing (HPC) applications, allowing simulation runtime to extend beyond a single job allocation and facilitating recovery from hardware failure. Yet, as machines grow in size and in complexity, traditional approaches to checkpoint restart are becoming prohibitive. Current methods store a subset of the application's state and exploit the memory hierarchy in the machine. However, as the energy cost of data movement continues to dominate, further reductions in checkpoint size are needed. Lossy compression, which can significantly reduce checkpoint sizes, offers a potential to reduce computational cost in checkpoint restart. This article investigates the use of numerical properties of partial differential equation (PDE) simulations, such as bounds on the truncation error, to evaluate the feasibility of using lossy compression in checkpointing PDE simulations. Restart from a checkpoint with lossy compression is considered for a fail-stop error in two time-dependent HPC application codes: PlasComCM and Nek5000. Here, the results show that error in application variables due to a restart from a lossy compressed checkpoint can be masked by the numerical error in the discretization, leading to increased efficiency in checkpoint restart without influencing overall accuracy in the simulation.
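
The masking argument in the abstract can be illustrated on a toy problem. The sketch below is not taken from the article and does not use PlasComCM, Nek5000, or any real compressor: it assumes a 1D heat equation solved with an explicit finite-difference scheme, and a uniform quantizer (the hypothetical helper `lossy_roundtrip`) stands in for an error-bounded lossy compressor. The grid size, step counts, and the error bound `eps` are arbitrary illustrative choices.

```python
import math

def ftcs_step(u, r):
    """One explicit (FTCS) step of u_t = u_xx with zero Dirichlet boundaries."""
    new = u[:]
    for j in range(1, len(u) - 1):
        new[j] = u[j] + r * (u[j - 1] - 2.0 * u[j] + u[j + 1])
    return new

def lossy_roundtrip(u, eps):
    """Stand-in for compress + decompress (assumption, not a real compressor):
    uniform quantization with pointwise absolute error at most eps."""
    step = 2.0 * eps
    return [round(v / step) * step for v in u]

def run(nx=32, r=0.4, steps=200, ckpt_step=100, eps=None):
    """Return the max-norm error against the exact solution at the final time.

    If eps is given, the state at ckpt_step is replaced by its lossy round
    trip, mimicking a restart from a lossy compressed checkpoint.
    """
    dx = 1.0 / nx
    dt = r * dx * dx  # r <= 0.5 keeps the explicit scheme stable
    x = [j * dx for j in range(nx + 1)]
    u = [math.sin(math.pi * xj) for xj in x]
    u[0] = u[-1] = 0.0
    for n in range(steps):
        if eps is not None and n == ckpt_step:
            u = lossy_roundtrip(u, eps)
        u = ftcs_step(u, r)
    t = steps * dt
    decay = math.exp(-math.pi ** 2 * t)  # exact solution: decay * sin(pi x)
    return max(abs(uj - decay * math.sin(math.pi * xj)) for uj, xj in zip(u, x))

eps = 1e-5
err_clean = run()
err_lossy = run(eps=eps)
print(f"discretization error, no restart:        {err_clean:.3e}")
print(f"error after lossy restart (eps = {eps:g}): {err_lossy:.3e}")
```

Because this scheme does not amplify perturbations in the max norm when `r <= 0.5`, the two reported errors should differ by no more than `eps`. When `eps` is chosen well below the discretization error, the restart is therefore hard to distinguish from an uninterrupted run in this toy setting. Whether the same holds for a production code depends on its discretization and on the compressor used.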

Authors:
Calhoun, Jon [1]; Cappello, Franck [2]; Olson, Luke N. [3]; Snir, Marc [3]; Gropp, William D. [3]
  1. Holcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, USA
  2. Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL, USA
  3. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Publication Date:
2018-03-12
Research Org.:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Org.:
Air Force Research Laboratory (AFRL), Air Force Office of Scientific Research (AFOSR); National Science Foundation (NSF); USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1425688
Alternate Identifier(s):
OSTI ID: 1510066
Grant/Contract Number:  
NA0002374; AC02-06CH11357
Resource Type:
Published Article
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 33; Journal Issue: 2; Journal ID: ISSN 1094-3420
Publisher:
SAGE Publications
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; lossy compression; checkpoint restart; compression; error propagation; error tolerance selection; exascale; fault tolerance

Citation Formats

Calhoun, Jon, Cappello, Franck, Olson, Luke N., Snir, Marc, and Gropp, William D. Exploring the feasibility of lossy compression for PDE simulations. United States: N. p., 2018. Web. doi:10.1177/1094342018762036.
Calhoun, Jon, Cappello, Franck, Olson, Luke N., Snir, Marc, & Gropp, William D. Exploring the feasibility of lossy compression for PDE simulations. United States. https://doi.org/10.1177/1094342018762036
Calhoun, Jon, Cappello, Franck, Olson, Luke N., Snir, Marc, and Gropp, William D. 2018. "Exploring the feasibility of lossy compression for PDE simulations". United States. https://doi.org/10.1177/1094342018762036.
@article{osti_1425688,
title = {Exploring the feasibility of lossy compression for PDE simulations},
author = {Calhoun, Jon and Cappello, Franck and Olson, Luke N. and Snir, Marc and Gropp, William D.},
abstractNote = {Checkpoint restart plays an important role in high-performance computing (HPC) applications, allowing simulation runtime to extend beyond a single job allocation and facilitating recovery from hardware failure. Yet, as machines grow in size and in complexity, traditional approaches to checkpoint restart are becoming prohibitive. Current methods store a subset of the application's state and exploit the memory hierarchy in the machine. However, as the energy cost of data movement continues to dominate, further reductions in checkpoint size are needed. Lossy compression, which can significantly reduce checkpoint sizes, offers a potential to reduce computational cost in checkpoint restart. This article investigates the use of numerical properties of partial differential equation (PDE) simulations, such as bounds on the truncation error, to evaluate the feasibility of using lossy compression in checkpointing PDE simulations. Restart from a checkpoint with lossy compression is considered for a fail-stop error in two time-dependent HPC application codes: PlasComCM and Nek5000. Here, the results show that error in application variables due to a restart from a lossy compressed checkpoint can be masked by the numerical error in the discretization, leading to increased efficiency in checkpoint restart without influencing overall accuracy in the simulation.},
doi = {10.1177/1094342018762036},
journal = {International Journal of High Performance Computing Applications},
number = 2,
volume = 33,
place = {United States},
year = {2018},
month = {3}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
https://doi.org/10.1177/1094342018762036

Citation Metrics:
Cited by: 21 works
Citation information provided by
Web of Science
