
Title: Exploring the feasibility of lossy compression for PDE simulations

Abstract

Checkpoint restart plays an important role in high-performance computing (HPC) applications, allowing simulation runtime to extend beyond a single job allocation and facilitating recovery from hardware failure. Yet, as machines grow in size and in complexity, traditional approaches to checkpoint restart are becoming prohibitive. Current methods store a subset of the application's state and exploit the memory hierarchy in the machine. However, as the energy cost of data movement continues to dominate, further reductions in checkpoint size are needed. Lossy compression, which can significantly reduce checkpoint sizes, offers a potential to reduce computational cost in checkpoint restart. This article investigates the use of numerical properties of partial differential equation (PDE) simulations, such as bounds on the truncation error, to evaluate the feasibility of using lossy compression in checkpointing PDE simulations. Restart from a checkpoint with lossy compression is considered for a fail-stop error in two time-dependent HPC application codes: PlasComCM and Nek5000. Here, the results show that error in application variables due to a restart from a lossy compressed checkpoint can be masked by the numerical error in the discretization, leading to increased efficiency in checkpoint restart without influencing overall accuracy in the simulation.
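The masking argument in the abstract can be illustrated on a toy problem. The sketch below is not the article's method and does not involve PlasComCM or Nek5000: it assumes a 1D heat equation solved with an explicit finite-difference scheme, and it replaces a real error-bounded compressor with plain uniform quantization. The function names (`step`, `lossy_roundtrip`, `run`) and all parameter values are hypothetical choices made for illustration. It compares the deviation introduced by one restart from a lossy checkpoint against the discretization error of the uninterrupted run.

```python
import numpy as np

def step(u, r):
    """One explicit (FTCS) step of u_t = u_xx; boundary values are left unchanged."""
    v = u.copy()
    v[1:-1] = u[1:-1] + r * (u[:-2] - 2.0 * u[1:-1] + u[2:])
    return v

def lossy_roundtrip(u, eps):
    """Stand-in for compress + decompress with an absolute error bound eps.

    Assumption: uniform quantization to a grid of spacing 2*eps, so every
    value moves by at most eps. A real compressor would do far more.
    """
    return np.round(u / (2.0 * eps)) * (2.0 * eps)

def run(n=64, steps=2000, r=0.4, eps=None, ckpt_step=1000):
    """Integrate from u(x, 0) = sin(pi x); optionally restart once from a lossy state."""
    x = np.linspace(0.0, 1.0, n + 1)
    dt = r * (x[1] - x[0]) ** 2          # r <= 0.5 keeps the scheme stable
    u = np.sin(np.pi * x)
    for k in range(steps):
        if eps is not None and k == ckpt_step:
            u = lossy_roundtrip(u, eps)  # simulate restart from a lossy checkpoint
        u = step(u, r)
    exact = np.exp(-np.pi ** 2 * steps * dt) * np.sin(np.pi * x)
    return u, exact

if __name__ == "__main__":
    eps = 1e-6
    u_ref, exact = run(eps=None)
    u_lossy, _ = run(eps=eps)
    print("discretization error (no restart):", np.max(np.abs(u_ref - exact)))
    print("deviation caused by lossy restart:", np.max(np.abs(u_lossy - u_ref)))
    print("total error after lossy restart:  ", np.max(np.abs(u_lossy - exact)))
```

For this particular scheme with `r <= 0.5`, each update is a convex combination of neighboring values, so a perturbation of at most `eps` at the checkpoint cannot grow in the maximum norm. That property is specific to this toy setup; for other discretizations or nonlinear problems the propagation of compression error would need to be examined separately.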

Authors:
 Calhoun, Jon [1]; Cappello, Franck [2]; Olson, Luke N. [3]; Snir, Marc [3]; Gropp, William D. [3]
  1. Clemson Univ., Clemson, SC (United States)
  2. Argonne National Lab. (ANL), Lemont, IL (United States)
  3. Univ. of Illinois at Urbana-Champaign, Urbana, IL (United States)
Publication Date:
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
Air Force Research Laboratory (AFRL), Air Force Office of Scientific Research (AFOSR); National Science Foundation (NSF); USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1425688
Alternate Identifier(s):
OSTI ID: 1510066
Grant/Contract Number:  
AC02-06CH11357; NA0002374
Resource Type:
Published Article
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 33; Journal Issue: 2; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; lossy compression; checkpoint restart; compression; error propagation; error tolerance selection; exascale; fault tolerance

Citation Formats

Calhoun, Jon, Cappello, Franck, Olson, Luke N., Snir, Marc, and Gropp, William D. Exploring the feasibility of lossy compression for PDE simulations. United States: N. p., 2018. Web. doi:10.1177/1094342018762036.
@article{osti_1425688,
title = {Exploring the feasibility of lossy compression for PDE simulations},
author = {Calhoun, Jon and Cappello, Franck and Olson, Luke N. and Snir, Marc and Gropp, William D.},
doi = {10.1177/1094342018762036},
journal = {International Journal of High Performance Computing Applications},
number = 2,
volume = 33,
place = {United States},
year = {2018},
month = {3}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
DOI: 10.1177/1094342018762036
