Optimization of Error-Bounded Lossy Compression for Hard-to-Compress HPC Data
Abstract
Since today’s scientific applications are producing vast amounts of data, compressing them before storage/transmission is critical. Results of existing compressors show two types of HPC data sets: highly compressible and hard to compress. In this work, we carefully design and optimize error-bounded lossy compression for hard-to-compress scientific data. We propose an optimized algorithm that can adaptively partition the HPC data into best-fit consecutive segments, each having mutually close data values, such that the compression condition can be optimized. Another significant contribution is the optimization of the shifting offset such that the XOR-leading-zero length between two consecutive unpredictable data points can be maximized. We finally devise an adaptive method to select the best-fit compressor at runtime for maximizing the compression factor. We evaluate our solution using 13 benchmarks based on real-world scientific problems, and we compare it with 9 other state-of-the-art compressors. Experiments show that our compressor can always keep the compression errors within the user-specified error bounds. Most importantly, our optimization improves the compression factor effectively, by up to 49% for hard-to-compress data sets, with similar compression/decompression time cost.
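The XOR-leading-zero quantity mentioned in the abstract can be illustrated with a small sketch. The snippet below is illustrative only (the function name and `offset` parameter are hypothetical, not the authors' actual API): it counts how many leading bits two consecutive IEEE-754 doubles share after an optional additive shift, which is the quantity the proposed offset optimization tries to maximize.

```python
import struct

def xor_leading_zeros(a: float, b: float, offset: float = 0.0) -> int:
    """Count leading zero bits in the XOR of two IEEE-754 doubles.

    A larger count means the two (shifted) values share a longer
    bit prefix, so fewer bits need to be stored for the second one.
    """
    # Reinterpret each shifted double as a big-endian 64-bit integer.
    ia = struct.unpack(">Q", struct.pack(">d", a + offset))[0]
    ib = struct.unpack(">Q", struct.pack(">d", b + offset))[0]
    x = ia ^ ib
    if x == 0:
        return 64          # identical bit patterns
    return 64 - x.bit_length()
```

For example, 1.0 and 1.5 differ only from the second mantissa bit onward, so their XOR has 12 leading zeros, while 1.0 and -1.0 already differ in the sign bit, giving 0.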
 Authors:
 Di, Sheng; Cappello, Franck
 Publication Date:
 January 2018
 Research Org.:
 Argonne National Lab. (ANL), Argonne, IL (United States)
 Sponsoring Org.:
 USDOE National Nuclear Security Administration (NNSA)
 OSTI Identifier:
 1417025
 DOE Contract Number:
 AC02-06CH11357
 Resource Type:
 Journal Article
 Resource Relation:
 Journal Name: IEEE Transactions on Parallel and Distributed Systems; Journal Volume: 29; Journal Issue: 1
 Country of Publication:
 United States
 Language:
 English
 Subject:
 Error-bounded lossy compression; floating-point data compression; high performance computing; scientific simulation
Citation Formats
Di, Sheng, and Cappello, Franck. Optimization of Error-Bounded Lossy Compression for Hard-to-Compress HPC Data. United States: N. p., 2018.
Web. doi:10.1109/TPDS.2017.2749300.
Di, Sheng, & Cappello, Franck. Optimization of Error-Bounded Lossy Compression for Hard-to-Compress HPC Data. United States. doi:10.1109/TPDS.2017.2749300.
Di, Sheng, and Cappello, Franck. 2018.
"Optimization of Error-Bounded Lossy Compression for Hard-to-Compress HPC Data". United States.
doi:10.1109/TPDS.2017.2749300.
@article{osti_1417025,
title = {Optimization of Error-Bounded Lossy Compression for Hard-to-Compress HPC Data},
author = {Di, Sheng and Cappello, Franck},
abstractNote = {Since today’s scientific applications are producing vast amounts of data, compressing them before storage/transmission is critical. Results of existing compressors show two types of HPC data sets: highly compressible and hard to compress. In this work, we carefully design and optimize error-bounded lossy compression for hard-to-compress scientific data. We propose an optimized algorithm that can adaptively partition the HPC data into best-fit consecutive segments, each having mutually close data values, such that the compression condition can be optimized. Another significant contribution is the optimization of the shifting offset such that the XOR-leading-zero length between two consecutive unpredictable data points can be maximized. We finally devise an adaptive method to select the best-fit compressor at runtime for maximizing the compression factor. We evaluate our solution using 13 benchmarks based on real-world scientific problems, and we compare it with 9 other state-of-the-art compressors. Experiments show that our compressor can always keep the compression errors within the user-specified error bounds. Most importantly, our optimization improves the compression factor effectively, by up to 49% for hard-to-compress data sets, with similar compression/decompression time cost.},
doi = {10.1109/TPDS.2017.2749300},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 1,
volume = 29,
place = {United States},
year = 2018,
month = 1
}

High-resolution Earth system model simulations generate enormous data volumes, and retaining the data from these simulations often strains institutional storage resources. Further, these exceedingly large storage requirements negatively impact science objectives, for example, by forcing reductions in data output frequency, simulation length, or ensemble size. To lessen data volumes from the Community Earth System Model (CESM), we advocate the use of lossy data compression techniques. While lossy data compression does not exactly preserve the original data (as lossless compression does), lossy techniques have an advantage in terms of smaller storage requirements. To preserve the integrity of the scientific simulation data, … (Cited by 2)

Lossy compression of weak lensing data
Future orbiting observatories will survey large areas of sky in order to constrain the physics of dark matter and dark energy using weak gravitational lensing and other methods. Lossy compression of the resultant data will improve the cost and feasibility of transmitting the images through the space communication network. We evaluate the consequences of the lossy compression algorithm of Bernstein et al. (2010) for the high-precision measurement of weak-lensing galaxy ellipticities. This square-root algorithm compresses each pixel independently, and the information discarded is by construction less than the Poisson error from photon shot noise. For simulated space-based images (without cosmic …
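The square-root principle described in this snippet can be sketched minimally. The code below is not the exact Bernstein et al. (2010) algorithm, only an illustration of the underlying idea: quantizing pixel counts uniformly in the square-root domain makes the count-domain quantization error scale like sqrt(N), i.e., it stays below the Poisson shot noise.

```python
import math

def sqrt_compress(counts, step=0.5):
    """Quantize photon counts uniformly in the square-root domain.

    With step < 1, the reconstruction error on a count N is roughly
    step * sqrt(N), below the Poisson noise sqrt(N). Names and the
    step value are illustrative assumptions, not the published spec.
    """
    return [round(math.sqrt(c) / step) for c in counts]

def sqrt_decompress(codes, step=0.5):
    """Invert the square-root quantization."""
    return [(k * step) ** 2 for k in codes]
```

Because the integer codes grow only like sqrt of the dynamic range, they need roughly half the bits of the raw counts before any entropy coding is applied.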
The compression–error tradeoff for large gridded data sets
The netCDF-4 format is widely used for large gridded scientific data sets and includes several compression methods: lossy linear scaling and the non-lossy deflate and shuffle algorithms. Many multidimensional geoscientific data sets exhibit considerable variation over one or several spatial dimensions (e.g., vertically) with less variation in the remaining dimensions (e.g., horizontally). On such data sets, linear scaling with a single pair of scale and offset parameters often entails considerable loss of precision. We introduce an alternative compression method called "layer-packing" that simultaneously exploits lossy linear scaling and lossless compression. Layer-packing stores arrays (instead of a scalar pair) of scale … (Cited by 1)
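The per-layer scaling idea behind layer-packing can be sketched as follows. This is a hypothetical minimal implementation (not the paper's actual code), assuming the strongly varying dimension, e.g. the vertical, is axis 0 of a 3-D array: each layer gets its own scale/offset pair, so a layer with a small value range is not quantized with a scale sized for the whole array.

```python
import numpy as np

def pack_layers(data: np.ndarray, nbits: int = 16):
    """Linear scaling with one (scale, offset) pair per layer along axis 0."""
    levels = (1 << nbits) - 1
    # Per-layer minimum becomes the offset; per-layer range sets the scale.
    offsets = data.min(axis=(1, 2), keepdims=True)
    scales = (data.max(axis=(1, 2), keepdims=True) - offsets) / levels
    scales[scales == 0] = 1.0      # avoid division by zero on constant layers
    packed = np.round((data - offsets) / scales).astype(np.uint16)
    return packed, scales, offsets

def unpack_layers(packed, scales, offsets):
    """Invert the per-layer linear scaling."""
    return packed * scales + offsets
```

The maximum reconstruction error in each layer is half that layer's scale, so layers with a narrow value range keep far more precision than under a single global scale/offset pair.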
Reducing Disk Storage of Full3D Seismic Waveform Tomography (F3DT) Through Lossy Online Compression
Full-3D seismic waveform tomography (F3DT) is the latest seismic tomography technique that can assimilate broadband, multi-component seismic waveform observations into high-resolution 3D subsurface seismic structure models. The main drawback in the current F3DT implementation, in particular the scattering-integral implementation (F3DT-SI), is the high disk storage cost and the associated I/O overhead of archiving the 4D space-time wavefields of the receiver- or source-side strain tensors. The strain tensor fields are needed for computing the data sensitivity kernels, which are used for constructing the Jacobian matrix in the Gauss-Newton optimization algorithm. In this study, we have successfully integrated a lossy compression algorithm … (Cited by 3)