DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: A case study on parallel HDF5 dataset concatenation for high energy physics data analysis

Journal Article · Parallel Computing

In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, help us better understand the fundamental particles and their interactions. This data is often captured in many small files, creating a data management challenge for scientists. To facilitate data management, transfer, and analysis on large-scale platforms, it is advantageous to aggregate the data into a smaller number of larger files. However, this aggregation process can consume significant time and resources, and if performed incorrectly, the resulting aggregated files can be inefficient for highly parallel access during analysis on large-scale platforms. In this paper, we present our case study on parallel I/O strategies and HDF5 features for reducing data aggregation time, making effective use of compression, and ensuring efficient access to the resulting data during analysis at scale. This case study focuses on data from the NOvA detector, a large-scale HEP experiment generating many terabytes of data. The lessons learned from our case study inform the handling of similar datasets, thus expanding community knowledge related to this common data management task.
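
To make the aggregation task concrete, the sketch below shows one common way to plan a parallel concatenation: an exclusive prefix sum over per-file row counts gives each input file its destination offset in the combined dataset, and the files are split into contiguous blocks across workers. This illustrates the general idea only and is not necessarily the strategy used in the paper. The function name `plan_concatenation`, its parameters, and the example row counts are hypothetical, and the actual HDF5 reads and writes are deliberately left out.

```python
from itertools import accumulate


def plan_concatenation(row_counts, num_ranks):
    """Plan where each input file's rows land in one concatenated dataset.

    row_counts: rows contributed by each input file, in concatenation order.
    num_ranks:  number of parallel workers (hypothetical, e.g. MPI ranks).
    Returns (total_rows, plan), where plan[r] is a list of
    (file_index, start_row, end_row) slices that worker r would write.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")

    # Exclusive prefix sum: offsets[i] is the first output row of file i,
    # and offsets[-1] is the total size of the concatenated dataset.
    offsets = [0] + list(accumulate(row_counts))
    total_rows = offsets[-1]

    # Assign files in contiguous blocks so that each worker writes a single
    # contiguous region of the output and no two workers overlap.
    plan = [[] for _ in range(num_ranks)]
    base, extra = divmod(len(row_counts), num_ranks)
    first = 0
    for rank in range(num_ranks):
        count = base + (1 if rank < extra else 0)
        for i in range(first, first + count):
            plan[rank].append((i, offsets[i], offsets[i + 1]))
        first += count
    return total_rows, plan


# Illustrative input: five small files concatenated by two workers.
total, plan = plan_concatenation([100, 250, 50, 400, 200], num_ranks=2)
for rank, slices in enumerate(plan):
    print(rank, slices)
```

Because every worker knows its target row range in advance, the output dataset can be created once at its final size and each worker can write its slices independently. How those writes interact with chunking and compression is the kind of question the case study examines.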

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States); Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); USDOE National Nuclear Security Administration (NNSA), Office of Defense Programs (DP); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR). Scientific Discovery through Advanced Computing (SciDAC); USDOE Office of Science (SC), High Energy Physics (HEP)
Grant/Contract Number:
AC02-05CH11231; AC02-07CH11359; AC05-00OR22725; SC0014330; SC0019358
OSTI ID:
1866361
Report Number(s):
FERMILAB-PUB-22-256-QIS-SCD; arXiv:2205.01168; oai:inspirehep.net:2064257
Journal Information:
Journal Name: Parallel Computing; Vol. 110; ISSN 0167-8191
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
