A case study on parallel HDF5 dataset concatenation for high energy physics data analysis
- Northwestern Univ., Evanston, IL (United States)
- Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Argonne National Lab. (ANL), Lemont, IL (United States)
In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, helps us better understand the fundamental particles and their interactions. This data is often captured in many small files, creating a data management challenge for scientists. To facilitate data management, transfer, and analysis on large-scale platforms, it is advantageous to aggregate the data into a smaller number of larger files. However, this aggregation process can consume significant time and resources, and, if performed incorrectly, the resulting aggregated files can be inefficient for highly parallel access during analysis on large-scale platforms. In this paper, we present our case study on parallel I/O strategies and HDF5 features for reducing data aggregation time, making effective use of compression, and ensuring efficient access to the resulting data during analysis at scale. We focus on detector data from NOvA, a large-scale HEP experiment generating many terabytes of data. The lessons learned from our case study inform the handling of similar datasets, thus expanding community knowledge related to this common data management task.
- Research Organization:
- Argonne National Laboratory (ANL), Argonne, IL (United States); Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA); USDOE National Nuclear Security Administration (NNSA), Office of Defense Programs (DP); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR). Scientific Discovery through Advanced Computing (SciDAC); USDOE Office of Science (SC), High Energy Physics (HEP)
- Grant/Contract Number:
- AC02-05CH11231; AC02-07CH11359; AC05-00OR22725; SC0014330; SC0019358
- OSTI ID:
- 1866361
- Report Number(s):
- FERMILAB-PUB-22-256-QIS-SCD; arXiv:2205.01168; oai:inspirehep.net:2064257
- Journal Information:
- Journal Name: Parallel Computing; Vol. 110; ISSN 0167-8191
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English