Spark and HPC for High Energy Physics Data Analyses
A full High Energy Physics (HEP) data analysis is divided into multiple data reduction phases. Processing within these phases is extremely time consuming, so intermediate results are stored in files held in mass storage systems and referenced as part of large datasets. This processing model limits what can be done with interactive data analytics. Growth in the size and complexity of experimental datasets, along with emerging big data tools, is beginning to change the traditional ways of doing data analyses. Use of big data tools for HEP analysis looks promising, mainly because extremely large HEP datasets can be represented and held in memory across a system, and accessed interactively by encoding an analysis using high-level programming abstractions. The mainstream tools, however, are not designed for scientific computing or for exploiting the available HPC platform features. We use an example from the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) in Geneva, Switzerland. The LHC is the highest energy particle collider in the world. Our use case focuses on searching for new types of elementary particles explaining Dark Matter in the universe. We use HDF5 as our input data format, and Spark to implement the use case. We show the benefits and limitations of using Spark with HDF5 on Edison at NERSC.
- Research Organization: Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)
- Sponsoring Organization: USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
- DOE Contract Number: AC02-07CH11359
- OSTI ID: 1355920
- Report Number(s): FERMILAB-PUB-17-078-CD; 1598097
- Country of Publication: United States
- Language: English
Similar Records
Python and HPC for High Energy Physics Data Analyses
Journal Article · Sat Dec 31 19:00:00 EST 2016 · OSTI ID:1413085
CMS Analysis and Data Reduction with Apache Spark
Journal Article · Wed Oct 17 20:00:00 EDT 2018 · Journal of Physics. Conference Series · OSTI ID:1414399
PATHA: Performance Analysis Tool for HPC Applications
Journal Article · Wed Feb 17 19:00:00 EST 2016 · IEEE International Performance, Computing, and Communications Conference · OSTI ID:1379097