U.S. Department of Energy
Office of Scientific and Technical Information

Spark and HPC for High Energy Physics Data Analyses

Journal Article
A full High Energy Physics (HEP) data analysis is divided into multiple data reduction phases. Processing within these phases is extremely time consuming, so intermediate results are stored in files held in mass storage systems and referenced as part of large datasets. This processing model limits what can be done with interactive data analytics. Growth in the size and complexity of experimental datasets, along with emerging big data tools, is beginning to change the traditional ways of doing data analyses. Use of big data tools for HEP analysis looks promising, mainly because extremely large HEP datasets can be represented and held in memory across a system, and accessed interactively by encoding an analysis using high-level programming abstractions. The mainstream tools, however, are not designed for scientific computing or for exploiting the available HPC platform features. We use an example from the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) in Geneva, Switzerland. The LHC is the highest energy particle collider in the world. Our use case focuses on searching for new types of elementary particles that could explain Dark Matter in the universe. We use HDF5 as our input data format, and Spark to implement the use case. We show the benefits and limitations of using Spark with HDF5 on Edison at NERSC.
Research Organization:
Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
DOE Contract Number:
AC02-07CH11359
OSTI ID:
1355920
Report Number(s):
FERMILAB-PUB-17-078-CD; 1598097
Country of Publication:
United States
Language:
English

Similar Records

Python and HPC for High Energy Physics Data Analyses
Journal Article · Sat Dec 31 19:00:00 EST 2016 · OSTI ID:1413085

CMS Analysis and Data Reduction with Apache Spark
Journal Article · Wed Oct 17 20:00:00 EDT 2018 · Journal of Physics. Conference Series · OSTI ID:1414399

PATHA: Performance Analysis Tool for HPC Applications
Journal Article · Wed Feb 17 19:00:00 EST 2016 · IEEE International Performance, Computing, and Communications Conference · OSTI ID:1379097

Related Subjects