Title: Spark and HPC for High Energy Physics Data Analyses

Abstract

A full High Energy Physics (HEP) data analysis is divided into multiple data reduction phases. Processing within these phases is extremely time consuming; therefore, intermediate results are stored in files held in mass storage systems and referenced as part of large datasets. This processing model limits what can be done with interactive data analytics. Growth in the size and complexity of experimental datasets, along with emerging big data tools, is beginning to change the traditional ways of doing data analyses. Use of big data tools for HEP analysis looks promising, mainly because extremely large HEP datasets can be represented and held in memory across a system, and accessed interactively by encoding an analysis using high-level programming abstractions. The mainstream tools, however, are not designed for scientific computing or for exploiting the available HPC platform features. We use an example from the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) in Geneva, Switzerland. The LHC is the highest-energy particle collider in the world. Our use case focuses on searching for new types of elementary particles explaining Dark Matter in the universe. We use HDF5 as our input data format, and Spark to implement the use case. We show the benefits and limitations of using Spark with HDF5 on Edison at NERSC.

Authors:
Sehrish, Saba; Kowalkowski, Jim; Paterno, Marc
Publication Date:
Research Org.:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
OSTI Identifier:
1355920
Report Number(s):
FERMILAB-PUB-17-078-CD
1598097
DOE Contract Number:  
AC02-07CH11359
Resource Type:
Journal Article
Resource Relation:
Conference: Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, Lake Buena Vista, FL (United States), 29 May-2 Jun 2017
Country of Publication:
United States
Language:
English

Citation Formats

Sehrish, Saba, Kowalkowski, Jim, and Paterno, Marc. Spark and HPC for High Energy Physics Data Analyses. United States: N. p., 2017. Web. doi:10.1109/IPDPSW.2017.112.
Sehrish, Saba, Kowalkowski, Jim, & Paterno, Marc. Spark and HPC for High Energy Physics Data Analyses. United States. https://doi.org/10.1109/IPDPSW.2017.112
Sehrish, Saba, Kowalkowski, Jim, and Paterno, Marc. Mon . "Spark and HPC for High Energy Physics Data Analyses". United States. https://doi.org/10.1109/IPDPSW.2017.112. https://www.osti.gov/servlets/purl/1355920.
@article{osti_1355920,
title = {Spark and HPC for High Energy Physics Data Analyses},
author = {Sehrish, Saba and Kowalkowski, Jim and Paterno, Marc},
abstractNote = {A full High Energy Physics (HEP) data analysis is divided into multiple data reduction phases. Processing within these phases is extremely time consuming; therefore, intermediate results are stored in files held in mass storage systems and referenced as part of large datasets. This processing model limits what can be done with interactive data analytics. Growth in the size and complexity of experimental datasets, along with emerging big data tools, is beginning to change the traditional ways of doing data analyses. Use of big data tools for HEP analysis looks promising, mainly because extremely large HEP datasets can be represented and held in memory across a system, and accessed interactively by encoding an analysis using high-level programming abstractions. The mainstream tools, however, are not designed for scientific computing or for exploiting the available HPC platform features. We use an example from the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) in Geneva, Switzerland. The LHC is the highest-energy particle collider in the world. Our use case focuses on searching for new types of elementary particles explaining Dark Matter in the universe. We use HDF5 as our input data format, and Spark to implement the use case. We show the benefits and limitations of using Spark with HDF5 on Edison at NERSC.},
doi = {10.1109/IPDPSW.2017.112},
url = {https://www.osti.gov/biblio/1355920},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2017},
month = {5}
}