Using Big Data Technologies for HEP Analysis

Cremonesi, Matteo; Bellini, Claudio; Bian, Bianny; Canali, Luca; Dimakopoulos, Vasileios; Elmer, Peter; Fisk, Ian; Girone, Maria; Gutsche, Oliver; Hoh, Siew-Yan; Jayatilaka, Bo; Khristenko, Viktor; Luiselli, Andrea; Melo, Andrew; Evangelos, Evangelos; Olivito, Dominick; Pazzini, Jacopo; Pivarski, Jim; Svyatkovskiy, Alexey; Zanetti, Marco

Journal Article · January 21, 2019 · TBD

OSTI ID:1529354

Cremonesi, Matteo ^[1]; Bellini, Claudio ^[2]; Bian, Bianny ^[2]; Canali, Luca ^[3]; Dimakopoulos, Vasileios ^[3]; Elmer, Peter ^[4]; Fisk, Ian ^[5]; Girone, Maria ^[3]; Gutsche, Oliver ^[1]; Hoh, Siew-Yan ^[6]; Jayatilaka, Bo ^[1]; Khristenko, Viktor ^[3]; Luiselli, Andrea ^[2]; Melo, Andrew ^[7]; Evangelos, Evangelos ^[3]; Olivito, Dominick ^[8]; Pazzini, Jacopo ^[6]; Pivarski, Jim ^[4]; Svyatkovskiy, Alexey ^[4]; Zanetti, Marco ^[6]

[1] Fermilab
[2] Intel, Santa Clara
[3] CERN
[4] Princeton U.
[5] Flatiron Inst., New York
[6] Genoa U.
[7] Vanderbilt U.
[8] UC, San Diego

The HEP community is approaching an era where the excellent performance of particle accelerators in delivering collisions at high rates will force experiments to record large amounts of information. The growing size of the datasets could become a limiting factor in the ability to produce scientific results in a timely and efficient manner. Recently, new technologies and approaches have been developed in industry to meet the need to retrieve information as quickly as possible when analyzing PB- and EB-scale datasets. Providing scientists with these modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother. In this paper, we present the latest developments and most recent results on the use of Apache Spark for HEP analysis. The study aims to evaluate the efficiency of applying the new tools both quantitatively, by measuring performance, and qualitatively, by focusing on the user experience. The first goal is achieved by developing a data reduction facility: working together with CERN Openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data collected by the CMS experiment to 1 TB of data, in a format suitable for physics analysis, within 5 hours. The second goal is achieved by implementing multiple physics use cases in Apache Spark, using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-analyses up to the publication plots on different hardware, feasibility, usability, and portability are compared to those of a traditional ROOT-based workflow.
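To give a sense of scale for the data reduction goal stated in the abstract, the short sketch below works out the implied reduction factor and the average read rate needed to scan the input in the target time. It is a back-of-the-envelope calculation only, not a figure reported by the authors, and it assumes decimal (SI) units, i.e. 1 PB = 10^15 bytes and 1 TB = 10^12 bytes. The helper name `sustained_throughput` is purely illustrative.

```python
# Back-of-the-envelope figures for the data reduction goal quoted in the abstract.
# Assumption (not from the source): decimal SI units, 1 PB = 1e15 B, 1 TB = 1e12 B.
PB = 10**15
TB = 10**12

input_bytes = 1 * PB      # input dataset size (from the abstract)
output_bytes = 1 * TB     # reduced dataset size (from the abstract)
wall_time_s = 5 * 3600    # target wall time of 5 hours, in seconds


def sustained_throughput(n_bytes: int, seconds: int) -> float:
    """Average rate in bytes per second needed to scan n_bytes in the given time."""
    return n_bytes / seconds


# How much smaller the output is than the input.
reduction_factor = input_bytes // output_bytes

# Average aggregate read rate, converted to GB/s (1 GB = 1e9 B).
throughput_gb_s = sustained_throughput(input_bytes, wall_time_s) / 1e9

print(f"reduction factor: {reduction_factor}")
print(f"required sustained throughput: {throughput_gb_s:.1f} GB/s")
```

Under these assumptions the goal corresponds to a 1000-fold reduction and an aggregate read rate of roughly 55.6 GB/s sustained over the full five hours, which is why the storage and network layers matter as much as the compute framework.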

Research Organization: Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)

Sponsoring Organization: USDOE Office of Science (SC), High Energy Physics (HEP)

DOE Contract Number: AC02-07CH11359

OSTI ID: 1529354

Report Number(s): arXiv:1901.07143; FERMILAB-PUB-19-037-CD-PPD; 1716266

Journal Information: TBD, Journal Name: TBD

Country of Publication: United States

Language: English

Similar Records

CMS Analysis and Data Reduction with Apache Spark

Journal Article · October 18, 2018 · Journal of Physics. Conference Series · OSTI ID:1529354

Gutsche, Oliver; Canali, Luca; Cremer, Illia; +12 more

Big Data in HEP: A comprehensive use case study

Journal Article · November 23, 2017 · Journal of Physics. Conference Series · OSTI ID:1529354

Gutsche, Oliver; Cremonesi, Matteo; Elmer, Peter; +7 more

Spark and HPC for High Energy Physics Data Analyses

Journal Article · May 1, 2017 · OSTI ID:1529354

Sehrish, Saba; Kowalkowski, Jim; Paterno, Marc
