U.S. Department of Energy
Office of Scientific and Technical Information

Evaluating Awkward Arrays, uproot, and coffea as a query platform for High Energy Physics Data

Conference · J.Phys.Conf.Ser.

Query languages for High Energy Physics (HEP) are an ever-present topic within the field. A query language that can efficiently represent the nested data structures that encode the statistical and physical meaning of HEP data will help analysts by making their code clearer and more pertinent. As the result of a multi-year effort to develop an in-memory columnar representation of HEP data, the NumPy, Awkward Array, and uproot Python packages present a mature and efficient interface to HEP data. Atop that base, the coffea package adds functionality to launch queries at scale, manage and apply experiment-specific transformations to data, and present a rich object-oriented columnar data representation to the analyst. Recently, a set of Analysis Description Language (ADL) benchmarks has been established to compare HEP queries in multiple languages and frameworks. In this paper we present these benchmark queries implemented within the coffea framework and discuss their readability and performance characteristics. We find that the columnar queries perform as well as or better than the implementations given in previous studies.
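As a rough illustration of what a columnar query over nested event data can look like, the sketch below stores a variable number of jets per event as one flat array plus an array of event boundaries, then selects events with at least two jets above a momentum threshold. It uses plain NumPy only and is not the Awkward Array or coffea API; the toy values, the 40 GeV threshold, and the helper name `count_per_event` are assumptions made for this example.

```python
import numpy as np

# Hypothetical toy data: 4 events holding 2, 0, 3, and 1 jets respectively.
jet_pt = np.array([55.0, 42.0, 61.0, 38.0, 45.0, 20.0])  # all jets, flattened
offsets = np.array([0, 2, 2, 5, 6])                      # event i owns jet_pt[offsets[i]:offsets[i+1]]
met = np.array([12.0, 30.0, 25.0, 8.0])                  # one value per event


def count_per_event(mask, offsets):
    """Count the jets passing `mask` in each event, without a Python loop."""
    counts = np.diff(offsets)  # number of jets in each event
    # Label every jet with the index of the event it belongs to.
    event_index = np.repeat(np.arange(len(counts)), counts)
    # Histogram the event labels of the passing jets; empty events count as 0.
    return np.bincount(event_index[mask], minlength=len(counts))


# Query: missing transverse energy of events with at least two jets above 40 GeV.
n_hard_jets = count_per_event(jet_pt > 40.0, offsets)
selected_met = met[n_hard_jets >= 2]

print(n_hard_jets.tolist())   # [2, 0, 2, 0]
print(selected_met.tolist())  # [12.0, 25.0]
```

The whole selection is expressed as array operations over columns rather than as a loop over events, which is the style of query the abstract refers to.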

Research Organization:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC), High Energy Physics (HEP)
Contributing Organization:
CMS
DOE Contract Number:
AC02-07CH11359
OSTI ID:
1958510
Report Number(s):
FERMILAB-CONF-23-083-CMS; oai:inspirehep.net:2633613
Journal Information:
J.Phys.Conf.Ser., Vol. 2438, Issue 1
Country of Publication:
United States
Language:
English
