U.S. Department of Energy
Office of Scientific and Technical Information

Boosting RDataFrame performance with transparent bulk event processing

Conference · EPJ Web Conf.

RDataFrame is ROOT’s high-level interface for Python and C++ data analysis. Since it first became available, RDataFrame adoption has grown steadily and it is now poised to be a major component of analysis software pipelines for LHC Run 3 and beyond. Thanks to its design inspired by declarative programming principles, RDataFrame enables the development of high-performance, highly parallel analyses without requiring expert knowledge of multi-threading and I/O: user logic is expressed in terms of self-contained, small computation kernels tied together by a high-level API. This design completely decouples analysis logic from its actual execution, and opens several interesting avenues for workflow optimization. In particular, in this work we explore the benefits of moving internal data processing from an event-by-event to a bulk-by-bulk loop. This refactoring dramatically reduces the framework’s runtime overheads; in collaboration with the I/O layer it improves data access patterns; it exposes information that optimizing compilers might use to auto-vectorize the invocation of user-defined computations; finally, while existing user-facing interfaces remain unaffected, it becomes possible to additionally offer interfaces that explicitly expose bulks of events, useful e.g. for the injection of GPU kernels into the analysis workflow. In order to inform similar future R&D, design challenges will be presented, as well as an investigation of the relevant time-memory trade-off backed by novel performance benchmarks.
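
The difference between an event-by-event and a bulk-by-bulk loop can be illustrated with a toy model. The sketch below is not RDataFrame's implementation and uses no ROOT API; the function names, the `bookkeeping_steps` counter (a stand-in for per-iteration framework overhead), and the bulk size are all assumptions made for illustration. It only shows that the same per-event kernel yields identical results in both loops, while the framework-level work is paid once per bulk rather than once per event, at the cost of holding a whole bulk in memory.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def run_event_by_event(
    values: Sequence[float], kernel: Callable[[float], float]
) -> Tuple[List[float], int]:
    """Toy event loop: framework bookkeeping happens once per event."""
    results: List[float] = []
    bookkeeping_steps = 0  # hypothetical stand-in for framework overhead
    for value in values:
        bookkeeping_steps += 1
        results.append(kernel(value))
    return results, bookkeeping_steps


def bulks(values: Sequence[float], bulk_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive slices of at most bulk_size events."""
    for start in range(0, len(values), bulk_size):
        yield values[start:start + bulk_size]


def run_bulk_by_bulk(
    values: Sequence[float], kernel: Callable[[float], float], bulk_size: int
) -> Tuple[List[float], int]:
    """Toy bulk loop: bookkeeping happens once per bulk, and the user
    kernel runs in a tight inner loop over the events of that bulk."""
    results: List[float] = []
    bookkeeping_steps = 0
    for bulk in bulks(values, bulk_size):
        bookkeeping_steps += 1
        results.extend(kernel(value) for value in bulk)
    return results, bookkeeping_steps


def pt_squared(pt: float) -> float:
    """Example user kernel (hypothetical): a small per-event computation."""
    return pt * pt


values = [0.5 * i for i in range(10)]
ev_results, ev_steps = run_event_by_event(values, pt_squared)
bulk_results, bulk_steps = run_bulk_by_bulk(values, pt_squared, bulk_size=4)
```

In this toy model the user-facing kernel is unchanged between the two loops, which mirrors the abstract's point that existing interfaces can stay as they are while the internal loop is restructured.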

Research Organization:
Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
DOE Contract Number:
AC02-07CH11359
OSTI ID:
2468774
Report Number(s):
FERMILAB-CONF-24-0684-CSAID; oai:inspirehep.net:2786416
Journal Information:
EPJ Web Conf., Vol. 295
Country of Publication:
United States
Language:
English
