U.S. Department of Energy
Office of Scientific and Technical Information

Exploring the Performance of Spark for a Scientific Use Case

Conference · OSTI ID: 1250827
We evaluate the performance of a Spark implementation of a classification algorithm from the domain of High Energy Physics (HEP). Spark is a general engine for in-memory, large-scale data processing, designed for applications in which similar analyses are performed repeatedly on the same large data sets. Classification problems are among the most common and critical data processing tasks across many domains. Many of these tasks are both computation- and data-intensive, involving complex numerical computations on extremely large data sets. We evaluated the Spark implementation on Cori, a NERSC resource, and compared the results to an untuned MPI implementation of the same algorithm. While the Spark implementation scaled well, it is not competitive in speed with our MPI implementation, even when using significantly greater computational resources.
Research Organization:
Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
DOE Contract Number:
AC02-07CH11359
OSTI ID:
1250827
Report Number(s):
FERMILAB-CONF-16-072-CD; 1442301
Country of Publication:
United States
Language:
English

Similar Records

Data-parallel Python for High Energy Physics Analyses
Conference · Fri Oct 26 00:00:00 EDT 2018 · OSTI ID:1490837

Scalable Algorithms for MPI Intergroup Allgather and Allgatherv
Journal Article · Mon Apr 29 20:00:00 EDT 2019 · Parallel Computing · OSTI ID:1577476

Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies
Conference · Thu May 12 00:00:00 EDT 2016 · OSTI ID:1332132
