OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: CMS Analysis and Data Reduction with Apache Spark

Abstract

Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems for distributed data processing, collectively called "Big Data" technologies, have emerged from industry and open-source projects to support the analysis of Petabyte- and Exabyte-scale datasets. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and tools, promising a fresh look at the analysis of very large datasets that could potentially reduce the time-to-physics with increased interactivity. Moreover, these new tools are typically actively developed by large communities, often profiting from industry resources, and released under open-source licenses. These factors boost the adoption and maturity of the tools and of the communities supporting them, while also reducing the cost of ownership for end users. In this talk, we present studies of using Apache Spark for end-user data analysis. We separate the HEP analysis workflow into two thrusts: the reduction of centrally produced experiment datasets, and the end analysis up to the publication plot. For the first thrust, CMS is working together with CERN openlab and Intel on the CMS Big Data Reduction Facility. The goal is to reduce 1 PB of official CMS data to 1 TB of ntuple output for analysis. We present the progress of this two-year project, including first results of scaling up Spark-based HEP analysis. For the second thrust, we present studies on using Apache Spark for a CMS Dark Matter physics search, comparing Spark's feasibility, usability, and performance to those of the ROOT-based analysis.
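
Illustrative sketch (not from the paper): the reduction step described above is essentially a filter-and-project job over event records, so a minimal Spark version of it fits in a few lines. The sketch below uses PySpark and assumes the CMS events are already available to Spark in a columnar format such as Parquet; the paths, column names, and selection cut are hypothetical placeholders, not the facility's actual configuration.

# Minimal PySpark sketch of a filter-and-project ("ntupling") reduction job.
# All paths, column names, and cut values are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cms-reduction-sketch").getOrCreate()

# Read the centrally produced events (assumed here to be stored as Parquet).
events = spark.read.parquet("hdfs:///data/cms/events.parquet")

# Keep only events passing a simple selection and project out the handful of
# columns needed for the analysis ntuple.
ntuple = (
    events
    .filter(F.col("muon_pt") > 30.0)        # placeholder event selection
    .select("run", "lumi", "event",          # bookkeeping columns
            "muon_pt", "muon_eta", "met")    # analysis variables
)

# Write the reduced ntuple; the output is a small fraction of the input size.
ntuple.write.mode("overwrite").parquet("hdfs:///user/analysis/ntuple.parquet")

# The end analysis can stay in Spark as well, e.g. a coarse histogram of
# missing transverse energy computed directly from the reduced columns.
hist = (
    ntuple
    .withColumn("met_bin", F.floor(F.col("met") / 20.0) * 20.0)
    .groupBy("met_bin")
    .count()
    .orderBy("met_bin")
)
hist.show()

spark.stop()

In the actual facility, the experiment's ROOT-format data would have to be read through a dedicated Spark data source rather than from a Parquet copy; the sketch sidesteps that detail, but the filter/select/write pattern, and the resulting size reduction, are the same.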

Authors:
Gutsche, Oliver [1]; Canali, Luca [2]; Cremer, Illia [3]; Cremonesi, Matteo [1]; Elmer, Peter [4]; Fisk, Ian [5]; Girone, Maria [2]; Jayatilaka, Bo [1]; Kowalkowski, Jim [1]; Khristenko, Viktor [2]; Motesnitsalis, Evangelos [2]; Pivarski, Jim [4]; Sehrish, Saba [1]; Surdy, Kacper [2]; Svyatkovskiy, Alexey [4]
  1. Fermilab
  2. CERN
  3. Magnetic Corp., Waltham
  4. Princeton U.
  5. Flatiron Inst., New York
Publication Date: 2017
Research Org.:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
OSTI Identifier:
1414399
Report Number(s):
arXiv:1711.00375; FERMILAB-CONF-17-465-CD
DOE Contract Number:
AC02-07CH11359
Resource Type:
Conference
Resource Relation:
Conference: 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research, Seattle, WA, USA, 08/21-08/25/2017
Country of Publication:
United States
Language:
English

Citation Formats

Gutsche, Oliver, Canali, Luca, Cremer, Illia, Cremonesi, Matteo, Elmer, Peter, Fisk, Ian, Girone, Maria, Jayatilaka, Bo, Kowalkowski, Jim, Khristenko, Viktor, Motesnitsalis, Evangelos, Pivarski, Jim, Sehrish, Saba, Surdy, Kacper, and Svyatkovskiy, Alexey. CMS Analysis and Data Reduction with Apache Spark. United States: N. p., 2017. Web.
Gutsche, Oliver, Canali, Luca, Cremer, Illia, Cremonesi, Matteo, Elmer, Peter, Fisk, Ian, Girone, Maria, Jayatilaka, Bo, Kowalkowski, Jim, Khristenko, Viktor, Motesnitsalis, Evangelos, Pivarski, Jim, Sehrish, Saba, Surdy, Kacper, & Svyatkovskiy, Alexey. CMS Analysis and Data Reduction with Apache Spark. United States.
Gutsche, Oliver, Canali, Luca, Cremer, Illia, Cremonesi, Matteo, Elmer, Peter, Fisk, Ian, Girone, Maria, Jayatilaka, Bo, Kowalkowski, Jim, Khristenko, Viktor, Motesnitsalis, Evangelos, Pivarski, Jim, Sehrish, Saba, Surdy, Kacper, and Svyatkovskiy, Alexey. 2017. "CMS Analysis and Data Reduction with Apache Spark". United States. https://www.osti.gov/servlets/purl/1414399.
@article{osti_1414399,
title = {CMS Analysis and Data Reduction with Apache Spark},
author = {Gutsche, Oliver and Canali, Luca and Cremer, Illia and Cremonesi, Matteo and Elmer, Peter and Fisk, Ian and Girone, Maria and Jayatilaka, Bo and Kowalkowski, Jim and Khristenko, Viktor and Motesnitsalis, Evangelos and Pivarski, Jim and Sehrish, Saba and Surdy, Kacper and Svyatkovskiy, Alexey},
abstractNote = {Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems for distributed data processing, collectively called "Big Data" technologies have emerged from industry and open source projects to support the analysis of Petabyte and Exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and tools, promising a fresh look at analysis of very large datasets that could potentially reduce the time-to-physics with increased interactivity. Moreover these new tools are typically actively developed by large communities, often profiting of industry resources, and under open source licensing. These factors result in a boost for adoption and maturity of the tools and for the communities supporting them, at the same time helping in reducing the cost of ownership for the end-users. In this talk, we are presenting studies of using Apache Spark for end user data analysis. We are studying the HEP analysis workflow separated into two thrusts: the reduction of centrally produced experiment datasets and the end-analysis up to the publication plot. Studying the first thrust, CMS is working together with CERN openlab and Intel on the CMS Big Data Reduction Facility. The goal is to reduce 1 PB of official CMS data to 1 TB of ntuple output for analysis. We are presenting the progress of this 2-year project with first results of scaling up Spark-based HEP analysis. Studying the second thrust, we are presenting studies on using Apache Spark for a CMS Dark Matter physics search, comparing Spark's feasibility, usability and performance to the ROOT-based analysis.},
place = {United States},
year = {2017}
}

