DOE PAGES
U.S. Department of Energy
Office of Scientific and Technical Information

Title: XRootD popularity on hadoop clusters

Abstract

Performance data and metadata of the computing operations at the CMS experiment are collected through a distributed monitoring infrastructure, currently relying on a traditional Oracle database system. This paper shows how to harness Big Data architectures in order to improve the throughput and the efficiency of such monitoring. A large set of operational data - user activities, job submissions, resources, file transfers, site efficiencies, software releases, network traffic, machine logs - is being injected into a readily available Hadoop cluster, via several data streamers. The collected metadata is further organized by running fast arbitrary queries; this offers the ability to test several MapReduce-based frameworks and measure the system speed-up when compared to the original database infrastructure. By leveraging a quality Hadoop data store and enabling an analytics framework on top, it is possible to design a mining platform to predict dataset popularity and discover patterns and correlations.

Authors:
Meoni, Marco [1]; Boccali, Tommaso [2]; Magini, Nicolò [3]; Menichetti, Luca [4]; Giordano, Domenico [5]
  1. Univ. of Pisa (Italy); Istituto Nazionale di Fisica Nucleare (INFN), Pisa (Italy)
  2. Istituto Nazionale di Fisica Nucleare (INFN), Pisa (Italy)
  3. Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
  4. European Organization for Nuclear Research (CERN), Geneva (Switzerland)
  5. European Organization for Nuclear Research (CERN), Geneva (Switzerland); CMS Collaboration, et al.
Publication Date:
2017-11-22
Research Org.:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP)
Contributing Org.:
CMS Collaboration
OSTI Identifier:
1831862
Report Number(s):
FERMILAB-PUB-17-715-CMS
Journal ID: ISSN 1742-6588; oai:inspirehep.net:1638557; TRN: US2216651
Grant/Contract Number:  
AC02-07CH11359
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Physics. Conference Series
Additional Journal Information:
Journal Volume: 898; Journal Issue: 7; Journal ID: ISSN 1742-6588
Publisher:
IOP Publishing
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Meoni, Marco, Boccali, Tommaso, Magini, Nicolò, Menichetti, Luca, and Giordano, Domenico. XRootD popularity on hadoop clusters. United States: N. p., 2017. Web. doi:10.1088/1742-6596/898/7/072027.
Meoni, Marco, Boccali, Tommaso, Magini, Nicolò, Menichetti, Luca, & Giordano, Domenico. XRootD popularity on hadoop clusters. United States. https://doi.org/10.1088/1742-6596/898/7/072027
Meoni, Marco, Boccali, Tommaso, Magini, Nicolò, Menichetti, Luca, and Giordano, Domenico. 2017. "XRootD popularity on hadoop clusters". United States. https://doi.org/10.1088/1742-6596/898/7/072027. https://www.osti.gov/servlets/purl/1831862.
@article{osti_1831862,
title = {XRootD popularity on hadoop clusters},
author = {Meoni, Marco and Boccali, Tommaso and Magini, Nicolò and Menichetti, Luca and Giordano, Domenico},
abstractNote = {Performance data and metadata of the computing operations at the CMS experiment are collected through a distributed monitoring infrastructure, currently relying on a traditional Oracle database system. This paper shows how to harness Big Data architectures in order to improve the throughput and the efficiency of such monitoring. A large set of operational data - user activities, job submissions, resources, file transfers, site efficiencies, software releases, network traffic, machine logs - is being injected into a readily available Hadoop cluster, via several data streamers. The collected metadata is further organized by running fast arbitrary queries; this offers the ability to test several MapReduce-based frameworks and measure the system speed-up when compared to the original database infrastructure. By leveraging a quality Hadoop data store and enabling an analytics framework on top, it is possible to design a mining platform to predict dataset popularity and discover patterns and correlations.},
doi = {10.1088/1742-6596/898/7/072027},
journal = {Journal of Physics. Conference Series},
number = 7,
volume = 898,
place = {United States},
year = {2017},
month = {11}
}

Works referencing / citing this record:

Dataset Popularity Prediction for Caching of CMS Big Data
journal, February 2018

  • Meoni, Marco; Perego, Raffaele; Tonellotto, Nicola
  • Journal of Grid Computing, Vol. 16, Issue 2
  • DOI: 10.1007/s10723-018-9436-4