XRootD popularity on hadoop clusters

Meoni, Marco; Boccali, Tommaso; Magini, Nicolò; Menichetti, Luca; Giordano, Domenico

doi:10.1088/1742-6596/898/7/072027

Abstract

Performance data and metadata of the computing operations at the CMS experiment are collected through a distributed monitoring infrastructure that currently relies on a traditional Oracle database system. This paper shows how to harness Big Data architectures to improve the throughput and efficiency of such monitoring. A large set of operational data (user activities, job submissions, resources, file transfers, site efficiencies, software releases, network traffic, machine logs) is being injected into a readily available Hadoop cluster via several data streamers. The collected metadata is further organized by running fast arbitrary queries; this offers the ability to test several MapReduce-based frameworks and to measure the system speed-up relative to the original database infrastructure. By leveraging a quality Hadoop data store and enabling an analytics framework on top, it is possible to design a mining platform to predict dataset popularity and discover patterns and correlations.

Authors:

Meoni, Marco [1]; Boccali, Tommaso [2]; Magini, Nicolò [3]; Menichetti, Luca [4]; Giordano, Domenico [5]

[1] Univ. of Pisa (Italy); Istituto Nazionale di Fisica Nucleare (INFN), Pisa (Italy)
[2] Istituto Nazionale di Fisica Nucleare (INFN), Pisa (Italy)
[3] Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
[4] European Organization for Nuclear Research (CERN), Geneva (Switzerland)
[5] European Organization for Nuclear Research (CERN), Geneva (Switzerland); CMS Collaboration, et al.

Publication Date: November 22, 2017

Research Org.: Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)

Sponsoring Org.: USDOE Office of Science (SC), High Energy Physics (HEP)

Contributing Org.: CMS Collaboration

OSTI Identifier: 1831862

Report Number(s): FERMILAB-PUB-17-715-CMS; Journal ID: ISSN 1742-6588; oai:inspirehep.net:1638557; TRN: US2216651

Grant/Contract Number: AC02-07CH11359

Resource Type: Accepted Manuscript

Journal Name: Journal of Physics. Conference Series

Additional Journal Information: Journal Volume: 898; Journal Issue: 7; Journal ID: ISSN 1742-6588

Publisher: IOP Publishing

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING

Citation Formats


Meoni, Marco, Boccali, Tommaso, Magini, Nicolò, Menichetti, Luca, and Giordano, Domenico. XRootD popularity on hadoop clusters. United States: N. p., 2017. Web. doi:10.1088/1742-6596/898/7/072027.

Meoni, Marco, Boccali, Tommaso, Magini, Nicolò, Menichetti, Luca, & Giordano, Domenico. XRootD popularity on hadoop clusters. United States. https://doi.org/10.1088/1742-6596/898/7/072027

Meoni, Marco, Boccali, Tommaso, Magini, Nicolò, Menichetti, Luca, and Giordano, Domenico. 2017. "XRootD popularity on hadoop clusters". United States. https://doi.org/10.1088/1742-6596/898/7/072027. https://www.osti.gov/servlets/purl/1831862.

@article{osti_1831862,
  title        = {XRootD popularity on hadoop clusters},
  author       = {Meoni, Marco and Boccali, Tommaso and Magini, Nicolò and Menichetti, Luca and Giordano, Domenico},
  abstractNote = {Performance data and metadata of the computing operations at the CMS experiment are collected through a distributed monitoring infrastructure that currently relies on a traditional Oracle database system. This paper shows how to harness Big Data architectures to improve the throughput and efficiency of such monitoring. A large set of operational data (user activities, job submissions, resources, file transfers, site efficiencies, software releases, network traffic, machine logs) is being injected into a readily available Hadoop cluster via several data streamers. The collected metadata is further organized by running fast arbitrary queries; this offers the ability to test several MapReduce-based frameworks and to measure the system speed-up relative to the original database infrastructure. By leveraging a quality Hadoop data store and enabling an analytics framework on top, it is possible to design a mining platform to predict dataset popularity and discover patterns and correlations.},
  doi          = {10.1088/1742-6596/898/7/072027},
  journal      = {Journal of Physics. Conference Series},
  number       = 7,
  volume       = 898,
  place        = {United States},
  year         = {2017},
  month        = {11}
}

Works referenced in this record:

Shiers, Jamie. The Worldwide LHC Computing Grid (worldwide LCG). Computer Physics Communications, Vol. 177, Issue 1-2 (July 2007). DOI: 10.1016/j.cpc.2007.02.021

The CMS Collaboration. CMS Physics Technical Design Report, Volume II: Physics Performance. Journal of Physics G: Nuclear and Particle Physics, Vol. 34, Issue 6 (April 2007). DOI: 10.1088/0954-3899/34/6/S01

Works referencing / citing this record:

Meoni, Marco; Perego, Raffaele; Tonellotto, Nicola. Dataset Popularity Prediction for Caching of CMS Big Data. Journal of Grid Computing, Vol. 16, Issue 2 (February 2018). DOI: 10.1007/s10723-018-9436-4
