skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Accessing Data Federations with CVMFS

Abstract

Data federations have become an increasingly common tool for large collaborations such as CMS and Atlas to efficiently distribute large data files. Unfortunately, these typically are implemented with weak namespace semantics and a non-POSIX API. On the other hand, CVMFS has provided a POSIX-compliant read-only interface for use cases with a small working set size (such as software distribution). The metadata required for the CVMFS POSIX interface is distributed through a caching hierarchy, allowing it to scale to the level of about a hundred thousand hosts. In this paper, we will describe our contributions to CVMFS that merges the data scalability of XRootD-based data federations (such as AAA) with metadata scalability and POSIX interface of CVMFS. We modified CVMFS so it can serve unmodified files without copying them to the repository server. CVMFS 2.2.0 is also able to redirect requests for data files to servers outside of the CVMFS content distribution network. Finally, we added the ability to manage authorization and authentication using security credentials such as X509 proxy certificates. We combined these modifications with the OSGs StashCache regional XRootD caching infrastructure to create a cached data distribution network. Here, we will show performance metrics accessing the data federation throughmore » CVMFS compared to direct data federation access. Additionally, we will discuss the improved user experience of providing access to a data federation through a POSIX filesystem.« less

Authors:
 [1];  [1];  [2];  [3];  [3]
  1. Univ. of Nebraska, Lincoln, NE (United States)
  2. Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
  3. European Organization for Nuclear Research (CERN), Geneva (Switzerland)
Publication Date:
Research Org.:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
OSTI Identifier:
1399106
Report Number(s):
FERMILAB-CONF-17-407-CD
Journal ID: ISSN 1742-6588; 1630215
Grant/Contract Number:
AC02-07CH11359
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Journal of Physics. Conference Series
Additional Journal Information:
Journal Volume: 898; Journal ID: ISSN 1742-6588
Publisher:
IOP Publishing
Country of Publication:
United States
Language:
English
Subject:
96 KNOWLEDGE MANAGEMENT AND PRESERVATION

Citation Formats

Weitzel, Derek, Bockelman, Brian, Dykstra, Dave, Blomer, Jakob, and Meusel, Ren. Accessing Data Federations with CVMFS. United States: N. p., 2017. Web. doi:10.1088/1742-6596/898/6/062044.
Weitzel, Derek, Bockelman, Brian, Dykstra, Dave, Blomer, Jakob, & Meusel, Ren. Accessing Data Federations with CVMFS. United States. doi:10.1088/1742-6596/898/6/062044.
Weitzel, Derek, Bockelman, Brian, Dykstra, Dave, Blomer, Jakob, and Meusel, Ren. 2017. "Accessing Data Federations with CVMFS". United States. doi:10.1088/1742-6596/898/6/062044. https://www.osti.gov/servlets/purl/1399106.
@article{osti_1399106,
title = {Accessing Data Federations with CVMFS},
author = {Weitzel, Derek and Bockelman, Brian and Dykstra, Dave and Blomer, Jakob and Meusel, Ren},
abstractNote = {Data federations have become an increasingly common tool for large collaborations such as CMS and Atlas to efficiently distribute large data files. Unfortunately, these typically are implemented with weak namespace semantics and a non-POSIX API. On the other hand, CVMFS has provided a POSIX-compliant read-only interface for use cases with a small working set size (such as software distribution). The metadata required for the CVMFS POSIX interface is distributed through a caching hierarchy, allowing it to scale to the level of about a hundred thousand hosts. In this paper, we will describe our contributions to CVMFS that merges the data scalability of XRootD-based data federations (such as AAA) with metadata scalability and POSIX interface of CVMFS. We modified CVMFS so it can serve unmodified files without copying them to the repository server. CVMFS 2.2.0 is also able to redirect requests for data files to servers outside of the CVMFS content distribution network. Finally, we added the ability to manage authorization and authentication using security credentials such as X509 proxy certificates. We combined these modifications with the OSGs StashCache regional XRootD caching infrastructure to create a cached data distribution network. Here, we will show performance metrics accessing the data federation through CVMFS compared to direct data federation access. Additionally, we will discuss the improved user experience of providing access to a data federation through a POSIX filesystem.},
doi = {10.1088/1742-6596/898/6/062044},
journal = {Journal of Physics. Conference Series},
number = ,
volume = 898,
place = {United States},
year = 2017,
month =
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Save / Share:
  • This paper deals with the generation of safe tasks for displacement missions of a nonholonomic mobile robot in a mapped indoor environment. The goal of this study is to plan safe actions (path-following) as well as observations (they call it local maps federation [LMF]), leading the robot to configurations where pertinent features can be sensed, thus attaining best localization relative to its environment. A path-planning method dealing with uncertainties is proposed, where both uncertainties in localization and in control of a nonholonomic mobile robot are managed. Maps uncertainties are handled using the local map concept, which is introduced as amore » set of the best landmarks used for planning and executing robust motion movements. The safeness of the proposed method is due to the mixing between the planning phase and the navigation phase.« less
  • Metagenomic sequencing has produced significant amounts of data in recent years. For example, as of summer 2013, MGRAST has been used to annotate over 110,000 data sets totaling over 43 Terabases. With metagenomic sequencing finding even wider adoption in the scientific community, the existing web-based analysis tools and infrastructure in MG-RAST provide limited capability for data retrieval and analysis, such as comparative analysis between multiple data sets. Moreover, although the system provides many analysis tools, it is not comprehensive. By opening MG-RAST up via a web services API (application programmers interface) we have greatly expanded access to MG-RAST data, asmore » well as provided a mechanism for the use of third-party analysis tools with MG-RAST data. This RESTful API makes all data and data objects created by the MG-RAST pipeline accessible as JSON objects. As part of the DOE Systems Biology Knowledgebase project (KBase, http:// kbase.us) we have implemented a web services API for MG-RAST. This API complements the existing MG-RAST web interface and constitutes the basis of KBase’s microbial community capabilities. In addition, the API exposes a comprehensive collection of data to programmers. This API, which uses a RESTful (Representational State Transfer) implementation, is compatible with most programming environments and should be easy to use for end users and third parties. It provides comprehensive access to sequence data, quality control results, annotations, and many other data types. Where feasible, we have used standards to expose data and metadata. Code examples are provided in a number of languages both to show the versatility of the API and to provide a starting point for users. We present an API that exposes the data in MG-RAST for consumption by our users, greatly enhancing the utility of the MG-RAST service.« less
  • Metagenomic sequencing has produced significant amounts of data in recent years. For example, as of summer 2013, MG-RAST has been used to annotate over 110,000 data sets totaling over 43 Terabases. With metagenomic sequencing finding even wider adoption in the scientific community, the existing web-based analysis tools and infrastructure in MG-RAST provide limited capability for data retrieval and analysis, such as comparative analysis between multiple data sets. Moreover, although the system provides many analysis tools, it is not comprehensive. By opening MG-RAST up via a web services API (application programmers interface) we have greatly expanded access to MG-RAST data, asmore » well as provided a mechanism for the use of third-party analysis tools with MG-RAST data. This RESTful API makes all data and data objects created by the MG-RAST pipeline accessible as JSON objects. As part of the DOE Systems Biology Knowledgebase project (KBase, http://kbase.us) we have implemented a web services API for MG-RAST. This API complements the existing MG-RAST web interface and constitutes the basis of KBase's microbial community capabilities. In addition, the API exposes a comprehensive collection of data to programmers. This API, which uses a RESTful (Representational State Transfer) implementation, is compatible with most programming environments and should be easy to use for end users and third parties. It provides comprehensive access to sequence data, quality control results, annotations, and many other data types. Where feasible, we have used standards to expose data and metadata. Code examples are provided in a number of languages both to show the versatility of the API and to provide a starting point for users. We present an API that exposes the data in MG-RAST for consumption by our users, greatly enhancing the utility of the MG-RAST service.« less
  • The increasing number of vertebrate genomes being sequenced in draft or finished form provide a unique opportunity to study and decode the language of DNA sequence through comparative genome alignments. However, novel tools and strategies are required to accommodate this increasing volume of genomic information and to facilitate experimental annotation of genome function. Here we present the ECR Browser, a tool that provides an easy and dynamic access to whole genome alignments of human, mouse, rat and fish sequences. This web-based tool (http://ecrbrowser.dcode.org) provides the starting point for discovery of novel genes, identification of distant gene regulatory elements and predictionmore » of transcription factor binding sites. The genome alignment portal of the ECR Browser also permits fast and automated alignment of any user-submitted sequence to the genome of choice. The interconnection of the ECR browser with other DNA sequence analysis tools creates a unique portal for studying and exploring vertebrate genomes.« less
  • The concept of dense and sparse execution of arrays is introduced. Arrays themselves can be stored in a dense or sparse manner in a parallel memory with m memory modules. The paper proposes hardware for speeding up the execution of array operations of the form c(c/sub 0/+ci)=a(a/sub 0/+ai) op b(b/sub 0/+bi), where a/sub 0/, a, b/sub 0/, b, c/sub 0/, c are integer constants and i is an index variable. The hardware handles 'sparse execution', in which the operation op is not executed for every value of i. The hardware also makes provision for 'sparse storage', in which memory spacemore » is not provided for every array element. It is shown how to access array elements of the above form without conflict in an efficient way. The efficiency is obtained by using some specialised units which are basically smart memories with priority detection, one's counting or associative searching. Generalisation to multidimensional arrays is shown possible under restrictions defined in the paper. 12 references.« less