OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Striped Data Server for Scalable Parallel Data Analysis

Abstract

A columnar data representation is known to be an efficient way to store data, particularly when an analysis typically uses only a small fragment of the available data structures. A data representation like Apache Parquet goes a step beyond a purely columnar representation by also splitting data horizontally, which allows for easy parallelization of data analysis. Based on the general idea of columnar data storage, and working on the [LDRD Project], we have developed a striped data representation which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approach allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We will present an implementation and some performance characteristics of such a data representation mechanism using a distributed NoSQL database or a local file system, unified under the same API and data representation model. The representation is efficient and at the same time simple, so it allows for a common data model and APIs across a wide range of underlying storage mechanisms, such as distributed NoSQL databases and local file systems. Striped storage adopts NumPy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service, which hides the server implementation details from the end user, easily exposes data to WAN users, and makes it possible to use well-known, mature data caching solutions to further increase data access efficiency. We are considering the Striped Data Server as the core of an enterprise-scale data analysis platform for High Energy Physics and similar areas of data processing. We have been testing this architecture with a 2 TB dataset from a CMS dark matter search and plan to expand it to multiple 100 TB or even PB scale. We will present the striped format, the Striped Data Server architecture, and performance test results.
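
The striping idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation or the Striped Data Server API: the function names (make_stripes, count_high_pt, parallel_count), the column names ("pt", "eta"), and the stripe size are hypothetical, chosen only to show columns stored as NumPy arrays, split horizontally into row ranges, and processed independently before the partial results are combined.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def make_stripes(columns, stripe_size):
    """Split every column (a 1-D NumPy array) into consecutive row ranges.

    Each stripe is a dict mapping column name -> array slice, so a worker
    can load only the columns it needs for its own range of rows.
    """
    n_rows = len(next(iter(columns.values())))
    stripes = []
    for start in range(0, n_rows, stripe_size):
        stop = min(start + stripe_size, n_rows)
        stripes.append({name: col[start:stop] for name, col in columns.items()})
    return stripes


def count_high_pt(stripe, threshold):
    """Per-stripe computation: touches only the hypothetical 'pt' column."""
    return int(np.count_nonzero(stripe["pt"] > threshold))


def parallel_count(stripes, threshold, workers=4):
    """Map the per-stripe computation over all stripes, then sum the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda s: count_high_pt(s, threshold), stripes)
        return sum(partial)


# Synthetic data standing in for event columns (assumption, not real CMS data).
rng = np.random.default_rng(0)
columns = {
    "pt": rng.exponential(30.0, size=10_000),
    "eta": rng.uniform(-2.5, 2.5, size=10_000),
}
stripes = make_stripes(columns, stripe_size=1_000)
total = parallel_count(stripes, threshold=50.0)
```

Because each stripe holds the same row range for every column, the per-stripe function needs no coordination with other workers, and the final answer is just the sum of the partial counts. A thread pool is used here only to keep the sketch self-contained; the same map-and-combine pattern would apply to workers on separate machines.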

Authors:
Chang, Jin [1]; Gutsche, Oliver [1]; Mandrichenko, Igor [1]; Pivarski, James [1]
  1. Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Publication Date:
Research Org.:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
OSTI Identifier:
1452811
Report Number(s):
FERMILAB-CONF-18-016-CD
Journal ID: ISSN 1742-6588; 1676675
Grant/Contract Number:  
AC02-07CH11359
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Journal of Physics. Conference Series
Additional Journal Information:
Journal Volume: 1085; Journal Issue: 4; Conference: 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research, Seattle, WA (United States), 21-25 Aug 2017; Journal ID: ISSN 1742-6588
Publisher:
IOP Publishing
Country of Publication:
United States
Language:
English

Citation Formats

Chang, Jin, Gutsche, Oliver, Mandrichenko, Igor, and Pivarski, James. Striped Data Server for Scalable Parallel Data Analysis. United States: N. p., 2018. Web. doi:10.1088/1742-6596/1085/4/042035.
Chang, Jin, Gutsche, Oliver, Mandrichenko, Igor, & Pivarski, James. Striped Data Server for Scalable Parallel Data Analysis. United States. doi:10.1088/1742-6596/1085/4/042035.
Chang, Jin, Gutsche, Oliver, Mandrichenko, Igor, and Pivarski, James. Sat . "Striped Data Server for Scalable Parallel Data Analysis". United States. doi:10.1088/1742-6596/1085/4/042035. https://www.osti.gov/servlets/purl/1452811.
@article{osti_1452811,
title = {Striped Data Server for Scalable Parallel Data Analysis},
author = {Chang, Jin and Gutsche, Oliver and Mandrichenko, Igor and Pivarski, James},
doi = {10.1088/1742-6596/1085/4/042035},
journal = {Journal of Physics. Conference Series},
issn = {1742-6588},
number = 4,
volume = 1085,
place = {United States},
year = {2018},
month = {9}
}

Figures / Tables:

Figure 1: Data analysis process

Figures/Tables have been extracted from DOE-funded journal article accepted manuscripts.