OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Spark-hdf5

Abstract

The spark-hdf5 package extends Apache Spark to provide native access to HDF5 files. It lets users query these structured files with SQL-like syntax and can parallelize large queries across several workers.
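
The idea of splitting one large query across several workers can be illustrated with a minimal, standard-library Python sketch. This is not spark-hdf5's API and does not read HDF5 files: the in-memory `DATASET`, the chunk size, and every function name below are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for an HDF5 dataset: rows of (index, value).
# Nothing here is spark-hdf5's API; it only illustrates chunked, parallel filtering.
DATASET = [(i, float(i % 7)) for i in range(1000)]


def split_into_chunks(rows, chunk_size):
    """Yield consecutive slices of `rows`, each at most `chunk_size` long."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def query_chunk(chunk, min_value):
    """Per-worker step, comparable to `SELECT * ... WHERE value >= min_value`."""
    return [row for row in chunk if row[1] >= min_value]


def parallel_query(rows, min_value, chunk_size=100, workers=4):
    """Filter each chunk on a worker, then concatenate results in chunk order."""
    chunks = list(split_into_chunks(rows, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(lambda c: query_chunk(c, min_value), chunks)
        return [row for part in partial_results for row in part]


matches = parallel_query(DATASET, min_value=6.0)
```

Each worker sees only its own chunk, so the filter needs no coordination until the final concatenation. A real deployment would presumably hand chunks to Spark executors rather than to local threads.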

Authors:
 Asplund, Joshua [1]; Jiang, Ming [1]; Gallagher, Brian [1]; Miller, Mark [1]; Harrison, Cyrus [1]
  1. Lawrence Livermore National Laboratory
Publication Date:
2016-07-05
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
Contributing Org.:
Lawrence Livermore National Laboratory
OSTI Identifier:
1298050
Report Number(s):
spark-hdf5; 004879MLTPL00
LLNL-CODE-699384
DOE Contract Number:
AC52-07NA27344
Resource Type:
Software
Software Revision:
00
Software Package Number:
004879
Software CPU:
MLTPL
Open Source:
Yes
Source Code Available:
Yes
Country of Publication:
United States

Citation Formats

Asplund, Joshua, Jiang, Ming, Gallagher, Brian, Miller, Mark, and Harrison, Cyrus. Spark-hdf5. Computer software. https://www.osti.gov//servlets/purl/1298050. Vers. 00. USDOE Office of Science (SC). 5 Jul. 2016. Web.
Asplund, Joshua, Jiang, Ming, Gallagher, Brian, Miller, Mark, & Harrison, Cyrus. (2016, July 5). Spark-hdf5 (Version 00) [Computer software]. https://www.osti.gov//servlets/purl/1298050.
Asplund, Joshua, Jiang, Ming, Gallagher, Brian, Miller, Mark, and Harrison, Cyrus. Spark-hdf5. Computer software. Version 00. July 5, 2016. https://www.osti.gov//servlets/purl/1298050.
@misc{osti_1298050,
title = {Spark-hdf5, Version 00},
author = {Asplund, Joshua and Jiang, Ming and Gallagher, Brian and Miller, Mark and Harrison, Cyrus},
abstractNote = {The spark-hdf5 package is an extension to the Apache Spark program to allow native access to HDF5 files. It allows users to query the structured files using SQL-like syntax, and can parallelize large queries across several workers.},
url = {https://www.osti.gov//servlets/purl/1298050},
doi = {},
year = {2016},
month = {7}
}

Software:
To order this software, request consultation services, or receive further information, please fill out the following request.

Similar records:
  • The Analytical Spectroscopy Section of the Analytical Chemistry Division has had software to process spark source mass spectrometric (SSMS) data in operation for over two decades. Although the system has been verified in analysis of standards numerous times throughout its operation, documentation has never been of primary concern. In recent years the quality assurance (QA) requirements of both Martin Marietta Energy Systems (MMES) and various sponsoring agencies have increased. This report provides the documentation, verification, and validation of the software used to process SSMS data that will satisfy the QA requirements of most analytical programs. The operation of each of the three major computer routines -- EMLCAL, PLATE, PEAK -- is described in enough detail to give a clear understanding of its function. Verification was accomplished by comparing code results to hand calculations, to physical data, and to an alternative code designed to perform the same type of analysis. Validation was accomplished by an isotopic analysis of a natural erbium standard and comparison of the results with the accepted isotopic abundances. Appendices contain user instructions, samples of outputs, examples of the data files, and definitions of all the labels and variables used within the program code.
  • Large-scale scientific data is often stored in scientific data formats such as FITS, netCDF and HDF. These storage formats are of particular interest to the scientific user community since they provide multi-dimensional storage and retrieval. However, one of the drawbacks of these storage formats is that they do not support semantic indexing, which is important for interactive data analysis where scientists look for features of interest such as "Find all supernova explosions where energy > 10^5 and temperature > 10^6". In this paper we present a novel approach called HDF5-FastQuery to accelerate the data access of large HDF5 files by introducing multi-dimensional semantic indexing. Our implementation leverages an efficient indexing technology called "bitmap indexing" that has been widely used in the database community. Bitmap indices are especially well suited for interactive exploration of large-scale read-only data. Storing the bitmap indices in the HDF5 file has the following advantages: (a) significant performance speedup of accessing subsets of multi-dimensional data and (b) portability of the indices across multiple computer platforms. We will present an API that simplifies the execution of queries on HDF5 files for general scientific applications and data analysis. The design is flexible enough to accommodate the use of arbitrary indexing technology for semantic range queries. We will also provide a detailed performance analysis of HDF5-FastQuery for both synthetic and scientific data. The results demonstrate that our proposed approach for multi-dimensional queries is up to a factor of 2 faster than HDF5.
  • Large-scale scientific data is often stored in scientific data formats such as FITS, netCDF and HDF. These storage formats are of particular interest to the scientific user community since they provide multi-dimensional storage and retrieval. However, one of the drawbacks of these storage formats is that they do not support semantic indexing, which is important for interactive data analysis where scientists look for features of interest such as "Find all supernova explosions where energy > 10^5 and temperature > 10^6". In this paper we present a novel approach called HDF5-FastQuery to accelerate the data access of large HDF5 files by introducing multi-dimensional semantic indexing. Our implementation leverages an efficient indexing technology called bitmap indexing that has been widely used in the database community. Bitmap indices are especially well suited for interactive exploration of large-scale read-only data. Storing the bitmap indices in the HDF5 file has the following advantages: (a) significant performance speedup of accessing subsets of multi-dimensional data and (b) portability of the indices across multiple computer platforms. We will present an API that simplifies the execution of queries on HDF5 files for general scientific applications and data analysis. The design is flexible enough to accommodate the use of arbitrary indexing technology for semantic range queries. We will also provide a detailed performance analysis of HDF5-FastQuery for both synthetic and scientific data. The results demonstrate that our proposed approach for multi-dimensional queries is up to a factor of 2 faster than HDF5.
  • This work focuses on research and development activities that bridge a gap between fundamental data management technology (index, query, storage, and retrieval) and the use of such technology in computational and computer science algorithms and applications. The work has resulted in a streamlined application programming interface (API) that simplifies data storage and retrieval using the HDF5 data I/O library, and eases use of the FastBit compressed bitmap indexing software for data indexing and querying. The API, which we call HDF5-FastQuery, will have broad applications in domain sciences as well as associated data analysis and visualization applications.
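
The bitmap-indexing idea mentioned in the records above can be sketched in a few lines of standard-library Python. This is a simplified, uncompressed illustration, not the FastBit or HDF5-FastQuery implementation: the sample values, bin edges, and function names are all assumptions, and the query threshold is assumed to coincide with a bin edge so that the index alone answers the query exactly.

```python
from bisect import bisect_left


def build_bitmap_index(values, bin_edges):
    """Return one integer bitmap per bin; bit i is set when values[i] is in that bin.

    Bin k holds values v with bin_edges[k-1] < v <= bin_edges[k]; the first and
    last bins are open-ended.
    """
    bitmaps = [0] * (len(bin_edges) + 1)
    for row, value in enumerate(values):
        bitmaps[bisect_left(bin_edges, value)] |= 1 << row
    return bitmaps


def rows_greater_than(bitmaps, bin_edges, threshold):
    """Bitmap of rows with value > threshold; threshold must be one of bin_edges."""
    first_bin = bin_edges.index(threshold) + 1
    result = 0
    for bitmap in bitmaps[first_bin:]:
        result |= bitmap  # OR together every bin that lies above the threshold
    return result


def bitmap_to_rows(bitmap):
    """List the row numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical sample data, invented for this sketch.
energy = [5e4, 2e5, 1e5, 3e6, 9e5]
temperature = [2e6, 5e5, 3e6, 1e7, 1e6]
energy_edges = [1e4, 1e5, 1e6]
temperature_edges = [1e5, 1e6, 1e7]

energy_index = build_bitmap_index(energy, energy_edges)
temperature_index = build_bitmap_index(temperature, temperature_edges)

# "energy > 10^5 AND temperature > 10^6" becomes a bitwise AND of two bitmaps.
hits = (rows_greater_than(energy_index, energy_edges, 1e5)
        & rows_greater_than(temperature_index, temperature_edges, 1e6))
matching_rows = bitmap_to_rows(hits)  # [3]
```

A multi-attribute range query reduces to bitwise OR and AND over precomputed bitmaps, which is why such indices suit read-only data that is queried repeatedly.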

To initiate an order for this software, request consultation services, or receive further information, fill out the software request form.

OSTI staff will begin to process an order for scientific and technical software once the payment and signed site license agreement are received. If the forms are not in order, OSTI will contact you. No further action will be taken until all required information and/or payment is received. Orders are usually processed within three to five business days.
