skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Adding Data Management Services to Parallel File Systems

Technical Report ·
DOI:https://doi.org/10.2172/1171718· OSTI ID:1171718
 [1]
  1. Univ. of California, Santa Cruz, CA (United States)

The objective of this project, called DAMASC for “Data Management in Scientific Computing”, is to coalesce data management with parallel file system management to present a declarative interface to scientists for managing, querying, and analyzing extremely large data sets efficiently and predictably. Managing extremely large data sets is a key challenge of exascale computing. The overhead, energy, and cost of moving massive volumes of data demand designs where computation is close to storage. In current architectures, compute/analysis clusters access data in a physically separate parallel file system and largely leave it scientist to reduce data movement. Over the past decades the high-end computing community has adopted middleware with multiple layers of abstractions and specialized file formats such as NetCDF-4 and HDF5. These abstractions provide a limited set of high-level data processing functions, but have inherent functionality and performance limitations: middleware that provides access to the highly structured contents of scientific data files stored in the (unstructured) file systems can only optimize to the extent that file system interfaces permit; the highly structured formats of these files often impedes native file system performance optimizations. We are developing Damasc, an enhanced high-performance file system with native rich data management services. Damasc will enable efficient queries and updates over files stored in their native byte-stream format while retaining the inherent performance of file system data storage via declarative queries and updates over views of underlying files. Damasc has four key benefits for the development of data-intensive scientific code: (1) applications can use important data-management services, such as declarative queries, views, and provenance tracking, that are currently available only within database systems; (2) the use of these services becomes easier, as they are provided within a familiar file-based ecosystem; (3) common optimizations, e.g., indexing and caching, are readily supported across several file formats, avoiding effort duplication; and (4) performance improves significantly, as data processing is integrated more tightly with data storage. Our key contributions are: SciHadoop which explores changes to MapReduce assumption by taking advantage of semantics of structured data while preserving MapReduce’s failure and resource management; DataMods which extends common abstractions of parallel file systems so they become programmable such that they can be extended to natively support a variety of data models and can be hooked into emerging distributed runtimes such as Stanford’s Legion; and Miso which combines Hadoop and relational data warehousing to minimize time to insight, taking into account the overhead of ingesting data into data warehousing.

Research Organization:
The Regents of the University of California, Santa Cruz, CA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
SC0005428
OSTI ID:
1171718
Report Number(s):
DE-FC02-10ER26033; SC0005428
Country of Publication:
United States
Language:
English