Adding Data Management Services to Parallel File Systems

Brandt, Scott

doi:10.2172/1171718

Technical Report · March 4, 2015

DOI: https://doi.org/10.2172/1171718 · OSTI ID: 1171718

Univ. of California, Santa Cruz, CA (United States)

The objective of this project, called DAMASC for “Data Management in Scientific Computing”, is to coalesce data management with parallel file system management to present a declarative interface to scientists for managing, querying, and analyzing extremely large data sets efficiently and predictably. Managing extremely large data sets is a key challenge of exascale computing. The overhead, energy, and cost of moving massive volumes of data demand designs where computation is close to storage. In current architectures, compute/analysis clusters access data in a physically separate parallel file system and largely leave it to the scientist to reduce data movement. Over the past decades the high-end computing community has adopted middleware with multiple layers of abstraction and specialized file formats such as NetCDF-4 and HDF5. These abstractions provide a limited set of high-level data processing functions, but have inherent functionality and performance limitations: middleware that provides access to the highly structured contents of scientific data files stored in (unstructured) file systems can only optimize to the extent that file system interfaces permit, and the highly structured formats of these files often impede native file system performance optimizations. We are developing Damasc, an enhanced high-performance file system with native rich data management services. Damasc will enable efficient declarative queries and updates over views of files stored in their native byte-stream format, while retaining the inherent performance of file system data storage.
Damasc has four key benefits for the development of data-intensive scientific code: (1) applications can use important data-management services, such as declarative queries, views, and provenance tracking, that are currently available only within database systems; (2) the use of these services becomes easier, as they are provided within a familiar file-based ecosystem; (3) common optimizations, e.g., indexing and caching, are readily supported across several file formats, avoiding duplicated effort; and (4) performance improves significantly, as data processing is integrated more tightly with data storage. Our key contributions are: SciHadoop, which explores changes to MapReduce's assumptions by taking advantage of the semantics of structured data while preserving MapReduce’s failure and resource management; DataMods, which extends the common abstractions of parallel file systems to make them programmable, so that they can natively support a variety of data models and can be hooked into emerging distributed runtimes such as Stanford’s Legion; and Miso, which combines Hadoop and relational data warehousing to minimize time to insight, taking into account the overhead of ingesting data into the data warehouse.

Research Organization: The Regents of the University of California, Santa Cruz, CA (United States)

Sponsoring Organization: USDOE

DOE Contract Number: SC0005428

OSTI ID: 1171718

Report Number(s): DE-FC02-10ER26033; SC0005428

Country of Publication: United States

Language: English

Similar Records

Center for Technology for Advanced Scientific Component Software (TASCS)

Technical Report · October 31, 2010 · OSTI ID:1171718

Govindaraju, Madhusudhan

ArrayBridge: Interweaving declarative array processing with high-performance computing

Conference · May 4, 2017 · OSTI ID:1171718

Xing, Haoyuan; Floratos, Sofoklis; Blanas, Spyros; +4 more

...And Eat it Too: High Read Performance in Write-Optimized HPC I/O Middleware File Formats

Conference · January 1, 2009 · OSTI ID:1171718

Klasky, Scott A; Lofstead, J.; Bent, John; +5 more

Related Subjects

96 KNOWLEDGE MANAGEMENT AND PRESERVATION
97 MATHEMATICS AND COMPUTING
FILEsystems
Data Management
