Distributed caching for processing raw arrays

Conference
 [1];  [2];  [3];  [3];  [4];  [3]
  1. University of California Merced
  2. University of California Merced and Lawrence Berkeley National Laboratory
  3. Lawrence Berkeley National Laboratory
  4. CalTech
As applications generate multi-dimensional data at exponentially increasing rates, fast analytics that extract meaningful results are becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format, without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. In the first stage, it identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, it determines an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority of the proposed framework over existing techniques, by as much as two orders of magnitude, in terms of cache overhead and workload execution time.
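The idea of collocating dependent cells to reduce data transfer can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a simple greedy heuristic written for illustration only, and the function name `place_chunks`, the chunk and node identifiers, and the sizes are all hypothetical. For each query in the workload, the sketch moves every chunk the query touches to the node that already holds the most bytes of those chunks, and counts the bytes moved.

```python
from collections import defaultdict


def place_chunks(chunk_size, chunk_home, queries):
    """Greedy collocation sketch (illustrative, not the paper's method).

    chunk_size: dict chunk id -> size in bytes
    chunk_home: dict chunk id -> node currently caching the chunk
    queries:    list of lists; each inner list holds the chunk ids
                that one query needs on the same node

    Returns (placement, transferred_bytes).
    """
    placement = dict(chunk_home)  # copy so the input is not mutated
    transferred = 0
    for chunks in queries:
        # Sum the bytes this query already has on each node.
        bytes_on_node = defaultdict(int)
        for c in chunks:
            bytes_on_node[placement[c]] += chunk_size[c]
        # Choose the node holding the most bytes; sorting first makes
        # ties resolve deterministically to the smallest node name.
        target = max(sorted(bytes_on_node), key=lambda n: bytes_on_node[n])
        # Move the remaining chunks to the chosen node.
        for c in chunks:
            if placement[c] != target:
                transferred += chunk_size[c]
                placement[c] = target
    return placement, transferred


# Hypothetical workload: three chunks on two nodes, two queries.
sizes = {"a": 4, "b": 1, "c": 2}
homes = {"a": "n1", "b": "n2", "c": "n2"}
workload = [["a", "b"], ["b", "c"]]
placement, moved = place_chunks(sizes, homes, workload)
print(placement, moved)
```

Because the sketch handles queries one at a time, a chunk shared by two queries can be moved back and forth between nodes. Avoiding that kind of repeated transfer is one reason a placement heuristic would take the historical query workload into account, as the abstract describes.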
Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization: USDOE Office of Science (SC)
DOE Contract Number: AC02-05CH11231
OSTI ID: 1580975
Country of Publication: United States
Language: English

