Distributed caching for processing raw arrays
- University of California Merced
- University of California Merced and Lawrence Berkeley National Laboratory
- Lawrence Berkeley National Laboratory
- CalTech
As applications generate multi-dimensional data at exponentially increasing rates, fast analytics that extract meaningful results are becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format, without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. In the first stage, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, it determines an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority (by as much as two orders of magnitude) of the proposed framework over existing techniques in terms of cache overhead and workload execution time.
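The placement idea in the abstract (collocating cells that queries access together so that less data moves between nodes) can be illustrated with a small sketch. This is not the paper's algorithm: the function names `coaccess_counts`, `greedy_placement`, and `transfer_cost`, the unit-size chunks, the per-node `capacity`, and the cost model (chunks living off a query's majority node) are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def coaccess_counts(workload):
    """Count how often each pair of chunks appears in the same query."""
    pairs = Counter()
    for query in workload:
        for a, b in combinations(sorted(query), 2):
            pairs[(a, b)] += 1
    return pairs


def greedy_placement(workload, num_nodes, capacity):
    """Hypothetical heuristic: place strongly co-accessed chunks on one node.

    Assumes every chunk has unit size and every node caches `capacity` chunks.
    """
    pairs = coaccess_counts(workload)
    placement = {}
    load = [0] * num_nodes

    def put(chunk, preferred=None):
        if chunk in placement:
            return
        if preferred is not None and load[preferred] < capacity:
            node = preferred
        else:
            # Fall back to the least loaded node (lowest id breaks ties).
            node = min(range(num_nodes), key=lambda n: (load[n], n))
            if load[node] >= capacity:
                raise ValueError("total cache capacity exceeded")
        placement[chunk] = node
        load[node] += 1

    # Visit pairs from most to least frequently co-accessed.
    for (a, b), _ in sorted(pairs.items(), key=lambda kv: (-kv[1], kv[0])):
        if a in placement:
            put(b, placement[a])
        elif b in placement:
            put(a, placement[b])
        else:
            put(a)
            put(b, placement[a])
    # Chunks that never co-occur with another chunk still need a home.
    for query in workload:
        for chunk in sorted(query):
            put(chunk)
    return placement


def transfer_cost(workload, placement):
    """Assumed cost model: chunks per query that live off its majority node."""
    cost = 0
    for query in workload:
        nodes = Counter(placement[c] for c in query)
        cost += len(query) - max(nodes.values())
    return cost


# Historical workload: each query is the set of chunks it touches.
workload = [{"A", "B"}] * 2 + [{"C", "D"}] * 2 + [{"B", "C"}]
placement = greedy_placement(workload, num_nodes=2, capacity=2)
cost = transfer_cost(workload, placement)
```

In this toy workload the heuristic keeps `A` with `B` and `C` with `D`, so only the single query spanning `B` and `C` needs a remote chunk. A real system would also have to account for chunk sizes, eviction, and a workload that changes online, which this sketch ignores.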
- Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization: USDOE Office of Science (SC)
- DOE Contract Number: AC02-05CH11231
- OSTI ID: 1580975
- Country of Publication: United States
- Language: English