Distributed caching for processing raw arrays

Zhao, Weijie; Rusu, Florin; Dong, Bin; Wu, Kesheng; Ho, Anna Y. Q.; Nugent, Peter

doi:10.1145/3221269.3221295

Conference · Mon Jul 09 04:00:00 EDT 2018

DOI: https://doi.org/10.1145/3221269.3221295 · OSTI ID: 1580975

Zhao, Weijie [1]; Rusu, Florin [2]; Dong, Bin [3]; Wu, Kesheng [3]; Ho, Anna Y. Q. [4]; Nugent, Peter [3]

[1] University of California Merced
[2] University of California Merced and Lawrence Berkeley National Laboratory
[3] Lawrence Berkeley National Laboratory
[4] CalTech

As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format, without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. In the first stage, it identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, it determines an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority, by as much as two orders of magnitude, of the proposed framework over existing techniques in terms of cache overhead and workload execution time.
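
The placement idea in the abstract (collocating dependent cells to reduce data transfer) can be illustrated with a small sketch. This is not the paper's algorithm, which is not given in this record; it is a simple greedy stand-in that assumes each query names the chunks it needs together and that every chunk has one home node. The function name place_query_chunks and all of its parameters are hypothetical.

```python
from collections import defaultdict


def place_query_chunks(chunk_home, chunk_size, query_chunks):
    """Pick the node where one query's chunks should be collocated.

    Illustrative greedy rule (an assumption, not the paper's method):
    choose the node that already holds the most bytes the query needs,
    so the fewest bytes have to be moved.

    chunk_home:   chunk id -> node currently holding the chunk
    chunk_size:   chunk id -> size in bytes
    query_chunks: chunk ids the query needs together
    Returns (chosen_node, bytes_transferred).
    """
    if not query_chunks:
        raise ValueError("query_chunks must not be empty")

    # Sum the bytes of the requested chunks per home node.
    bytes_on_node = defaultdict(int)
    for chunk in query_chunks:
        bytes_on_node[chunk_home[chunk]] += chunk_size[chunk]

    # Largest local byte count wins; ties go to the smaller node name.
    best = min(bytes_on_node, key=lambda node: (-bytes_on_node[node], node))

    # Everything not already on the chosen node must be transferred.
    total = sum(chunk_size[chunk] for chunk in query_chunks)
    return best, total - bytes_on_node[best]


if __name__ == "__main__":
    home = {"a": "n1", "b": "n2", "c": "n2"}
    size = {"a": 100, "b": 40, "c": 30}
    print(place_query_chunks(home, size, ["a", "b", "c"]))
```

In the example, node n1 already holds 100 of the 170 requested bytes, so the sketch collocates the query there and moves the remaining 70 bytes. A real system would weigh many queries at once and respect per-node cache budgets, which this sketch ignores.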

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: AC02-05CH11231

OSTI ID: 1580975

Country of Publication: United States

Language: English
