Towards scalable dataframe systems
Journal Article · Proceedings of the VLDB Endowment
- Univ. of California, Berkeley, CA (United States); OSTI
- Univ. of California, Berkeley, CA (United States)
Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant ambiguity regarding dataframe semantics. In this paper we lay out a vision and roadmap for scalable dataframe systems. To demonstrate the potential in this area, we report on our experience building Modin, a scaled-up implementation of the most widely-used and complex dataframe API today, Python's pandas. With pandas as a reference, we propose a simple data model and algebra for dataframes to ground discussion in the field. Given this foundation, we lay out an agenda of open research opportunities where the distinct features of dataframes will require extending the state of the art in many dimensions of data management. We discuss the implications of signature dataframe features including flexible schemas, ordering, row/column equivalence, and data/metadata fluidity, as well as the piecemeal, trial-and-error-based approach to interacting with dataframes.
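Two of the signature features named in the abstract, row/column equivalence and data/metadata fluidity, can be illustrated with a minimal sketch. The `ToyFrame` class and its method names below are hypothetical and exist only for illustration; they are not taken from the paper, from Modin, or from pandas. The sketch assumes a dataframe can be modeled as ordered row labels, ordered column labels, and a grid of values whose cells may have mixed types.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ToyFrame:
    """Hypothetical toy dataframe: ordered labels on both axes plus a grid of cells."""
    row_labels: List[Any]
    col_labels: List[Any]
    values: List[List[Any]]  # values[i][j] is the cell at row i, column j

    def transpose(self) -> "ToyFrame":
        # Row/column equivalence: both axes are labeled and ordered,
        # so swapping them yields another valid frame.
        flipped = [list(col) for col in zip(*self.values)]
        return ToyFrame(self.col_labels, self.row_labels, flipped)

    def promote_first_row(self) -> "ToyFrame":
        # Data/metadata fluidity: the first row of data becomes the
        # column labels, and the remaining rows keep their order.
        return ToyFrame(self.row_labels[1:], list(self.values[0]), self.values[1:])


# A raw frame as it might arrive from a file with no header handling.
raw = ToyFrame(
    row_labels=[0, 1, 2],
    col_labels=[0, 1],
    values=[["city", "temp"], ["Oslo", 3], ["Lima", 19]],
)

labeled = raw.promote_first_row()  # data moves into metadata
flipped = labeled.transpose()      # columns become rows
```

In this toy model, transposing twice returns the original frame, and row order is preserved throughout, which reflects the ordering property the abstract also lists.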
- Research Organization:
- Univ. of California, Oakland, CA (United States)
- Sponsoring Organization:
- National Science Foundation (NSF); USDOE Office of Science (SC)
- Grant/Contract Number:
- SC0016934
- OSTI ID:
- 1803099
- Journal Information:
- Journal Name: Proceedings of the VLDB Endowment; Journal Volume: 13; Journal Issue: 12; ISSN 2150-8097
- Publisher:
- Association for Computing Machinery (ACM)
- Country of Publication:
- United States
- Language:
- English
Similar Records
Adapter Python IO (Adapter) v1.0
Software · Mon Apr 19 20:00:00 EDT 2021 · OSTI ID:code-56244
pandas-sacct v1.0.0
Software · Wed Sep 30 20:00:00 EDT 2020 · OSTI ID:code-51920
In-depth analysis on parallel processing patterns for high-performance Dataframes
Journal Article · Wed Jul 12 20:00:00 EDT 2023 · Future Generations Computer Systems · OSTI ID:3000839