U.S. Department of Energy
Office of Scientific and Technical Information

Towards scalable dataframe systems

Journal Article · Proceedings of the VLDB Endowment
Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant ambiguity regarding dataframe semantics. In this paper we lay out a vision and roadmap for scalable dataframe systems. To demonstrate the potential in this area, we report on our experience building Modin, a scaled-up implementation of the most widely-used and complex dataframe API today, Python's pandas. With pandas as a reference, we propose a simple data model and algebra for dataframes to ground discussion in the field. Given this foundation, we lay out an agenda of open research opportunities where the distinct features of dataframes will require extending the state of the art in many dimensions of data management. We discuss the implications of signature dataframe features including flexible schemas, ordering, row/column equivalence, and data/metadata fluidity, as well as the piecemeal, trial-and-error-based approach to interacting with dataframes.
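As a rough illustration of the signature dataframe features the abstract names (ordering, row/column equivalence, data/metadata fluidity, and flexible schemas), the sketch below uses plain pandas on a toy table. The data and variable names are assumptions made for illustration and do not come from the paper, and Modin itself is not used here.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch (not from the paper).
df = pd.DataFrame(
    {"city": ["Oakland", "Berkeley"], "2019": [10, 20], "2020": [30, 40]}
)

# Ordering: rows have a position, so positional access is meaningful.
first_row = df.iloc[0]

# Row/column equivalence: transposing swaps the two axes,
# turning the column labels into row labels.
transposed = df.T

# Data/metadata fluidity: a data column can become the row labels
# (metadata) and later be moved back into the data.
labeled = df.set_index("city")
restored = labeled.reset_index()

# Flexible schema: column types are inferred at construction time
# and can be changed afterwards without redeclaring the table.
as_float = df.astype({"2019": float})
```

Each step returns a new dataframe and leaves `df` unchanged, which fits the piecemeal, trial-and-error style of interaction the abstract describes.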
Research Organization:
Univ. of California, Oakland, CA (United States)
Sponsoring Organization:
National Science Foundation (NSF); USDOE Office of Science (SC)
Grant/Contract Number:
SC0016934
OSTI ID:
1803099
Journal Information:
Journal Name: Proceedings of the VLDB Endowment; Journal Volume: 13; Journal Issue: 12; ISSN 2150-8097
Publisher:
Association for Computing Machinery (ACM)
Country of Publication:
United States
Language:
English
