U.S. Department of Energy
Office of Scientific and Technical Information

In-depth analysis on parallel processing patterns for high-performance Dataframes

Journal Article · Future Generations Computer Systems
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations even when working on moderately large data sets. We believe that there is plenty of room for improvement by approaching this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon. In this paper, we expand on the initial concept by introducing a cost model for evaluating these patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
Research Organization:
Univ. of Virginia, Charlottesville, VA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
SC0023452
OSTI ID:
3000839
Report Number(s):
2307.01394
Journal Information:
Journal Name: Future Generations Computer Systems; Vol. 149; ISSN 0167-739X
Publisher:
Elsevier
Country of Publication:
United States
Language:
English

References (16)

Twister2: Design of a big data toolkit · journal · March 2019
  • Kamburugamuve, Supun; Govindarajan, Kannan; Wickramasinghe, Pulasthi
  • Concurrency and Computation: Practice and Experience, Vol. 32, Issue 3 https://doi.org/10.1002/cpe.5189
LogGP: Incorporating Long Messages into the LogP Model for Parallel Computation · journal · July 1997
Performance analysis of MPI collective operations · journal · March 2007
On the versatility of parallel sorting by regular sampling · journal · October 1993
The communication challenge for MPP: Intel Paragon and Meiko CS-2 · journal · March 1994
Solving Problems On Concurrent Processors Vol. 1: General Techniques and Regular Problems · journal · January 1989
Efficient algorithms for all-to-all communications in multiport message-passing systems · journal · January 1997
A Fast, Scalable, Universal Approach For Distributed Data Aggregations · conference · December 2020
MapReduce: simplified data processing on large clusters · journal · January 2008
LogP: towards a realistic model of parallel computation · conference · January 1993
  • Culler, David; Karp, Richard; Patterson, David
  • Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPOPP '93 https://doi.org/10.1145/155332.155333
Implementing a classic · conference · June 2014
Photon: A Fast Query Engine for Lakehouse Systems · conference · June 2022
A bridging model for parallel computation · journal · August 1990
Optimization of Collective Communication Operations in MPICH · journal · February 2005
Distributed join algorithms on thousands of cores · journal · January 2017
Dask: Parallel Computation with Blocked algorithms and Task Scheduling · conference · January 2015
