U.S. Department of Energy
Office of Scientific and Technical Information

ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain

Journal Article · Proceedings of the VLDB Endowment

This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory, and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances, which are likely to be reused.

Experimentally, we show that our highly-optimized implementations using 48 cores with two-way hyper-threading achieve 5.8–110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75–54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle.
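
The abstract's central idea, merging several clusters per round while keeping memory linear, can be illustrated with a small sequential sketch. This is not the ParChain implementation: it is a toy version, assuming complete linkage and Euclidean points, that finds each cluster's nearest neighbor in every round and merges all reciprocal nearest-neighbor pairs at once. The function names (`complete_linkage`, `rnn_round_hac`) are hypothetical. Distances are recomputed from the points rather than stored in a distance matrix, so only the points and the cluster memberships are kept in memory; the paper's range query and caching optimizations are not reproduced here.

```python
import math
from itertools import product


def complete_linkage(points, a, b):
    """Complete linkage: largest point-to-point distance between clusters a and b."""
    return max(math.dist(points[i], points[j]) for i, j in product(a, b))


def rnn_round_hac(points):
    """Toy round-based HAC (assumed sketch, not the paper's code).

    Returns the merges as (members_a, members_b, linkage_distance) tuples.
    """
    clusters = {i: (i,) for i in range(len(points))}  # cluster id -> member indices
    next_id = len(points)
    merges = []
    while len(clusters) > 1:
        # Step 1: nearest neighbor of every cluster (ties go to the smaller id).
        nn = {
            c: min(
                (complete_linkage(points, clusters[c], clusters[o]), o)
                for o in clusters
                if o != c
            )
            for c in clusters
        }
        # Step 2: merge every reciprocal nearest-neighbor pair found this round.
        # Each cluster belongs to at most one such pair, so the merges are independent.
        for c, (dist, o) in nn.items():
            if c < o and nn[o][1] == c:
                merges.append((clusters[c], clusters[o], dist))
                clusters[next_id] = clusters.pop(c) + clusters.pop(o)
                next_id += 1
    return merges


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 0), (5, 1.5), (20, 0)]
    for left, right, dist in rnn_round_hac(pts):
        print(left, right, round(dist, 3))
```

In the first round of the usage example, the pairs (0, 1) and (2, 3) are both reciprocal nearest neighbors, so both merges happen in the same round. In a parallel setting, the nearest-neighbor searches in step 1 and the merges in step 2 are the parts that could run concurrently. Recomputing linkage distances from the points trades extra computation for linear memory, which is the cost the abstract's two optimizations are meant to reduce.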

Research Organization:
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
SC0018947
OSTI ID:
1980994
Journal Information:
Proceedings of the VLDB Endowment, Vol. 15, Issue 2; ISSN 2150-8097
Publisher:
Association for Computing Machinery (ACM)
Country of Publication:
United States
Language:
English
