U.S. Department of Energy
Office of Scientific and Technical Information

ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain

Journal Article · Proceedings of the VLDB Endowment

This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances that are likely to be reused.

Experimentally, we show that our highly optimized implementations using 48 cores with two-way hyper-threading achieve 5.8–110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75–54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle.
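
For readers unfamiliar with the nearest-neighbor chain algorithm that the abstract builds on, the following is a minimal sequential sketch of the classic technique with complete linkage. It is not the ParChain implementation: it is single-threaded, merges one pair at a time, and omits the range query and caching optimizations. The function names (`complete_linkage`, `nn_chain`) and the sample points are illustrative assumptions. The sketch keeps only the points and the cluster memberships in memory and recomputes linkage distances on demand, so it uses linear space at the cost of extra distance computations.

```python
import math


def complete_linkage(points, a, b):
    """Largest pairwise Euclidean distance between members of clusters a and b."""
    return max(math.dist(points[i], points[j]) for i in a for j in b)


def nn_chain(points):
    """Sequential nearest-neighbor chain HAC (illustrative sketch, not ParChain).

    Returns a list of merges (members_a, members_b, distance), sorted by distance.
    """
    clusters = {i: (i,) for i in range(len(points))}  # cluster id -> member indices
    next_id = len(points)
    chain = []   # stack of cluster ids; each entry is the nearest neighbor of the one below
    merges = []
    while len(clusters) > 1:
        if not chain:
            chain.append(min(clusters))  # start a new chain from any active cluster
        top = chain[-1]
        prev = chain[-2] if len(chain) > 1 else None
        # Find the nearest neighbor of the cluster on top of the chain.
        # Ties are broken in favor of prev so that the chain cannot cycle.
        best, best_d = None, math.inf
        for c in clusters:
            if c == top:
                continue
            d = complete_linkage(points, clusters[top], clusters[c])
            if d < best_d or (d == best_d and c == prev):
                best, best_d = c, d
        if best == prev:
            # top and prev are reciprocal nearest neighbors: merge them.
            chain.pop()
            chain.pop()
            a = clusters.pop(prev)
            b = clusters.pop(top)
            clusters[next_id] = tuple(sorted(a + b))
            next_id += 1
            merges.append((a, b, best_d))
        else:
            chain.append(best)
    # The chain finds merges out of order; sorting by distance gives dendrogram order.
    return sorted(merges, key=lambda m: m[2])


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (20.0, 0.0)]
    for a, b, d in nn_chain(sample):
        print(a, b, round(d, 3))
```

The merge step fires whenever two clusters are each other's nearest neighbor. At any moment there may be several such reciprocal pairs in the data set, which is the property that allows a parallel variant to merge multiple clusters in the same round, as the abstract describes.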

Research Organization:
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
SC0018947
OSTI ID:
1980994
Journal Information:
Proceedings of the VLDB Endowment, Vol. 15, Issue 2; ISSN 2150-8097
Publisher:
Association for Computing Machinery (ACM)
Country of Publication:
United States
Language:
English
