U.S. Department of Energy
Office of Scientific and Technical Information

Significant DBSCAN+: Statistically Robust Density-based Clustering

Journal Article · ACM Transactions on Intelligent Systems and Technology
DOI: https://doi.org/10.1145/3474842 · OSTI ID: 1980836
 [1];  [2];  [3];  [4];  [4]
  1. University of Maryland, College Park, MD
  2. University of Pittsburgh, S. Bouquet Street, Pittsburgh, PA
  3. University of Minnesota, Minneapolis, MN
  4. University of Iowa, Iowa City, IA

Cluster detection is widely used in applications such as public health, public safety, and transportation. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. Clustering is a classical topic in data mining and learning, and a myriad of techniques have been developed to detect clusters with varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. Scan statistic approaches, on the other hand, explicitly control the rate of spurious results, but they typically assume a single “hotspot” of over-density, and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporating statistical rigor is a powerful mechanism that allows Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed up the proposed approach.
Experimental results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10–20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from hundreds or thousands of spurious detections to none or just a few across 100 datasets), and that the acceleration methods improve efficiency for both clustered and non-clustered data.
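The general idea of attaching a significance test to density-based clusters can be illustrated with a minimal sketch. This is not the Significant DBSCAN+ algorithm from the article: it assumes a single fixed scale, a unit-square study area, a complete-spatial-randomness null model, and cluster size as the test statistic, and all function names (`dbscan`, `significant_clusters`, `make_demo_points`) are hypothetical. Each observed cluster is compared against the largest cluster found in Monte Carlo simulations of uniformly random points, and only clusters with a small p-value are kept.

```python
import math
import random


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN; returns one label per point (-1 means noise)."""
    n = len(points)
    # Precompute eps-neighborhoods (each point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:  # only core points expand
                frontier.extend(neighbors[j])
    return labels


def cluster_sizes(labels):
    """Map each cluster label to its number of points (noise excluded)."""
    sizes = {}
    for lab in labels:
        if lab != -1:
            sizes[lab] = sizes.get(lab, 0) + 1
    return sizes


def significant_clusters(points, eps, min_pts, trials=99, alpha=0.05, seed=0):
    """Keep clusters whose size is unusual under a uniform null model.

    Assumption: the study area is the unit square, and the null hypothesis
    is that the same number of points is scattered uniformly at random.
    """
    rng = random.Random(seed)
    labels = dbscan(points, eps, min_pts)
    observed = cluster_sizes(labels)
    n = len(points)
    null_max = []
    for _ in range(trials):
        sim = [(rng.random(), rng.random()) for _ in range(n)]
        sizes = cluster_sizes(dbscan(sim, eps, min_pts))
        null_max.append(max(sizes.values(), default=0))
    kept = {}
    for lab, size in observed.items():
        # Monte Carlo p-value: share of null runs at least as extreme.
        p = (1 + sum(m >= size for m in null_max)) / (trials + 1)
        if p <= alpha:
            kept[lab] = p
    return labels, kept


def make_demo_points(seed=1, n_noise=100, n_blob=40):
    """Hypothetical demo data: uniform noise plus one tight Gaussian blob."""
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(n_noise)]
    pts += [(rng.gauss(0.5, 0.01), rng.gauss(0.5, 0.01)) for _ in range(n_blob)]
    return pts
```

Calling `significant_clusters(make_demo_points(), eps=0.03, min_pts=5)` returns the DBSCAN labels together with a dictionary of the clusters that pass the test and their p-values. Comparing against the maximum null cluster size is one simple way to limit spurious detections across all clusters at once; the article's multi-scale formulation and acceleration methods are not reproduced here.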

Research Organization: Univ. of Minnesota, Minneapolis, MN (United States)
Sponsoring Organization: USDOE Advanced Research Projects Agency - Energy (ARPA-E)
DOE Contract Number: AR0000795
OSTI ID: 1980836
Journal Information: ACM Transactions on Intelligent Systems and Technology, Vol. 12, Issue 5; ISSN 2157-6904
Publisher: Association for Computing Machinery (ACM)
Country of Publication: United States
Language: English
