Hierarchical DBSCAN (HDBSCAN)
Fast C++ implementation of the HDBSCAN (Hierarchical DBSCAN) and its related algorithms.
hdbscan(x, minPts, gen_hdbscan_tree = FALSE, gen_simplified_tree = FALSE) ## S3 method for class 'hdbscan' plot( x, scale = "suggest", gradient = c("yellow", "red"), show_flat = FALSE, ... ) coredist(x, minPts) mrdist(x, minPts, coredist = NULL) ## S3 method for class 'hdbscan' predict(object, newdata, data, ...)
x |
a data matrix (Euclidean distances are used) or a dist object calculated with an arbitrary distance metric. |
minPts |
integer; Minimum size of clusters. See details. |
gen_hdbscan_tree |
logical; should the robust single linkage tree be explicitly computed (see cluster tree in Chaudhuri et al, 2010). |
gen_simplified_tree |
logical; should the simplified hierarchy be explicitly computed (see Campello et al, 2013). |
scale |
integer; used to scale condensed tree based on the graphics device. Lower scale results in wider trees. |
gradient |
character vector; the colors to build the condensed tree coloring with. |
show_flat |
logical; whether to draw boxes indicating the most stable clusters. |
... |
additional arguments are passed on. |
coredist |
numeric vector with precomputed core distances (optional). |
object |
clustering object. |
newdata |
new data points for which the cluster membership should be predicted. |
data |
the data set used to create the clustering object. |
This fast implementation of HDBSCAN (Campello et al., 2013) computes the hierarchical cluster tree representing density estimates along with the stability-based flat cluster extraction. HDBSCAN essentially computes the hierarchy of all DBSCAN* clusterings, and then uses a stability-based extraction method to find optimal cuts in the hierarchy, thus producing a flat solution.
HDBSCAN performs the following steps:
Compute mutual reachability distance mrd between points (based on distances and core distances).
Use mdr as a distance measure to construct a minimum spanning tree.
Prune the tree using stability.
Extract the clusters.
Additional, related algorithms including the "Global-Local Outlier Score
from Hierarchies" (GLOSH; see section 6 of Campello et al., 2015)
is available in function glosh()
and the ability to cluster based on instance-level constraints (see
section 5.3 of Campello et al. 2015) are supported. The algorithms only need
the parameter minPts
.
Note that minPts
not only acts as a minimum cluster size to detect,
but also as a "smoothing" factor of the density estimates implicitly
computed from HDBSCAN.
coredist()
: The core distance is defined for each point as
the distance to the MinPts
's neighbor. It is a density estimate.
mrdist()
: The mutual reachability distance is defined between two points as
mrd(a, b) = max(coredist(a), coredist(b), dist(a, b))
. This distance metric is used by
HDBSCAN. It has the effect of increasing distances in low density areas.
predict()
assigns each new data point to the same cluster as the nearest point
if it is not more than that points core distance away. Otherwise the new point
is classified as a noise point (i.e., cluster ID 0).
hdbscan()
returns object of class hdbscan
with the following components:
cluster |
A integer vector with cluster assignments. Zero indicates noise points. |
minPts |
value of the |
cluster_scores |
The sum of the stability scores for each salient
(flat) cluster. Corresponds to cluster IDs given the in |
membership_prob |
The probability or individual stability of a point within its clusters. Between 0 and 1. |
outlier_scores |
The GLOSH outlier score of each point. |
hc |
An hclust object of the HDBSCAN hierarchy. |
coredist()
returns a vector with the core distance for each data point.
mrdist()
returns a dist object containing pairwise mutual reachability distances.
Matt Piekenbrock Campello RJGB, Moulavi D, Sander J (2013). Density-Based Clustering Based on Hierarchical Density Estimates. Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013, Lecture Notes in Computer Science 7819, p. 160. doi: 10.1007/978-3-642-37456-2_14
Campello RJGB, Moulavi D, Zimek A, Sander J (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(5):1-51. doi: 10.1145/2733381
Other clustering functions:
dbscan()
,
extractFOSC()
,
jpclust()
,
optics()
,
sNNclust()
## cluster the moons data set with HDBSCAN data(moons) res <- hdbscan(moons, minPts = 5) res plot(res) plot(moons, col = res$cluster + 1L) ## cluster the moons data set with HDBSCAN using Manhattan distances res <- hdbscan(dist(moons, method = "manhattan"), minPts = 5) plot(res) plot(moons, col = res$cluster + 1L) ## DS3 from Chameleon data("DS3") res <- hdbscan(DS3, minPts = 50) res ## Plot the simplified tree, highlight the most stable clusters plot(res, show_flat = TRUE) ## Plot the actual clusters (noise has cluster id 0 and is shown in black) plot(DS3, col = res$cluster + 1L, cex = .5)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.