Fast hierarchical, agglomerative clustering of vector data
This function implements hierarchical, agglomerative clustering with memory-saving algorithms.
hclust.vector(X, method="single", members=NULL, metric='euclidean', p=NULL)
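X | an (N×D) matrix of 'double' values: N observations in D variables.
method | the agglomeration method to be used. This must be (an unambiguous abbreviation of) one of "single", "ward", "centroid" or "median".
members | NULL or a vector whose length equals the number of observations.
metric | the distance measure to be used. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski".
p | parameter for the Minkowski metric.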
The function hclust.vector provides clustering when the input is vector data. It uses memory-saving algorithms which allow processing of larger data sets than hclust does.

The "ward", "centroid" and "median" methods require metric="euclidean" and cluster the data set with respect to Euclidean distances.

For "single" linkage clustering, any dissimilarity measure may be chosen. Currently, the same metrics are implemented as the dist function provides.
The call
    hclust.vector(X, method='single', metric=[...])
gives the same result as
    hclust(dist(X, method=[...]), method='single')
but uses less memory and is equally fast.
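
As a rough check of this equivalence, the following sketch compares the merge heights from both calls on a small random matrix. It assumes that the fastcluster package is installed; the matrix X, the seed and the choice of the "manhattan" metric are illustrative only and not part of the original documentation. Note that dist takes the metric through its method argument, and that stats::hclust is called explicitly because loading fastcluster masks hclust.

```r
library(fastcluster)

set.seed(1)
# Illustrative data (an assumption): 50 observations in 4 variables
X <- matrix(rnorm(200), nrow = 50, ncol = 4)

# Memory-saving single linkage directly on the vector data
hc_vec <- hclust.vector(X, method = "single", metric = "manhattan")

# Reference: build the full distance matrix, then call stats::hclust
hc_ref <- stats::hclust(dist(X, method = "manhattan"), method = "single")

# Compare the sorted merge heights of the two dendrograms
heights_match <- isTRUE(all.equal(sort(hc_vec$height), sort(hc_ref$height)))
print(heights_match)
```

Only the heights are compared here, since the row order of the merge matrix and the leaf order may legitimately differ between implementations.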
For the Euclidean methods, care must be taken since hclust expects squared Euclidean distances. Hence, the call
    hclust.vector(X, method='centroid')
is, aside from the lesser memory requirements, equivalent to
    d = dist(X)
    hc = hclust(d^2, method='centroid')
    hc$height = sqrt(hc$height)
The same applies to the "median" method. The "ward" method in hclust.vector is equivalent to hclust with method "ward.D2", but to method "ward.D" only after squaring as above.
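
The relationship between the "ward" method and the two Ward variants of hclust can be checked with a short sketch like the one below. It assumes that fastcluster is installed; the random matrix X and the seed are made up for illustration, and only the merge heights are compared.

```r
library(fastcluster)

set.seed(2)
# Illustrative data (an assumption): 30 observations in 4 variables
X <- matrix(rnorm(120), nrow = 30, ncol = 4)

# Ward clustering directly on the vector data (Euclidean metric)
hc_vec <- hclust.vector(X, method = "ward")

# "ward.D2" works on plain Euclidean distances
hc_d2 <- stats::hclust(dist(X), method = "ward.D2")

# "ward.D" needs squared distances; take the square root of the heights afterwards
hc_d <- stats::hclust(dist(X)^2, method = "ward.D")

same_as_d2 <- isTRUE(all.equal(sort(hc_vec$height), sort(hc_d2$height)))
same_as_d  <- isTRUE(all.equal(sort(hc_vec$height), sort(sqrt(hc_d$height))))
print(c(ward.D2 = same_as_d2, ward.D = same_as_d))
```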
More details are in the User's manual fastcluster.pdf, which is available as a vignette. Get this from the R command line with vignette('fastcluster').
Daniel Müllner
# Taken and modified from stats::hclust
## Perform centroid clustering with squared Euclidean distances,
## cut the tree into ten clusters and reconstruct the upper part of the
## tree from the cluster centers.
hc <- hclust.vector(USArrests, "cen")
# squared Euclidean distances
hc$height <- hc$height^2
memb <- cutree(hc, k = 10)
cent <- NULL
for(k in 1:10){
  cent <- rbind(cent, colMeans(USArrests[memb == k, , drop = FALSE]))
}
hc1 <- hclust.vector(cent, method = "cen", members = table(memb))
# squared Euclidean distances
hc1$height <- hc1$height^2
opar <- par(mfrow = c(1, 2))
plot(hc, labels = FALSE, hang = -1, main = "Original Tree")
plot(hc1, labels = FALSE, hang = -1, main = "Re-start from 10 clusters")
par(opar)