Cluster validation statistics (version for use with clusterbenchstats)
This is a more sophisticated version of cluster.stats for use with clusterbenchstats, see Hennig (2017).
Computes a number of distance-based statistics, which can be used for cluster
validation, comparison between clusterings and decision about
the number of clusters: cluster sizes, cluster diameters,
average distances within and between clusters, cluster separation,
biggest within cluster gap, average silhouette widths,
the Calinski and Harabasz index, a Pearson version of Hubert's gamma
coefficient, the Dunn index, further statistics introduced in Hennig (2017),
and two indexes to assess the similarity of two clusterings, namely the
corrected Rand index and Meila's VI.
cqcluster.stats(d = NULL, clustering, alt.clustering = NULL,
                noisecluster = FALSE, silhouette = TRUE, G2 = FALSE, G3 = FALSE,
                wgap = TRUE, sepindex = TRUE, sepprob = 0.1, sepwithnoise = TRUE,
                compareonly = FALSE, aggregateonly = FALSE,
                averagegap = FALSE, pamcrit = TRUE, dquantile = 0.1,
                nndist = TRUE, nnk = 2, standardisation = "max", sepall = TRUE,
                maxk = 10, cvstan = sqrt(length(clustering)))

## S3 method for class 'cquality'
summary(object, stanbound = TRUE, largeisgood = TRUE, ...)

## S3 method for class 'summary.cquality'
print(x, ...)
d |
a distance object (as generated by |
clustering |
an integer vector of length of the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters. |
alt.clustering |
an integer vector such as for
|
noisecluster |
logical. If |
silhouette |
logical. If |
G2 |
logical. If |
G3 |
logical. If |
wgap |
logical. If |
sepindex |
logical. If |
sepprob |
numerical between 0 and 1, see |
sepwithnoise |
logical. If |
compareonly |
logical. If |
aggregateonly |
logical. If |
averagegap |
logical. If |
pamcrit |
logical. If |
dquantile |
numerical between 0 and 1; quantile used for kernel density estimator for density indexes, see Hennig (2019), Sec. 3.6. |
nndist |
logical. If |
nnk |
integer. Number of neighbours used in average and
coefficient of
variation of distance to nearest within cluster neighbour (clusters
with |
standardisation |
|
sepall |
logical. If |
maxk |
numeric. Parsimony is defined as the number of clusters
divided by |
cvstan |
numeric. |
object |
object of class |
x |
object of class |
stanbound |
logical. If |
largeisgood |
logical. If |
... |
no effect. |
The standardisation parameter governs the standardisation of the index values.
standardisation="none" means that unstandardised raw values of indexes are
given out. Otherwise, entropy will be standardised by the maximum possible
value for the given number of clusters; within.cluster.ss and
between.cluster.ss will be standardised by the overall sum of squares; mnnd
will be standardised by the maximum distance to the nnk-th nearest neighbour
within cluster; pearsongamma will be standardised by adding 1 and dividing
by 2; cvnn will be standardised by cvstan (the default is the possible
maximum).
standardisation allows options for the standardisation of average.within,
sindex, wgap, pamcrit, max.diameter, min.separation and can be "max"
(maximum distance), "ave" (average distance), "q90" (0.9-quantile of
distances), or a positive number. "max" is the default and standardises all
the listed indexes into the range [0,1].
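The standardisation rules above can be sketched in base R. This is only an illustration of the description, not the package's internal code; the toy data and the names gamma_raw, gamma_std, stan_max, stan_ave, stan_q90 and max_diameter_std are assumptions made for this example.

```r
# Toy data: two well-separated groups in the plane (illustrative only).
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
clustering <- rep(1:2, each = 10)
d <- dist(x)
dv <- as.vector(d)      # plain numeric vector of all pairwise distances
dm <- as.matrix(d)      # full symmetric distance matrix

# Raw Pearson gamma: correlation between the distances and a 0-1 vector
# (0 = same cluster, 1 = different clusters).
diff_cluster <- outer(clustering, clustering, "!=")
lower <- lower.tri(dm)
gamma_raw <- cor(dm[lower], as.numeric(diff_cluster[lower]))

# Standardised as described: add 1 and divide by 2, which maps [-1, 1] to [0, 1].
gamma_std <- (gamma_raw + 1) / 2

# Candidate divisors for the distance-based indexes.
stan_max <- max(dv)              # corresponds to standardisation = "max"
stan_ave <- mean(dv)             # corresponds to standardisation = "ave"
stan_q90 <- quantile(dv, 0.9)    # corresponds to standardisation = "q90"

# Example: the maximum cluster diameter divided by the maximum distance.
diam <- sapply(1:2, function(i) max(dm[clustering == i, clustering == i]))
max_diameter_std <- max(diam) / stan_max
```

Dividing by the maximum distance guarantees a value in [0,1] for any index that is itself a distance or an average of distances, which is presumably why "max" is the default.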
cqcluster.stats with compareonly=FALSE and aggregateonly=FALSE returns a list
of type cquality containing the components
n, cluster.number, cluster.size, min.cluster.size, noisen, diameter,
average.distance, median.distance, separation, average.toother,
separation.matrix, ave.between.matrix, average.between, average.within,
n.between, n.within, max.diameter, min.separation,
within.cluster.ss, clus.avg.silwidths, avg.silwidth,
g2, g3, pearsongamma, dunn, dunn2, entropy, wb.ratio, ch, cwidegap,
widestgap, corrected.rand, vi, sindex, svec, psep, stan, nnk, mnnd,
pamc, pamcentroids, dindex, denscut, highdgap, npenalty, dpenalty,
withindensp, densoc, pdistto, pclosetomode, distto, percwdens,
percdensoc, parsimony, cvnnd, cvnndc.
Some of these are standardised, see Details.

If compareonly=TRUE, only corrected.rand, vi are given out.

If aggregateonly=TRUE, only n, cluster.number, min.cluster.size, noisen,
diameter, average.between, average.within, max.diameter, min.separation,
within.cluster.ss, avg.silwidth, g2, g3, pearsongamma, dunn, dunn2, entropy,
wb.ratio, ch, widestgap, corrected.rand, vi, sindex, svec, psep, stan, nnk,
mnnd, pamc, pamcentroids, dindex, denscut, highdgap, parsimony, cvnnd,
cvnndc are given out.
summary.cquality returns a list of type summary.cquality with components
average.within, nnk, mnnd, avg.silwidth, widestgap, sindex, pearsongamma,
entropy, pamc, within.cluster.ss, dindex, denscut, highdgap, parsimony,
max.diameter, min.separation, cvnnd. These are as documented below for
cqcluster.stats, but after transformation by stanbound and largeisgood,
see arguments.
n |
number of points. |
cluster.number |
number of clusters. |
cluster.size |
vector of cluster sizes (number of points). |
min.cluster.size |
size of smallest cluster. |
noisen |
number of noise points, see argument |
diameter |
vector of cluster diameters (maximum within cluster distances). |
average.distance |
vector of clusterwise within cluster average distances. |
median.distance |
vector of clusterwise within cluster distance medians. |
separation |
vector of clusterwise minimum distances of a point in the cluster to a point of another cluster. |
average.toother |
vector of clusterwise average distances of a point in the cluster to the points of other clusters. |
separation.matrix |
matrix of separation values between all pairs of clusters. |
ave.between.matrix |
matrix of mean dissimilarities between points of every pair of clusters. |
average.between |
average distance between clusters. |
average.within |
average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight). |
n.between |
number of distances between clusters. |
n.within |
number of distances within clusters. |
max.diameter |
maximum cluster diameter. |
min.separation |
minimum cluster separation. |
within.cluster.ss |
a generalisation of the within clusters sum
of squares (k-means objective function), which is obtained if
|
clus.avg.silwidths |
vector of cluster average silhouette
widths. See
|
avg.silwidth |
average silhouette
width. See |
g2 |
Goodman and Kruskal's Gamma coefficient. See Milligan and Cooper (1985), Gordon (1999, p. 62). |
g3 |
G3 coefficient. See Gordon (1999, p. 62). |
pearsongamma |
correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001). |
dunn |
minimum separation / maximum diameter. Dunn index, see Halkidi et al. (2002). |
dunn2 |
minimum average dissimilarity between two clusters / maximum average within cluster dissimilarity, another version of the family of Dunn indexes. |
entropy |
entropy of the distribution of cluster memberships, see Meila (2007). |
wb.ratio |
|
ch |
Calinski and Harabasz index (Calinski and Harabasz 1974, optimal in Milligan and Cooper 1985; generalised for dissimilarities in Hennig and Liao 2013). |
cwidegap |
vector of widest within-cluster gaps. |
widestgap |
widest within-cluster gap or average of cluster-wise
widest within-cluster gap, depending on parameter |
corrected.rand |
corrected Rand index (if |
vi |
variation of information (VI) index (if |
sindex |
separation index, see argument |
svec |
vector of smallest closest distances of points to next
cluster that are used in the computation of |
psep |
vector of all closest distances of points to next cluster. |
stan |
value by which some statistics were standardised, see Details. |
nnk |
value of input parameter |
mnnd |
average distance to |
pamc |
average distance to cluster centroid. |
pamcentroids |
index numbers of cluster centroids. |
dindex |
this index measures to what extent the density decreases from the cluster mode to the outskirts; I-densdec in Sec. 3.6 of Hennig (2019); low values are good. |
denscut |
this index measures whether cluster boundaries run through density valleys; I-densbound in Sec. 3.6 of Hennig (2019); low values are good. |
highdgap |
this measures whether there is a large within-cluster gap with high density on both sides; I-highdgap in Sec. 3.6 of Hennig (2019); low values are good. |
npenalty |
vector of penalties for all clusters that are used
in the computation of |
dpenalty |
vector of penalties for all clusters that are used in
the computation of |
withindensp |
distance-based kernel density values for all points as computed in Sec. 3.6 of Hennig (2019). |
densoc |
contribution of points from other clusters than the one
to which a point is assigned to the density, for all points; called
|
pdistto |
list that for all clusters has a sequence of point numbers. These are the points already incorporated in the sequence of points constructed in the algorithm in Sec. 3.6 of Hennig (2019) to which the next point to be joined is connected. |
pclosetomode |
list that for all clusters has a sequence of point numbers. Sequence of points to be incorporated in the sequence of points constructed in the algorithm in Sec. 3.6 of Hennig (2019). |
distto |
list that for all clusters has a sequence of differences
between the standardised densities (see |
percwdens |
this is |
percdensoc |
this is |
parsimony |
number of clusters divided by |
cvnnd |
coefficient of variation of dissimilarities to
|
cvnndc |
vector of cluster-wise coefficients of variation of
dissimilarities to |
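Several of the components listed above are simple functions of the dissimilarity matrix and the clustering vector. The following base R sketch recomputes diameter, separation, dunn and entropy by hand from their descriptions. It is an illustration under stated assumptions (toy data, unstandardised values, natural logarithm for the entropy) and is not taken from the package source.

```r
# Toy data: 10 points around (0, 0) and 15 points around (6, 6).
set.seed(2)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(30, mean = 6), ncol = 2))
clustering <- rep(1:2, times = c(10, 15))
dm <- as.matrix(dist(x))
k <- max(clustering)

# diameter: maximum within-cluster distance, one value per cluster.
diameter <- sapply(seq_len(k), function(i) {
  max(dm[clustering == i, clustering == i])
})

# separation: minimum distance from a point in the cluster to a point
# of another cluster, one value per cluster.
separation <- sapply(seq_len(k), function(i) {
  min(dm[clustering == i, clustering != i])
})

# dunn: minimum separation / maximum diameter.
dunn <- min(separation) / max(diameter)

# entropy of the distribution of cluster memberships
# (natural logarithm assumed; unstandardised).
p <- as.vector(table(clustering)) / length(clustering)
entropy <- -sum(p * log(p))
```

With two clusters of sizes 10 and 15 the membership proportions are 0.4 and 0.6, so the unstandardised entropy is below its maximum log(2), which would be reached only for equally sized clusters.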
Because cqcluster.stats
processes a full dissimilarity matrix, it
isn't suitable for large data sets. You may consider
distcritmulti
in that case.
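To get a feeling for why a full dissimilarity matrix becomes a problem, the sketch below counts the pairwise dissimilarities that a dist object holds for n cases and gives a rough storage estimate. The helper n_pairs and the 8-bytes-per-value figure are assumptions for illustration; the actual memory use of cqcluster.stats will be higher because of intermediate objects.

```r
# Number of pairwise dissimilarities stored in a dist object for n cases.
n_pairs <- function(n) n * (n - 1) / 2

n <- c(200, 2000, 20000)
pairs <- n_pairs(n)

# Rough storage in megabytes, assuming 8 bytes per double-precision value.
approx_mb <- pairs * 8 / 1024^2

size_table <- data.frame(n = n, pairs = pairs, approx_mb = round(approx_mb, 1))
print(size_table)
```

The count grows quadratically: multiplying the number of cases by 10 multiplies the number of dissimilarities by roughly 100.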
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
Calinski, T., and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1-27.
Gordon, A. D. (1999) Classification, 2nd ed. Chapman and Hall.
Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17, 107-145.
Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C Applied Statistics, 62, 309-369.
Hennig, C. (2013) How many bee species? A case study in determining the number of clusters. In: M. Spiliopoulou, L. Schmidt-Thieme, R. Janning (eds.): "Data Analysis, Machine Learning and Knowledge Discovery", Springer, Berlin, 41-49.
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York, 1-24, https://arxiv.org/abs/1703.09282
Kaufman, L. and Rousseeuw, P.J. (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley, New York.
Meila, M. (2007) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895.
Milligan, G. W. and Cooper, M. C. (1985) An examination of procedures for determining the number of clusters. Psychometrika, 50, 159-179.
cluster.stats, silhouette, dist, calinhara, distcritmulti.
clusterboot computes clusterwise stability statistics by resampling.
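library(fpc)
set.seed(20000)
options(digits = 3)
# Simulated "face" data and its Euclidean distances
face <- rFace(200, dMoNo = 2, dNoEy = 0, p = 2)
dface <- dist(face)
# Complete-linkage hierarchical clustering, cut into 3 clusters
complete3 <- cutree(hclust(dface), 3)
# Validation statistics, compared against the known grouping
cqcluster.stats(dface, complete3,
                alt.clustering = as.integer(attr(face, "grouping")))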