Compute and format cluster validation statistics
clustatsum computes cluster validation statistics by running cqcluster.stats and, potentially, distrsimilarity, and collects some key statistics values under a somewhat different nomenclature.
It was implemented as a helper function for use inside clusterbenchstats and cgrestandard.
clustatsum(datadist=NULL, clustering, noisecluster=FALSE,
           datanp=NULL, npstats=FALSE, useboot=FALSE,
           bootclassif=NULL, bootmethod="nselectboot", bootruns=25,
           cbmethod=NULL, methodpars=NULL, distmethod=NULL, dnnk=2,
           pamcrit=TRUE, ...)
datadist |
distances on which validation-measures are based, |
clustering |
an integer vector whose length equals the number of cases, indicating a clustering. The clusters have to be numbered from 1 to the number of clusters. |
noisecluster |
logical. If |
datanp |
optional observations times variables data matrix, see
|
npstats |
logical. If |
useboot |
logical. If |
bootclassif |
If |
bootmethod |
either |
bootruns |
integer. Number of resampling runs. If
|
cbmethod |
CBI-function (see |
methodpars |
parameters to be passed on to |
distmethod |
logical. In case of |
dnnk |
|
pamcrit |
|
... |
further arguments to be passed on to
|
clustatsum returns a list. The components, as listed below, are outputs of summary.cquality with default parameters, which means that they are partly transformed versions of those given out by cqcluster.stats, i.e., their range is between 0 and 1 and large values are good. Those from distrsimilarity are computed with largeisgood=TRUE, correspondingly.
avewithin |
average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight). |
mnnd |
average distance to |
cvnnd |
coefficient of variation of dissimilarities to
|
maxdiameter |
maximum cluster diameter. |
widestgap |
widest within-cluster gap or average of cluster-wise
widest within-cluster gap, depending on parameter |
sindex |
separation index, see argument |
minsep |
minimum cluster separation. |
asw |
average silhouette
width. See |
dindex |
this index measures to what extent the density decreases from the cluster mode to the outskirts; I-densdec in Sec. 3.6 of Hennig (2019); low values are good. |
denscut |
this index measures whether cluster boundaries run through density valleys; I-densbound in Sec. 3.6 of Hennig (2019); low values are good. |
highdgap |
this measures whether there is a large within-cluster gap with high density on both sides; I-highdgap in Sec. 3.6 of Hennig (2019); low values are good. |
pearsongamma |
correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001). |
withinss |
a generalisation of the within clusters sum
of squares (k-means objective function), which is obtained if
|
entropy |
entropy of the distribution of cluster memberships, see Meila (2007). |
pamc |
average distance to cluster centroid. |
kdnorm |
Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea). |
kdunif |
Kolmogorov distance between distribution of distances to
|
boot |
if |
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17, 107-145.
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
Meila, M. (2007) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895.
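library(fpc)  # provides rFace and clustatsum
set.seed(20000)
options(digits=3)
# Simulate a small "face" data set and compute its distance matrix
face <- rFace(20,dMoNo=2,dNoEy=0,p=2)
dface <- dist(face)
# Complete linkage clustering, cut into 3 clusters
complete3 <- cutree(hclust(dface),3)
clustatsum(dface,complete3)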
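Some of the simpler statistics listed under the return value can be checked by hand in base R. The following is a minimal sketch, not the package's implementation: the toy data and all variable names (x, raw_entropy, raw_gamma, raw_maxdiameter, raw_minsep) are made up for illustration, and the values are the raw definitions as described above, before any rescaling that summary.cquality may apply, so they need not match the numbers returned by clustatsum.

```r
# Toy data (assumed for illustration): two well-separated groups on a line.
x <- c(0, 0.1, 0.2, 5, 5.1, 5.3)
clustering <- c(1, 1, 1, 2, 2, 2)
d <- dist(x)
dvec <- as.vector(d)

# Entropy of the distribution of cluster memberships.
p <- as.vector(table(clustering)) / length(clustering)
raw_entropy <- -sum(p * log(p))

# 0-1 vector over all pairs: 0 = same cluster, 1 = different clusters.
# dist() on the labels uses the same pair order as d.
different <- as.numeric(as.vector(dist(clustering)) > 0)

# Correlation between distances and the 0-1 vector (raw Pearson gamma).
raw_gamma <- cor(dvec, different)

# Largest within-cluster distance and smallest between-cluster distance.
raw_maxdiameter <- max(dvec[different == 0])
raw_minsep <- min(dvec[different == 1])

print(c(entropy = raw_entropy, gamma = raw_gamma,
        maxdiameter = raw_maxdiameter, minsep = raw_minsep))
```

Because both clusters have the same size here, the raw entropy equals log(2), and the clear separation between the groups pushes the correlation close to 1.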