Become an expert in R — Interactive courses, Cheat Sheets, certificates and more!
Get Started for Free

clustatsum

Compute and format cluster validation statistics


Description

clustatsum computes cluster validation statistics by running cqcluster.stats, and potentially distrsimilarity, and collecting some key statistics values with a somewhat different nomenclature.

This was implemented as a helper function for use inside of clusterbenchstats and cgrestandard.

Usage

clustatsum(datadist=NULL,clustering,noisecluster=FALSE,
                       datanp=NULL,npstats=FALSE,useboot=FALSE,
                       bootclassif=NULL,
                       bootmethod="nselectboot",
                       bootruns=25, cbmethod=NULL,methodpars=NULL,
                       distmethod=NULL,dnnk=2,
                       pamcrit=TRUE,...)

Arguments

datadist

distances on which validation-measures are based, dist object or distance matrix. If NULL, this is computed from datanp; at least one of datadist and datanp must be specified.

clustering

an integer vector of length of the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters.

noisecluster

logical. If TRUE, it is assumed that the largest cluster number in clustering denotes a 'noise class', i.e. points that do not belong to any cluster. These points are not taken into account for the computation of all functions of within and between cluster distances including the validation indexes.

datanp

optional observations times variables data matrix, see npstats.

npstats

logical. If TRUE, distrsimilarity is called and the two statistics computed there are added to the output. These are based on datanp and require datanp to be specified.

useboot

logical. If TRUE, a stability index (either nselectboot or prediction.strength) will be involved.

bootclassif

If useboot=TRUE, a string indicating the classification method to be used with the stability index, see the classification argument of nselectboot and prediction.strength.

bootmethod

either "nselectboot" or "prediction.strength"; stability index to be used if useboot=TRUE.

bootruns

integer. Number of resampling runs. If useboot=TRUE, passed on as B to nselectboot or M to prediction.strength.

cbmethod

CBI-function (see kmeansCBI); clustering method to be used for stability assessment if useboot=TRUE.

methodpars

parameters to be passed on to cbmethod.

distmethod

logical. In case of useboot=TRUE indicates whether cbmethod will interpret data as distances.

dnnk

nnk-argument to be passed on to distrsimilarity.

pamcrit

pamcrit-argument to be passed on to cqcluster.stats.

...

further arguments to be passed on to cqcluster.stats.

Value

clustatsum returns a list. The components, as listed below, are outputs of summary.cquality with default parameters, which means that they are partly transformed versions of those given out by cqcluster.stats, i.e., their range is between 0 and 1 and large values are good. Those from distrsimilarity are computed with largeisgood=TRUE, correspondingly.

avewithin

average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight).

mnnd

average distance to nnkth nearest neighbour within cluster.

cvnnd

coefficient of variation of dissimilarities to nnkth nearest wthin-cluster neighbour, measuring uniformity of within-cluster densities, weighted over all clusters, see Sec. 3.7 of Hennig (2019).

maxdiameter

maximum cluster diameter.

widestgap

widest within-cluster gap or average of cluster-wise widest within-cluster gap, depending on parameter averagegap.

sindex

separation index, see argument sepindex.

minsep

minimum cluster separation.

asw

average silhouette width. See silhouette.

dindex

this index measures to what extent the density decreases from the cluster mode to the outskirts; I-densdec in Sec. 3.6 of Hennig (2019); low values are good.

denscut

this index measures whether cluster boundaries run through density valleys; I-densbound in Sec. 3.6 of Hennig (2019); low values are good.

highdgap

this measures whether there is a large within-cluster gap with high density on both sides; I-highdgap in Sec. 3.6 of Hennig (2019); low values are good.

pearsongamma

correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001).

withinss

a generalisation of the within clusters sum of squares (k-means objective function), which is obtained if d is a Euclidean distance matrix. For general distance measures, this is half the sum of the within cluster squared dissimilarities divided by the cluster size.

entropy

entropy of the distribution of cluster memberships, see Meila(2007).

pamc

average distance to cluster centroid.

kdnorm

Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea).

kdunif

Kolmogorov distance between distribution of distances to nnkth nearest within-cluster neighbor and appropriate Gamma-distribution, see Byers and Raftery (1998), aggregated over clusters.

boot

if useboot=TRUE, stability value; stabk for method nselectboot; mean.pred for method prediction.strength.

Author(s)

References

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822

Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17, 107-145.

Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282

Kaufman, L. and Rousseeuw, P.J. (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley, New York.

Meila, M. (2007) Comparing clusterings?an information based distance, Journal of Multivariate Analysis, 98, 873-895.

See Also

Examples

set.seed(20000)
  options(digits=3)
  face <- rFace(20,dMoNo=2,dNoEy=0,p=2)
  dface <- dist(face)
  complete3 <- cutree(hclust(dface),3)
  clustatsum(dface,complete3)

fpc

Flexible Procedures for Clustering

v2.2-9
GPL
Authors
Christian Hennig <christian.hennig@unibo.it>
Initial release
2020-12-06

We don't support your browser anymore

Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.