Similarity of within-cluster distributions to normal and uniform
Two measures of dissimilarity between the within-cluster distributions of a dataset and the normal or uniform distribution. For the normal, it is the Kolmogorov distance between the distribution of the Mahalanobis distances to the cluster center and a chi-squared distribution. For the uniform, it is the Kolmogorov distance between the distribution of the distances to the kth nearest neighbour and a Gamma distribution (this is based on Byers and Raftery (1998)). The clusterwise values are aggregated by weighting with the cluster sizes.
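The normal-case measure and the size-weighted aggregation described above can be sketched in base R. This is an illustration under stated assumptions, not the code used inside distrsimilarity: the helper names kd_normal_cluster and kd_normal are hypothetical, the cluster center is taken to be the cluster mean, and squared Mahalanobis distances are compared with a chi-squared distribution whose degrees of freedom equal the number of variables (which holds only approximately when mean and covariance are estimated).

```r
# Sketch only: NOT the fpc implementation. Helper names are hypothetical.
kd_normal_cluster <- function(xc) {
  xc <- as.matrix(xc)
  p <- ncol(xc)
  # mahalanobis() returns SQUARED distances to the given center
  d2 <- mahalanobis(xc, colMeans(xc), cov(xc))
  # Kolmogorov distance between the ECDF of d2 and chi-squared(p)
  unname(ks.test(d2, "pchisq", df = p)$statistic)
}

kd_normal <- function(x, clustering) {
  x <- as.matrix(x)
  sizes <- table(clustering)
  # One Kolmogorov distance per cluster
  kdc <- sapply(names(sizes), function(cl)
    kd_normal_cluster(x[clustering == as.integer(cl), , drop = FALSE]))
  # Aggregate by weighting each cluster-wise value with its cluster size
  list(kdnormc = unname(kdc),
       kdnorm = sum(kdc * as.numeric(sizes)) / sum(sizes))
}

set.seed(1)
# Two Gaussian clusters of sizes 100 and 50 in two dimensions
x <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
cl <- rep(1:2, c(100, 50))
res <- kd_normal(x, cl)
res$kdnormc
res$kdnorm
```

The aggregated value is a weighted mean of the cluster-wise values, so it always lies between the smallest and largest of them.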
distrsimilarity(x, clustering, noisecluster = FALSE, distribution = c("normal", "uniform"), nnk = 2, largeisgood = FALSE, messages = FALSE)
x |
the data matrix; a numerical object which can be coerced to a matrix. |
clustering |
integer vector of class numbers; length must equal
|
noisecluster |
logical. If |
distribution |
vector of |
nnk |
integer. Number of nearest neighbors to use for dissimilarity to the uniform. |
largeisgood |
logical. If |
messages |
logical. If |
List with the following components
kdnorm |
Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea). |
kdunif |
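Kolmogorov distance between distribution of distances to the nnk-th nearest within-cluster neighbour and appropriate Gamma distribution, aggregated over clusters. |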
kdnormc |
vector of cluster-wise Kolmogorov distances between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution. |
kdunifc |
vector of cluster-wise Kolmogorov distances between distribution of distances to the nnk-th nearest within-cluster neighbour and appropriate Gamma distribution. |
xmahal |
vector of Mahalanobis distances to the respective cluster center. |
xdknn |
vector of distances to the nnk-th nearest within-cluster neighbour. |
It is very hard to capture similarity to a multivariate normal or uniform distribution in a single value, and both measures used here have their shortcomings. In particular, the dissimilarity to the uniform can still indicate a good fit if the cluster has holes, or if it follows a uniform distribution concentrated on several disconnected sets.
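The idea behind the uniform measure can be sketched in base R and tried on data that are uniform on two disconnected sets. This is a sketch under stated assumptions, not the code used inside distrsimilarity: the helper names ks_distance and kd_uniform_cluster are hypothetical. It relies on the fact that, for a homogeneous Poisson process in d dimensions, the d-th power of the distance to the k-th nearest neighbour follows a Gamma distribution with shape k; the rate is estimated here by maximum likelihood with the shape held fixed, which may differ from what the package does.

```r
# Sketch only: NOT the fpc implementation. Helper names are hypothetical.

# Kolmogorov distance between the ECDF of v and a given CDF (valid with ties)
ks_distance <- function(v, cdf, ...) {
  v <- sort(v)
  n <- length(v)
  f <- cdf(v, ...)
  max(f - (seq_len(n) - 1) / n, seq_len(n) / n - f)
}

kd_uniform_cluster <- function(xc, k = 2) {
  xc <- as.matrix(xc)
  d <- ncol(xc)
  dm <- as.matrix(dist(xc))
  # k-th nearest-neighbour distance: entry k + 1 of each sorted row,
  # because entry 1 is the zero distance of a point to itself
  dk <- apply(dm, 1, function(r) sort(r)[k + 1])
  tk <- dk^d
  # Gamma with known shape k: maximum likelihood rate is k / mean
  rate <- k / mean(tk)
  ks_distance(tk, pgamma, shape = k, rate = rate)
}

set.seed(2)
# Uniform on two disconnected unit squares with a gap between them
two_squares <- rbind(cbind(runif(150), runif(150)),
                     cbind(runif(150, 3, 4), runif(150)))
kd_two_squares <- kd_uniform_cluster(two_squares, k = 2)
kd_two_squares
```

Because nearest-neighbour distances only reflect local point density, a measure of this kind cannot see the gap between the two squares, which is the shortcoming noted above.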
Byers, S. and Raftery, A. E. (1998) Nearest-Neighbor Clutter Removal for Estimating Features in Spatial Point Processes, Journal of the American Statistical Association, 93, 577-584.
Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282
cqcluster.stats, cluster.stats for more cluster validity statistics.
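library(fpc)  # provides rFace() and distrsimilarity()
set.seed(20000)
options(digits=3)
# Simulate a "face"-shaped two-dimensional dataset and cluster it with 3-means
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
km3 <- kmeans(face,3)
distrsimilarity(face,km3$cluster)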