Run and validate many clusterings
This runs the methodology explained in Hennig (2019) and Akhanli and
Hennig (2020). It runs a user-specified set of clustering methods
(CBI-functions, see kmeansCBI) with several numbers of clusters on a dataset,
and computes many cluster validation indexes. In order to explore the
variation of these indexes, random clusterings on the data are
generated, and validation indexes are standardised by use of the
random clusterings in order to make them comparable and differences
between values interpretable.
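The standardisation idea described above can be sketched in base R. This is an illustration only, not the code used by clusterbenchstats: the helper ave_within is hypothetical, the random clusterings are plain random label assignments (a simplification of the random clustering generators controlled by nnruns, kmruns, fnruns and avenruns), and a z-score over all collected values is assumed as the standardisation.

```r
# Illustrative sketch only; NOT the fpc implementation.
set.seed(1)
# Two well separated groups of 20 points each in two dimensions.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
d <- as.matrix(dist(x))

# Hypothetical helper: average within-cluster dissimilarity
# (smaller values indicate more homogeneous clusters).
ave_within <- function(d, cl) {
  same <- outer(cl, cl, "==") & upper.tri(d)
  mean(d[same])
}

# Index value for a "real" clustering (k-means with 2 clusters).
real_cl <- kmeans(x, centers = 2, nstart = 5)$cluster
real_value <- ave_within(d, real_cl)

# Index values for 100 random clusterings with the same number of clusters.
# Assumption: labels are simply shuffled, which is cruder than what fpc does.
random_values <- replicate(
  100, ave_within(d, sample(rep(1:2, length.out = nrow(x))))
)

# Standardise the real value by mean and standard deviation of all values.
all_values <- c(real_value, random_values)
z_real <- (real_value - mean(all_values)) / sd(all_values)
z_real
```

Because a smaller average within-cluster dissimilarity is better here, a clearly negative z_real indicates that the real clustering is more homogeneous than the random ones.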
The function print.valstat can be used to provide weights for the cluster
validation statistics, and will then compute a weighted validation index
that can be used to compare all clusterings.
See the examples for how to get the indexes A1 and A2 from Akhanli and Hennig (2020).
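As a rough sketch of what a weighted validation index means, the following base R code aggregates already standardised index values with user-chosen weights. The values in z, the row names and the choice of a weighted average (dividing by the sum of weights) are assumptions made for illustration; see ?print.valstat for how the aggregation is actually computed.

```r
# Illustrative sketch only; all numbers below are made up.
# Rows: clusterings, columns: standardised validation indexes
# (assumed to be oriented so that larger values are better).
z <- rbind(kmeans_2  = c(avewithin = 1.2, sindex = 0.4, asw = 0.9),
           average_2 = c(avewithin = 0.8, sindex = 1.1, asw = 0.7))

# Weights in the same order as the columns of z; 0 ignores an index.
weights <- c(avewithin = 1, sindex = 0, asw = 1)

# Weighted average of the standardised indexes for every clustering.
aggregate_index <- as.vector(z %*% weights) / sum(weights)
names(aggregate_index) <- rownames(z)

# Clustering with the largest aggregated value.
best <- names(which.max(aggregate_index))
```

The weight vector must have one entry per validation statistic, in the order in which the statistics are stored.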
clusterbenchstats(data,G,diss = inherits(data, "dist"),
                  scaling=TRUE, clustermethod,
                  methodnames=clustermethod,
                  distmethod=rep(TRUE,length(clustermethod)),
                  ncinput=rep(TRUE,length(clustermethod)),
                  clustermethodpars,
                  npstats=FALSE,
                  useboot=FALSE,
                  bootclassif=NULL,
                  bootmethod="nselectboot",
                  bootruns=25,
                  trace=TRUE,
                  pamcrit=TRUE,snnk=2,
                  dnnk=2,
                  nnruns=100,kmruns=100,fnruns=100,avenruns=100,
                  multicore=FALSE,cores=detectCores()-1,
                  useallmethods=TRUE,
                  useallg=FALSE,...)

## S3 method for class 'clusterbenchstats'
print(x,...)
data |
data matrix or |
G |
vector of integers. Numbers of clusters to consider. |
diss |
logical. If |
scaling |
either a logical or a numeric vector of length equal to
the number of columns of |
clustermethod |
vector of strings specifying names of
CBI-functions (see |
methodnames |
vector of strings with user-chosen names for
clustering methods, one for every method in
|
distmethod |
vector of logicals, of the same length as
|
ncinput |
vector of logicals, of the same length as
|
clustermethodpars |
list of the same length as
|
npstats |
logical. If |
useboot |
logical. If |
bootclassif |
If |
bootmethod |
either |
bootruns |
integer. Number of resampling runs. If
|
trace |
logical. If |
pamcrit |
logical. If |
snnk |
integer. Number of neighbours used in coefficient of
variation of distance to nearest within cluster neighbour, the
|
dnnk |
integer. Number of nearest neighbors to use for
dissimilarity to the uniform in case that |
nnruns |
integer. Number of runs of |
kmruns |
integer. Number of runs of
|
fnruns |
integer. Number of runs of |
avenruns |
integer. Number of runs of |
multicore |
logical. If |
cores |
integer. Number of cores for parallelisation. |
useallmethods |
logical, to be passed on to
|
useallg |
logical to be passed on to
|
... |
further arguments to be passed on to
|
x |
object of class |
The output of clusterbenchstats is a big list of lists comprising lists
cm, stat, sim, qstat, sstat.
cm |
output object of |
.
stat |
object of class |
sim |
output object of |
qstat |
object of class |
sstat |
object of class |
This may require a lot of computing time and also memory for datasets that are not small, as most indexes require computation and storage of distances.
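To get a feeling for the memory needed, the size of a single dissimilarity object can be estimated with simple arithmetic. The sketch below assumes that the n(n-1)/2 pairwise distances are stored as 8-byte doubles, as in a base R dist object; the helper name dist_bytes is hypothetical, and the actual peak memory use of clusterbenchstats will be higher.

```r
# Rough estimate of the storage needed for all pairwise distances
# between n observations (lower triangle only, 8 bytes per value).
dist_bytes <- function(n, bytes_per_value = 8) {
  n * (n - 1) / 2 * bytes_per_value
}

n <- c(1000, 10000, 100000)
data.frame(n = n, gigabytes = dist_bytes(n) / 1e9)
```

Under these assumptions, storage grows quadratically in n: about 0.004 GB for 1000 observations, 0.4 GB for 10000 and 40 GB for 100000.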
Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282
Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822
set.seed(20000)
options(digits=3)
face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
clustermethod=c("kmeansCBI","hclustCBI")
# A clustering method can be used more than once, with different
# parameters
clustermethodpars <- list()
clustermethodpars[[2]] <- list()
clustermethodpars[[2]]$method <- "average"
# Last element of clustermethodpars needs to have an entry!
methodname <- c("kmeans","average")
cbs <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
  methodname=methodname,distmethod=rep(FALSE,2),
  clustermethodpars=clustermethodpars,nnruns=1,kmruns=1,fnruns=1,avenruns=1)
print(cbs)
print(cbs$qstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,1,0,1,0,1,0,0,1,1))
# The weights are weights for the validation statistics ordered as in
# cbs$qstat$statistics for computation of an aggregated index, see
# ?print.valstat.

# Now using bootstrap stability assessment as in Akhanli and Hennig (2020):
bootclassif <- c("centroid","averagedist")
cbsboot <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
  methodname=methodname,distmethod=rep(FALSE,2),
  clustermethodpars=clustermethodpars,
  useboot=TRUE,bootclassif=bootclassif,bootmethod="nselectboot",
  bootruns=2,nnruns=1,kmruns=1,fnruns=1,avenruns=1,useallg=TRUE)
print(cbsboot)
## Not run:
# Index A1 in Akhanli and Hennig (2020) (need these weights choices):
print(cbsboot$sstat,aggregate=TRUE,weights=c(1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0))
# Index A2 in Akhanli and Hennig (2020) (need these weights choices):
print(cbsboot$sstat,aggregate=TRUE,weights=c(0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0))
## End(Not run)

# Results from nselectboot:
plot(cbsboot$stat,cbsboot$sim,statistic="boot")