Clusterwise cluster stability assessment by resampling
Assessment of the clusterwise stability of a clustering of data, which can be cases*variables or dissimilarity data. The data is resampled using several schemes (bootstrap, subsetting, jittering, replacement of points by noise) and the Jaccard similarities of the original clusters to the most similar clusters in the resampled data are computed. The mean over these similarities is used as an index of the stability of a cluster (other statistics can be computed as well). The methods are described in Hennig (2007).
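The Jaccard comparison described above can be illustrated with a minimal base-R sketch. It is not the package's internal code: the helper names `jaccard` and `max_jaccard` are hypothetical, and it assumes that both clusterings are given as lists of logical membership vectors over the same set of points.

```r
# Jaccard similarity between two clusters given as logical membership vectors.
jaccard <- function(a, b) {
  union_size <- sum(a | b)
  if (union_size == 0) return(0)  # convention chosen here for two empty sets
  sum(a & b) / union_size
}

# For each original cluster, the similarity to its most similar cluster
# among those found on the resampled data.
max_jaccard <- function(original, resampled) {
  vapply(original, function(o) {
    max(vapply(resampled, function(r) jaccard(o, r), numeric(1)))
  }, numeric(1))
}

# Toy data (hypothetical): six points, two clusters in each clustering.
original  <- list(c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
                  c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE))
resampled <- list(c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
                  c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE))

best <- max_jaccard(original, resampled)
print(best)  # 2/3 for the first cluster, 3/4 for the second
```

Averaging such values over many resampling runs gives the clusterwise stability index.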
clusterboot is an integrated function that computes the clustering as well, using interface functions for various clustering methods implemented in R (several interface functions are provided, but you can implement further ones for your favourite clustering method). See the documentation of the input parameter clustermethod below.
Quite general clustering methods are possible, i.e. methods estimating or fixing the number of clusters, methods producing overlapping clusters or not assigning all cases to clusters (but declaring them as "noise"). Fuzzy clusterings cannot be processed and have to be transformed to crisp clusterings by the interface function.
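One simple way an interface function might turn a fuzzy clustering into a crisp one is to assign each case to the cluster with the highest membership value. The sketch below assumes a hypothetical membership matrix with cases in rows and clusters in columns; other conversion rules are equally possible.

```r
# Hypothetical fuzzy membership matrix: rows are cases, columns are clusters.
membership <- rbind(c(0.9, 0.1),
                    c(0.2, 0.8),
                    c(0.6, 0.4),
                    c(0.3, 0.7))

# Crisp assignment: the column index of the largest membership in each row.
crisp <- max.col(membership, ties.method = "first")

# One logical membership vector per cluster.
clusterlist <- lapply(seq_len(ncol(membership)), function(k) crisp == k)
print(crisp)
```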
clusterboot(data, B=100, distances=(inherits(data, "dist")),
            bootmethod="boot", bscompare=TRUE, multipleboot=FALSE,
            jittertuning=0.05, noisetuning=c(0.05,4),
            subtuning=floor(nrow(data)/2),
            clustermethod, noisemethod=FALSE, count=TRUE,
            showplots=FALSE, dissolution=0.5, recover=0.75,
            seed=NULL, datatomatrix=TRUE, ...)

## S3 method for class 'clboot'
print(x, statistics=c("mean","dissolution","recovery"), ...)

## S3 method for class 'clboot'
plot(x, xlim=c(0,1), breaks=seq(0,1,by=0.05), ...)
data |
by default something that can be coerced into a
(numerical) matrix (data frames with non-numerical data are allowed
when using |
B |
integer. Number of resampling runs for each scheme, see
|
distances |
logical. If |
bootmethod |
vector of strings, defining the methods used for resampling. Possible methods:
Important: only the methods The results in Hennig (2007) indicate that |
bscompare |
logical. If |
multipleboot |
logical. If |
jittertuning |
positive numeric. Tuning for the
|
noisetuning |
A vector of two positive numerics. Tuning for the
|
subtuning |
integer. Size of subsets for |
clustermethod |
an interface function (the function itself has to be provided, not a string containing its name!). This defines the clustering method. See the "Details" section for a list of available interface functions and guidelines on how to write your own. |
noisemethod |
logical. If |
count |
logical. If |
showplots |
logical. If |
dissolution |
numeric between 0 and 1. If the Jaccard similarity between the resampling version of the original cluster and the most similar cluster on the resampled data is smaller than or equal to this value, the cluster is considered "dissolved". Numbers of dissolved clusters are recorded. |
recover |
numeric between 0 and 1. If the Jaccard similarity between the resampling version of the original cluster and the most similar cluster on the resampled data is larger than this value, the cluster is considered as "successfully recovered". Numbers of recovered clusters are recorded. |
seed |
integer. Seed for random generator (fed into
|
datatomatrix |
logical. If |
... |
additional parameters for the clustermethods called by
|
x |
object of class |
statistics |
specifies in |
xlim |
transferred to |
breaks |
transferred to |
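As a rough illustration of the resampling arguments above, the following base-R sketch draws case indices the way a bootstrap scheme and a subsetting scheme typically would. It is an assumption-laden illustration, not the package's internal code; `n`, `boot_idx` and `subset_idx` are hypothetical names, and the subset size mirrors the default `subtuning=floor(nrow(data)/2)` from the usage line.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
n <- 20      # hypothetical number of cases

# Bootstrap-style draw: n indices sampled with replacement,
# so some cases appear more than once and others not at all.
boot_idx <- sample(n, size = n, replace = TRUE)

# Subset-style draw: a subset of cases sampled without replacement.
sub_size <- floor(n / 2)
subset_idx <- sample(n, size = sub_size, replace = FALSE)

# Number of distinct cases that made it into the bootstrap sample.
n_unique_boot <- length(unique(boot_idx))
```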
Here are some guidelines for interpretation. There is some theoretical justification to consider a Jaccard similarity value smaller than or equal to 0.5 as an indication of a "dissolved cluster", see Hennig (2008). Generally, a valid, stable cluster should yield a mean Jaccard similarity value of 0.75 or more. Between 0.6 and 0.75, clusters may be considered as indicating patterns in the data, but which points exactly should belong to these clusters is highly doubtful. Clusters with average Jaccard values below 0.6 should not be trusted. "Highly stable" clusters should yield average Jaccard similarities of 0.85 and above. All of this refers to the bootstrap; for the other resampling schemes it depends on the tuning constants, though their default values should allow similar interpretations in most cases.
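These rules of thumb can be written down as a small helper. The function name `interpret_stability` and the label strings are hypothetical shorthand for the guidelines above; the thresholds are the ones stated in the text and apply to bootstrap results.

```r
# Map mean Jaccard similarities to the verbal guidelines given above.
interpret_stability <- function(mean_jaccard) {
  ifelse(mean_jaccard >= 0.85, "highly stable",
  ifelse(mean_jaccard >= 0.75, "stable",
  ifelse(mean_jaccard >= 0.6,  "pattern, membership doubtful",
  ifelse(mean_jaccard >  0.5,  "not trusted", "dissolved"))))
}

labels <- interpret_stability(c(0.9, 0.75, 0.7, 0.55, 0.5))
print(labels)
```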
While B=100 is recommended, fewer runs can still give quite informative results if computation times become too high.
Note that the stability of a cluster is assessed, but
stability is not the only important validity criterion - clusters
obtained by very inflexible clustering methods may be stable but not
valid, as discussed in Hennig (2007).
See plotcluster for graphical cluster validation.
Information about interface functions for clustering methods:
The following interface functions are currently implemented in the present package. Almost all of them require the specification of some control parameters, so if you use one of them, look up their common help page kmeansCBI first:
kmeansCBI: an interface to the function kmeans for k-means clustering. This assumes a cases*variables matrix as input.
hclustCBI: an interface to the function hclust for agglomerative hierarchical clustering with optional noise cluster. This function produces a partition and assumes a cases*variables matrix as input.
hclusttreeCBI: an interface to the function hclust for agglomerative hierarchical clustering. This function produces a tree (not only a partition; therefore the number of clusters can be huge!) and assumes a cases*variables matrix as input.
disthclustCBI: an interface to the function hclust for agglomerative hierarchical clustering with optional noise cluster. This function produces a partition and assumes a dissimilarity matrix as input.
noisemclustCBI: an interface to the function mclustBIC for normal mixture model based clustering. This assumes a cases*variables matrix as input. Warning: mclustBIC sometimes has problems with multiple points. It is recommended to use this only together with multipleboot=FALSE.
distnoisemclustCBI: an interface to the function mclustBIC for normal mixture model based clustering. This assumes a dissimilarity matrix as input and generates a data matrix by multidimensional scaling first. Warning: mclustBIC sometimes has problems with multiple points. It is recommended to use this only together with multipleboot=FALSE.
claraCBI: an interface to the functions pam and clara for partitioning around medoids. This can be used with cases*variables as well as dissimilarity matrices as input.
pamkCBI: an interface to the function pamk for partitioning around medoids. The number of clusters is estimated by the average silhouette width. This can be used with cases*variables as well as dissimilarity matrices as input.
tclustCBI: an interface to the function tclust in the tclust library for trimmed Gaussian clustering. This assumes a cases*variables matrix as input. Note that this function is not currently provided because the tclust package is only available in the CRAN archives, but the code is in the Examples section of the kmeansCBI help page.
dbscanCBI: an interface to the function dbscan for density based clustering. This can be used with cases*variables as well as dissimilarity matrices as input.
mahalCBI: an interface to the function fixmahal for fixed point clustering. This assumes a cases*variables matrix as input.
mergenormCBI: an interface to the function mergenormals for clustering by merging Gaussian mixture components.
speccCBI: an interface to the function specc for spectral clustering.
You can write your own interface function. The first argument of an interface function should preferably be a data matrix (of class "matrix", but it may be a symmetrical dissimilarity matrix). It can be a data frame, but this restricts some of the functionality of clusterboot, see above. Further arguments can be tuning constants for the clustering method. The output of an interface function should be a list containing (at least) the following components:
result: clustering result, usually a list with the full output of the clustering method (the precise format doesn't matter); whatever you want to use later.
nc: number of clusters. If some points don't belong to any cluster but are declared as "noise", nc includes the noise cluster, and there should be another component nccl, being the number of clusters not including the noise cluster (note that it is not mandatory to define a noise component if not all points are assigned to clusters, but if you do it, the stability of the noise cluster is assessed as well).
clusterlist: a list consisting of one logical vector for each cluster, of length n (the number of data points), indicating whether a point is a member of this cluster (TRUE) or not. If a noise cluster is included, it should always be the last vector in this list.
partition: an integer vector of length n, partitioning the data. If the method produces a partition, it should be the clustering. This component is only used for plots, so you could do something like rep(1,n) for non-partitioning methods. If a noise cluster is included, nc=nccl+1 and the noise cluster is cluster no. nc.
clustermethod: a string indicating the clustering method.
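The following is a minimal sketch of a custom interface function that returns the components result, nc, clusterlist, partition and clustermethod. The function name `myhclustCBI` and its argument `k` are hypothetical; it uses only base R (`hclust`, `dist`, `cutree`), assumes a cases*variables matrix as input, and does not produce a noise cluster.

```r
# Hypothetical interface function: average-linkage hierarchical clustering,
# cut into k clusters.
myhclustCBI <- function(data, k, ...) {
  data <- as.matrix(data)
  tree <- hclust(dist(data), method = "average")
  partition <- cutree(tree, k = k)   # integer cluster labels 1..k
  nc <- max(partition)
  # One logical membership vector per cluster.
  clusterlist <- lapply(seq_len(nc), function(j) partition == j)
  list(result = tree,
       nc = nc,
       clusterlist = clusterlist,
       partition = partition,
       clustermethod = "hclust/average (custom interface)")
}

# Toy data (hypothetical): two well-separated groups of 20 points each.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))
out <- myhclustCBI(x, k = 2)
str(out[c("nc", "partition", "clustermethod")])
```

With the package loaded, such a function could then be passed on as `clusterboot(x, B=20, clustermethod=myhclustCBI, k=2)`, with `k` forwarded through the `...` argument.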
clusterboot returns an object of class "clboot", which is a list with components result, partition, nc, clustermethod, B, noisemethod, bootmethod, multipleboot, dissolution, recover, bootresult, bootmean, bootbrd, bootrecover, jitterresult, jittermean, jitterbrd, jitterrecover, subsetresult, subsetmean, subsetbrd, subsetrecover, bojitresult, bojitmean, bojitbrd, bojitrecover, noiseresult, noisemean, noisebrd, noiserecover.
result |
clustering result; full output of the selected
|
partition |
partition parameter of the selected |
nc |
number of clusters in original data (including noise
component if |
nccl |
number of clusters in original data (not including noise
component if |
clustermethod, B, noisemethod, bootmethod, multipleboot, dissolution,
recover |
input parameters, see above. |
bootresult |
matrix of Jaccard similarities for
|
bootmean |
clusterwise means of the |
bootbrd |
clusterwise number of times a cluster has been dissolved. |
bootrecover |
clusterwise number of times a cluster has been successfully recovered. |
subsetresult, subsetmean, etc. |
same as |
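To make the relation between the result matrix and its clusterwise summaries concrete, here is a base-R sketch. It assumes a hypothetical Jaccard matrix with one row per cluster and one column per resampling run (an assumption about the layout), and it follows the wording of the dissolution and recover arguments ("smaller or equal" and "larger than", respectively). The variable names are illustrative only.

```r
# Hypothetical Jaccard similarities: 2 clusters (rows) x 4 runs (columns).
jac <- rbind(c(0.90, 0.80, 0.95, 0.70),
             c(0.40, 0.55, 0.50, 0.65))
dissolution <- 0.5
recover <- 0.75

cluster_mean <- rowMeans(jac)                # analogue of bootmean
n_dissolved  <- rowSums(jac <= dissolution)  # analogue of bootbrd
n_recovered  <- rowSums(jac > recover)       # analogue of bootrecover

print(rbind(cluster_mean, n_dissolved, n_recovered))
```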
Hennig, C. (2007) Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.
Hennig, C. (2008) Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis, 99, 1154-1176.
dist, interface functions: kmeansCBI, hclustCBI, hclusttreeCBI, disthclustCBI, noisemclustCBI, distnoisemclustCBI, claraCBI, pamkCBI, dbscanCBI, mahalCBI
library(fpc)  # provides clusterboot, rFace and the *CBI interface functions
options(digits=3)
set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)

# k-means with three resampling schemes (B=3 only to keep the run short)
cf1 <- clusterboot(face,B=3,bootmethod=c("boot","noise","jitter"),
                   clustermethod=kmeansCBI,krange=5,seed=15555)
print(cf1)
plot(cf1)

# average-linkage hierarchical clustering on a dissimilarity object
cf2 <- clusterboot(dist(face),B=3,bootmethod="subset",
                   clustermethod=disthclustCBI,k=5,cut="number",
                   method="average",showplots=TRUE,seed=15555)
print(cf2)

# non-numerical data frame, therefore datatomatrix=FALSE
d1 <- c("a","b","a","c")
d2 <- c("a","a","a","b")
dx <- as.data.frame(cbind(d1,d2))
cpx <- clusterboot(dx,k=2,B=10,clustermethod=claraCBI,
                   multipleboot=TRUE,usepam=TRUE,datatomatrix=FALSE)
print(cpx)