General Trimming Approach to Robust Cluster Analysis
tclust searches for k (or fewer) clusters with different covariance structures in a data matrix x. Relative cluster scatter can be restricted by a constant value restr.fact. To robustify the estimation, a proportion alpha of observations may be trimmed.
In particular, the trimmed k-means method (tkmeans) is represented by the tclust method when setting the parameters restr = "eigen", restr.fact = 1 and equal.weights = TRUE.
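As a minimal sketch of the trimmed k-means special case described above, the following call passes the three settings named in the text. The synthetic data and the object names x and clus are assumptions made for illustration, and the call only runs if the tclust package is installed.

```r
# Synthetic two-group data (illustrative assumption, not from the package)
set.seed (1)
x <- rbind (matrix (rnorm (200, mean = 0), ncol = 2),
            matrix (rnorm (200, mean = 5), ncol = 2))

# Run only when the tclust package is available
if (requireNamespace ("tclust", quietly = TRUE)) {
  # tkmeans-like setup: eigenvalue restriction with factor 1, equal weights
  clus <- tclust::tclust (x, k = 2, alpha = 0.05, restr = "eigen",
                          restr.fact = 1, equal.weights = TRUE)
  print (table (clus$cluster))
}
```

With restr.fact = 1 all cluster scatters are forced to be equal, so this setup should be read as a stand-in for tkmeans rather than as a general-purpose default.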
tclust (x, k = 3, alpha = 0.05, nstart = 50, iter.max = 20, restr = c ("eigen", "deter", "sigma"), restr.fact = 12, equal.weights = FALSE, center, scale, store.x = TRUE, drop.empty.clust = TRUE, trace = 0, warnings = 3, zero.tol = 1e-16)
x |
A matrix or data.frame of dimension |
k |
The number of clusters initially searched for. |
alpha |
The proportion of observations to be trimmed. |
nstart |
The number of random initializations to be performed. |
iter.max |
The maximum number of concentration steps to be performed. The concentration steps stop as soon as two consecutive steps lead to the same data partition. |
restr |
The type of restriction to be applied on the cluster scatter matrices.
Valid values are |
restr.fact |
The constant |
equal.weights |
A logical value, specifying whether equal cluster weights ( |
center, scale |
A center and scale vector, each of length |
store.x |
A logical value, specifying whether the data matrix |
drop.empty.clust |
A logical value, specifying whether empty clusters shall be omitted in the
resulting object.
(The result structure does not contain center and covariance estimates of
empty clusters anymore.
Cluster names are reassigned such that the first |
trace |
Defines the tracing level, which is set to |
warnings |
The warning level (0: no warnings; 1: warnings on unexpected behavior;
2: warnings if |
zero.tol |
The zero tolerance used. By default set to 1e-16. |
This iterative algorithm initializes k clusters randomly and performs "concentration steps" in order to improve the current cluster assignment. The maximum number of concentration steps to be performed is given by iter.max. To approximate the global optimum, the system is initialized nstart times, and concentration steps are performed until convergence or until iter.max is reached. When processing more complex data sets, higher values of nstart and iter.max have to be specified (implying extra computation time). However, if more than half of the iterations do not converge, a warning message is issued, indicating that nstart has to be increased.
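The loop described above can be illustrated with a simplified trimmed k-means sketch in base R. This is not the tclust implementation: the helper name tkmeans_sketch, the spherical (Euclidean) distance, and the floor () rounding of the trimmed count are all assumptions chosen to keep the example short.

```r
# Simplified trimmed k-means (hypothetical helper, not tclust internals)
tkmeans_sketch <- function (x, k, alpha = 0.05, nstart = 20, iter.max = 20) {
  x <- as.matrix (x)
  n <- nrow (x)
  n.keep <- n - floor (n * alpha)          # assumption: floor () rounding
  best <- NULL
  for (s in seq_len (nstart)) {            # nstart random initializations
    centers <- x[sample (n, k), , drop = FALSE]
    cluster <- rep (NA_integer_, n)
    converged <- FALSE
    for (it in seq_len (iter.max)) {       # at most iter.max concentration steps
      # squared Euclidean distance of every observation to every center
      d <- sapply (seq_len (k), function (j)
        rowSums (sweep (x, 2, centers[j, ])^2))
      nearest <- max.col (-d, ties.method = "first")
      d.min <- d[cbind (seq_len (n), nearest)]
      # keep the n.keep closest observations, trim the rest (label 0)
      keep <- rank (d.min, ties.method = "first") <= n.keep
      new.cluster <- ifelse (keep, nearest, 0L)
      # stop when two consecutive steps give the same partition
      if (identical (new.cluster, cluster)) { converged <- TRUE; break }
      cluster <- new.cluster
      for (j in seq_len (k))
        if (any (cluster == j))
          centers[j, ] <- colMeans (x[cluster == j, , drop = FALSE])
    }
    obj <- sum (d.min[keep])               # objective at the last assignment
    if (is.null (best) || obj < best$obj)
      best <- list (centers = centers, cluster = cluster,
                    obj = obj, converged = converged)
  }
  best
}

set.seed (1)
x <- rbind (matrix (rnorm (200, mean = 0), ncol = 2),
            matrix (rnorm (200, mean = 6), ncol = 2),
            matrix (runif (20, -20, 20), ncol = 2))   # scattered noise
fit <- tkmeans_sketch (x, k = 2, alpha = 0.1)
table (fit$cluster)
```

The sketch keeps the best of the nstart runs by objective value, which mirrors the role of nstart in the text. It omits covariance estimation, cluster weights and the scatter restrictions entirely.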
The parameter restr defines the clusters' shape restrictions, which are applied to all clusters during each iteration. Options "eigen"/"deter" restrict the ratio between the maximum and minimum eigenvalue/determinant of all clusters' covariance structures to parameter restr.fact. Setting restr.fact to 1 yields the strongest restriction, forcing all eigenvalues/determinants to be equal, so the method looks for similarly scattered (respectively spherical) clusters. Option "sigma" is a simpler restriction, which averages the covariance structures during each iteration (weighted by cluster sizes) in order to get similar (equal) cluster scatters.
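To make the three restriction types concrete, the following base R sketch computes the quantities they refer to for two hypothetical covariance matrices. The matrices, the cluster sizes and all variable names are assumptions for illustration. The sketch only measures the ratios and the pooled matrix, it does not show how tclust enforces the restriction internally.

```r
# Two hypothetical cluster covariance matrices and cluster sizes
cov.list <- list (matrix (c (4, -2, -2, 4), 2, 2), diag (2))
sizes <- c (540, 360)

# "eigen": ratio between the largest and smallest eigenvalue over all clusters
ev <- unlist (lapply (cov.list, function (S)
  eigen (S, symmetric = TRUE, only.values = TRUE)$values))
eigen.ratio <- max (ev) / min (ev)

# "deter": ratio between the largest and smallest determinant over all clusters
dets <- sapply (cov.list, det)
deter.ratio <- max (dets) / min (dets)

# "sigma": covariance matrices averaged with weights given by cluster sizes
pooled <- Reduce (`+`, Map (function (S, n) S * n, cov.list, sizes)) /
  sum (sizes)

# A restriction with factor restr.fact holds when the ratio does not exceed it
restr.fact <- 12
c (eigen = eigen.ratio <= restr.fact, deter = deter.ratio <= restr.fact)
```

A ratio of exactly 1 corresponds to the strongest restriction mentioned in the text, where all eigenvalues (or determinants) coincide.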
The function returns an S3 object of type tclust, containing the following values:
centers |
A matrix of size |
cov |
An array of size |
cluster |
A numerical vector of size |
par |
A list, containing the parameters the algorithm has been called with
( |
k |
The (final) resulting number of clusters.
Some solutions with a smaller number of clusters might be found when using
the option |
obj |
The value of the objective function of the best (returned) solution. |
size |
An integer vector of size k, returning the number of observations contained by each cluster. |
weights |
A numerical vector of length k, containing the weights of each cluster. |
int |
A list of values internally used by function related to |
Agustin Mayo Iscar, Luis Angel Garcia Escudero, Heinrich Fritz
Garcia-Escudero, L.A.; Gordaliza, A.; Matran, C. and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis". Annals of Statistics, Vol.36, 1324-1345. Technical Report available at www.eio.uva.es/inves/grupos/representaciones/trTCLUST.pdf
Fritz, H.; Garcia-Escudero, L.A.; Mayo-Iscar, A. (2012), "tclust: An R Package for a Trimming Approach to Cluster Analysis". Journal of Statistical Software, 47(12), 1-26. URL http://www.jstatsoft.org/v47/i12/
library (tclust)    # provides tclust () and the M5data and swissbank data sets

#--- EXAMPLE 1 ------------------------------------------
sig <- diag (2)
cen <- rep (1, 2)
x <- rbind (mvtnorm::rmvnorm (360, cen * 0, sig),
            mvtnorm::rmvnorm (540, cen * 5, sig * 6 - 2),
            mvtnorm::rmvnorm (100, cen * 2.5, sig * 50))

# Two groups and 10% trimming level
clus <- tclust (x, k = 2, alpha = 0.1, restr.fact = 8)

plot (clus)
plot (clus, labels = "observation")
plot (clus, labels = "cluster")

# Three groups (one of them very scattered) and 0% trimming level
clus <- tclust (x, k = 3, alpha = 0.0, restr.fact = 100)

plot (clus)

#--- EXAMPLE 3 ------------------------------------------
data (M5data)
x <- M5data[, 1:2]

clus.a <- tclust (x, k = 3, alpha = 0.1, restr.fact = 1,
                  restr = "eigen", equal.weights = TRUE, warnings = 1)
clus.b <- tclust (x, k = 3, alpha = 0.1, restr.fact = 1,
                  equal.weights = TRUE, warnings = 1)
clus.c <- tclust (x, k = 3, alpha = 0.1, restr.fact = 1,
                  restr = "deter", equal.weights = TRUE, iter.max = 100,
                  warnings = 1)
clus.d <- tclust (x, k = 3, alpha = 0.1, restr.fact = 50,
                  restr = "eigen", equal.weights = FALSE)

# Show the four solutions in a 2 x 2 grid, then restore the graphics settings
pa <- par (mfrow = c (2, 2))
plot (clus.a, main = "(a) tkmeans")
plot (clus.b, main = "(b) Gallegos and Ritter")
plot (clus.c, main = "(c) Gallegos")
plot (clus.d, main = "(d) tclust")
par (pa)

#--- EXAMPLE 4 ------------------------------------------
data (swissbank)

# Two clusters and 8% trimming level
clus <- tclust (swissbank, k = 2, alpha = 0.08, restr.fact = 50)

# Pairs plot of the clustering solution
pairs (swissbank, col = clus$cluster + 1)

# Two coordinates
plot (swissbank[, 4], swissbank[, 6], col = clus$cluster + 1,
      xlab = "Distance of the inner frame to lower border",
      ylab = "Length of the diagonal")
plot (clus)

# Three clusters and 0% trimming level
clus <- tclust (swissbank, k = 3, alpha = 0.0, restr.fact = 110)

# Pairs plot of the clustering solution
pairs (swissbank, col = clus$cluster + 1)

# Two coordinates
plot (swissbank[, 4], swissbank[, 6], col = clus$cluster + 1,
      xlab = "Distance of the inner frame to lower border",
      ylab = "Length of the diagonal")
plot (clus)