tclust: tclust – R documentation


General Trimming Approach to Robust Cluster Analysis

Description

tclust searches for k (or fewer) clusters with different covariance structures in a data matrix x. Relative cluster scatter can be restricted by a constant value restr.fact. To make the estimation robust, a proportion alpha of observations may be trimmed. In particular, the trimmed k-means method (tkmeans) is obtained from tclust by setting restr = "eigen", restr.fact = 1 and equal.weights = TRUE.

Usage

tclust (x, k = 3, alpha = 0.05, nstart = 50, iter.max = 20, 
        restr = c ("eigen", "deter", "sigma"), restr.fact = 12, 
        equal.weights = FALSE, center, scale, store.x = TRUE, 
        drop.empty.clust = TRUE, trace = 0, warnings = 3, 
        zero.tol = 1e-16)

Arguments

`x`	A matrix or data.frame of dimension `n` x `p`, containing the observations (row-wise).
`k`	The number of clusters initially searched for.
`alpha`	The proportion of observations to be trimmed.

`nstart`	The number of random initializations to be performed.
`iter.max`	The maximum number of concentration steps to be performed. The concentration steps stop as soon as two consecutive steps lead to the same data partition.
`restr`	The type of restriction to be applied on the cluster scatter matrices. Valid values are `"eigen"` (default), `"deter"` and `"sigma"`. See the detail section for further explanation.
`restr.fact`	The constant `restr.fact >= 1` constrains the allowed differences among group scatters. Larger values imply larger differences of group scatters, a value of 1 specifies the strongest restriction. When using `restr = "sigma"` this parameter is not considered, as all cluster variances are averaged, always implying `restr.fact = 1`.
`equal.weights`	A logical value, specifying whether equal cluster weights (`TRUE`) or not (`FALSE`) shall be considered in the concentration and assignment steps.
`center, scale`	A center and a scale vector, each of length `p`, which can optionally be specified for centering and scaling `x` before calculation.
`store.x`	A logical value, specifying whether the data matrix `x` shall be included in the result structure. By default this value is set to `TRUE`, because functions `plot.tclust` and `DiscrFact` depend on this information. However, when big data matrices are handled, the result structure's size can be decreased noticeably when setting this parameter to `FALSE`.
`drop.empty.clust`	A logical value, specifying whether empty clusters shall be omitted in the resulting object. If so, the result structure no longer contains center and covariance estimates of empty clusters, and cluster names are reassigned such that the first `l` clusters (`l <= k`) always have at least one observation.

`trace`	Defines the tracing level, which is set to 0 by default. Tracing level 2 gives additional information on the iteratively decreasing objective function's value.

`warnings`	The warning level (0: no warnings; 1: warnings on unexpected behavior; 2: warnings if `restr.fact` causes artificially restricted results).
`zero.tol`	The zero tolerance used. By default set to 1e-16.

Details

This iterative algorithm initializes k clusters randomly and performs "concentration steps" in order to improve the current cluster assignment. The maximum number of concentration steps to be performed is given by iter.max. To approximate the global optimum, the system is initialized nstart times, and concentration steps are performed until convergence or until iter.max is reached. When processing more complex data sets, higher values of nstart and iter.max have to be specified (implying extra computation time). If more than half of the iterations do not converge, a warning message is issued, indicating that nstart has to be increased.
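The idea of a concentration step can be illustrated for the trimmed k-means special case. The following is a minimal base-R sketch, not the tclust implementation: it assumes spherical, equally weighted clusters, uses a single random start instead of nstart starts, and trims floor(n * alpha) observations (the rounding rule is an assumption). The helper name c_step and the simulated data are made up for illustration.

```r
# Base-R sketch of concentration steps for trimmed k-means (illustrative only).
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # 50 points near (0, 0)
           matrix(rnorm(100, mean = 6), ncol = 2),   # 50 points near (6, 6)
           matrix(runif(20, -10, 16), ncol = 2))     # 10 scattered points
k <- 2; alpha <- 0.1; iter.max <- 20
n <- nrow(x)
n.keep <- n - floor(n * alpha)                       # assumed rounding rule

# One concentration step: assign, trim, re-estimate the centers.
c_step <- function(x, centers) {
  # Squared distances from every observation to every center (n x k matrix)
  d2 <- sapply(seq_len(nrow(centers)), function(j)
    rowSums(sweep(x, 2, centers[j, ])^2))
  nearest <- max.col(-d2, ties.method = "first")     # index of closest center
  dmin <- d2[cbind(seq_len(nrow(x)), nearest)]
  keep <- order(dmin)[seq_len(n.keep)]               # smallest distances are kept
  cluster <- integer(nrow(x))                        # 0 marks trimmed observations
  cluster[keep] <- nearest[keep]
  for (j in seq_len(nrow(centers))) {
    if (any(cluster == j))
      centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
  }
  list(centers = centers, cluster = cluster, obj = sum(dmin[keep]))
}

centers <- x[sample(n, k), , drop = FALSE]           # random initialization
cluster <- rep(-1L, n)
for (it in seq_len(iter.max)) {
  fit <- c_step(x, centers)
  if (identical(fit$cluster, cluster)) break         # same partition: stop
  centers <- fit$centers
  cluster <- fit$cluster
}
table(cluster)
```

Repeating the loop from several random initializations and keeping the solution with the best objective value would mimic the role of nstart.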

The parameter restr defines the clusters' shape restrictions, which are applied to all clusters during each iteration. Options "eigen"/"deter" restrict the ratio between the maximum and minimum eigenvalue/determinant over all clusters' covariance structures to at most restr.fact. Setting restr.fact to 1 yields the strongest restriction, forcing all eigenvalues/determinants to be equal, so the method looks for spherical (for "eigen") or similarly scattered (for "deter") clusters. Option "sigma" is a simpler restriction, which averages the covariance structures during each iteration (weighted by cluster sizes) in order to get equal cluster scatters.
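To make the three restriction types concrete, the following base-R sketch computes the quantities they refer to for two hypothetical covariance matrices. It only evaluates the ratios and the pooled matrix. It does not reproduce how tclust adjusts the estimates when a restriction is violated. The matrices S1 and S2 and the cluster sizes are invented for illustration.

```r
# Two hypothetical 2 x 2 cluster covariance matrices and cluster sizes
S1 <- matrix(c(4, 1, 1, 2), 2, 2)
S2 <- diag(c(0.5, 0.1))
covs <- list(S1, S2)
sizes <- c(60, 40)

# "eigen": ratio of the largest to the smallest eigenvalue over all clusters
ev <- unlist(lapply(covs, function(S) eigen(S, symmetric = TRUE)$values))
eigen.ratio <- max(ev) / min(ev)

# "deter": ratio of the largest to the smallest determinant over all clusters
dets <- sapply(covs, det)
det.ratio <- max(dets) / min(dets)

# "sigma": one pooled covariance matrix, weighted by cluster sizes
pooled <- Reduce(`+`, Map(function(S, n) S * n, covs, sizes)) / sum(sizes)

restr.fact <- 12
c(eigen = eigen.ratio, deter = det.ratio)
eigen.ratio <= restr.fact   # would this configuration satisfy "eigen"?
```

Here both ratios exceed 12, so a fit with restr.fact = 12 could not return these two matrices unchanged under either "eigen" or "deter".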

Value

The function returns an S3 object of type tclust, containing the following values:

`centers`	A matrix of size `p` x `k` containing the centers (column-wise) of each cluster.
`cov`	An array of size `p` x `p` x `k` containing the covariance matrices of each cluster.
`cluster`	A numerical vector of size `n` containing the cluster assignment for each observation. Cluster names are integer numbers from `1` to `k`, `0` indicates trimmed observations.
`par`	A list, containing the parameters the algorithm has been called with (`x`, if not suppressed by `store.x = FALSE`, `k`, `alpha`, `restr.fact`, `nstart`, `KStep`, and `equal.weights`).
`k`	The (final) resulting number of clusters. Some solutions with a smaller number of clusters might be found when using the option `equal.weights = FALSE`.
`obj`	The value of the objective function of the best (returned) solution.
`size`	An integer vector of size k, returning the number of observations contained by each cluster.
`weights`	A numerical vector of length k, containing the weights of each cluster.
`int`	A list of values internally used by functions related to `tclust` objects.
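As a short sketch of how the returned components might be inspected, the snippet below fits a model to simulated data and reads a few of the values listed above. It assumes the tclust package is installed in the version documented here. The simulated data and the variable name trimmed are illustrative only.

```r
library(tclust)

set.seed(42)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),   # 100 points near (0, 0)
           matrix(rnorm(200, mean = 8), ncol = 2))   # 100 points near (8, 8)

clus <- tclust(x, k = 2, alpha = 0.05, restr.fact = 12)

table(clus$cluster)                          # label 0 collects trimmed observations
trimmed <- x[clus$cluster == 0, , drop = FALSE]
t(clus$centers)                              # centers are column-wise; transpose to one row per cluster
dim(clus$cov)                                # p x p x k array of covariance matrices
clus$obj                                     # objective value of the returned solution
```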

Author(s)

Agustin Mayo Iscar, Luis Angel Garcia Escudero, Heinrich Fritz

References

Garcia-Escudero, L.A.; Gordaliza, A.; Matran, C. and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis". Annals of Statistics, Vol.36, 1324-1345. Technical Report available at www.eio.uva.es/inves/grupos/representaciones/trTCLUST.pdf
Fritz, H.; Garcia-Escudero, L.A.; Mayo-Iscar, A. (2012), "tclust: An R Package for a Trimming Approach to Cluster Analysis". Journal of Statistical Software, 47(12), 1-26. URL http://www.jstatsoft.org/v47/i12/

Examples

#--- EXAMPLE 1 ------------------------------------------
sig <- diag (2)
cen <- rep (1,2)
x <- rbind(mvtnorm::rmvnorm(360, cen * 0,   sig),
           mvtnorm::rmvnorm(540, cen * 5,   sig * 6 - 2),
           mvtnorm::rmvnorm(100, cen * 2.5, sig * 50)
           )

# Two groups and 10% trimming level
clus <- tclust (x, k = 2, alpha = 0.1, restr.fact = 8)

plot (clus)
plot (clus, labels = "observation")
plot (clus, labels = "cluster")

# Three groups (one of them very scattered) and 0% trimming level
clus <- tclust (x, k = 3, alpha=0.0, restr.fact = 100)

plot (clus)






#--- EXAMPLE 3 ------------------------------------------
data (M5data)
x <- M5data[, 1:2]

clus.a <- tclust (x, k = 3, alpha = 0.1, restr.fact =  1,
                  restr = "eigen", equal.weights = TRUE, warnings = 1)
clus.b <- tclust (x, k = 3, alpha = 0.1, restr.fact =  1,
                  restr = "sigma", equal.weights = TRUE, warnings = 1)
clus.c <- tclust (x, k = 3, alpha = 0.1, restr.fact =  1,
                  restr = "deter", equal.weights = TRUE, iter.max = 100,
                  warnings = 1)
clus.d <- tclust (x, k = 3, alpha = 0.1, restr.fact = 50,
                  restr = "eigen", equal.weights = FALSE)

pa <- par (mfrow = c (2, 2))
plot (clus.a, main = "(a) tkmeans")
plot (clus.b, main = "(b) Gallegos and Ritter")
plot (clus.c, main = "(c) Gallegos")
plot (clus.d, main = "(d) tclust")
par (pa)

#--- EXAMPLE 4 ------------------------------------------
data (swissbank)
# Two clusters and 8% trimming level
clus <- tclust (swissbank, k = 2, alpha = 0.08, restr.fact = 50)

# Pairs plot of the clustering solution
pairs (swissbank, col = clus$cluster + 1)
# Two coordinates
plot (swissbank[, 4], swissbank[, 6], col = clus$cluster + 1,
      xlab = "Distance of the inner frame to lower border",
      ylab = "Length of the diagonal")
plot (clus)

# Three clusters and 0% trimming level
clus <- tclust (swissbank, k = 3, alpha = 0.0, restr.fact = 110)

# Pairs plot of the clustering solution
pairs (swissbank, col = clus$cluster + 1)

# Two coordinates
plot (swissbank[, 4], swissbank[, 6], col = clus$cluster + 1,
      xlab = "Distance of the inner frame to lower border",
      ylab = "Length of the diagonal")

plot (clus)

tclust

Robust Trimmed Clustering

v1.4-2

GPL-3

Initial release

2020-09-28
