
validation_kproto

Validating k Prototypes Clustering


Description

Calculates the preferred validation index for a k-prototypes clustering with k clusters, or determines the optimal number of clusters based on the chosen index for k-prototypes clustering. Possible validation indices are: cindex, dunn, gamma, gplus, mcclain, ptbiserial, silhouette and tau.

Usage

validation_kproto(
  method = NULL,
  object = NULL,
  data = NULL,
  k = NULL,
  lambda = NULL,
  kp_obj = "optimal",
  ...
)

Arguments

method

character specifying the validation index: cindex, dunn, gamma, gplus, mcclain, ptbiserial, silhouette or tau.

object

Object of class kproto resulting from a call with kproto(..., keep.data=TRUE)

data

Original data; only required if object == NULL and ignored if object != NULL.

k

Vector specifying the search range for the optimal number of clusters; if NULL, the range is set to 2:sqrt(n). Only required if object == NULL and ignored if object != NULL.

lambda

Factor to trade off between Euclidean distance of numeric variables and simple matching coefficient between categorical variables.

kp_obj

character, either "optimal" or "all": return the index-optimal clustering (kp_obj == "optimal") or all computed cluster partitions (kp_obj == "all"); only relevant when the optimal number of clusters is searched for, i.e. if object == NULL.

...

Further arguments passed to kproto, like:

  • nstart: If > 1, kproto is computed repeatedly with random initializations.

  • verbose: Logical indicating whether information about the cluster procedure should be printed. Caution: If verbose=FALSE, a reduction of the number of clusters is not reported.
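
The role of lambda can be illustrated with a minimal sketch. The helper mixed_dist below is hypothetical and not part of clustMixType; it assumes the usual k-prototypes dissimilarity, i.e. squared Euclidean distance on the numeric variables plus lambda times the number of mismatching categorical variables.

```r
# Hypothetical helper (not part of clustMixType): dissimilarity between one
# observation and one prototype, ASSUMING squared Euclidean distance on the
# numeric variables plus lambda times the number of categorical mismatches.
mixed_dist <- function(x_num, p_num, x_cat, p_cat, lambda) {
  numeric_part     <- sum((x_num - p_num)^2)   # numeric contribution
  categorical_part <- sum(x_cat != p_cat)      # simple matching: count mismatches
  numeric_part + lambda * categorical_part
}

# Illustrative observation and prototype
x_num <- c(1.0, 2.0); p_num <- c(0.0, 0.0)
x_cat <- c("A", "B"); p_cat <- c("A", "A")

# The same pair under a small and a large lambda
d_small <- mixed_dist(x_num, p_num, x_cat, p_cat, lambda = 0.5)
d_large <- mixed_dist(x_num, p_num, x_cat, p_cat, lambda = 4)
```

A larger lambda gives the categorical mismatch more weight relative to the numeric part, so d_large exceeds d_small although the data are unchanged.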

Details

More information about the implemented validation indices:

  • cindex

    Cindex = \frac{S_w-S_{min}}{S_{max}-S_{min}}


    For S_{min} and S_{max} it is necessary to calculate the distances between all pairs of points in the entire data set (\frac{n(n-1)}{2} pairs). With N_w denoting the total number of pairs of objects belonging to the same cluster, S_{min} is the sum of the N_w smallest distances and S_{max} is the sum of the N_w largest distances. S_w is the sum of the within-cluster distances.
    The minimum value of the index is used to indicate the optimal number of clusters.

  • dunn

    Dunn = \frac{\min_{1 \leq i < j \leq q} d(C_i, C_j)}{\max_{1 \leq k \leq q} diam(C_k)}


    Here q is the number of clusters. The dissimilarity between two clusters C_i and C_j is defined as d(C_i, C_j)=\min_{x \in C_i, y \in C_j} d(x,y) and the diameter of a cluster is defined as diam(C_k)=\max_{x,y \in C_k} d(x,y).
    The maximum value of the index is used to indicate the optimal number of clusters.

  • gamma

    Gamma = \frac{s(+)-s(-)}{s(+)+s(-)}


    Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities. s(+) is the number of concordant comparisons and s(-) is the number of discordant comparisons. A comparison is named concordant (resp. discordant) if a within-cluster dissimilarity is strictly less (resp. strictly greater) than a between-cluster dissimilarity.
    The maximum value of the index is used to indicate the optimal number of clusters.

  • gplus

    Gplus = \frac{2 \cdot s(-)}{\frac{n(n-1)}{2} \cdot (\frac{n(n-1)}{2}-1)}


    Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities. s(-) is the number of discordant comparisons and a comparison is named discordant if a within-cluster dissimilarity is strictly greater than a between-cluster dissimilarity.
    The minimum value of the index is used to indicate the optimal number of clusters.

  • mcclain

    McClain = \frac{\bar{S}_w}{\bar{S}_b}


    \bar{S}_w is the sum of within-cluster distances divided by the number of within-cluster distances and \bar{S}_b is the sum of between-cluster distances divided by the number of between-cluster distances.
    The minimum value of the index is used to indicate the optimal number of clusters.

  • ptbiserial

    Ptbiserial = \frac{(\bar{S}_b-\bar{S}_w) \cdot (\frac{N_w \cdot N_b}{N_t^2})^{0.5}}{s_d}


    \bar{S}_w is the sum of within-cluster distances divided by the number of within-cluster distances and \bar{S}_b is the sum of between-cluster distances divided by the number of between-cluster distances.
    N_t is the total number of pairs of objects in the data, N_w is the total number of pairs of objects belonging to the same cluster and N_b is the total number of pairs of objects belonging to different clusters. s_d is the standard deviation of all distances.
    The maximum value of the index is used to indicate the optimal number of clusters.

  • silhouette

    Silhouette = \frac{1}{n} \sum_{i=1}^n \frac{b(i)-a(i)}{\max(a(i),b(i))}


    a(i) is the average dissimilarity of the ith object to all other objects of its own cluster. b(i)=\min_C d(i,C), where d(i,C) is the average dissimilarity of the ith object to all objects of cluster C and the minimum is taken over all clusters other than the object's own.
    The maximum value of the index is used to indicate the optimal number of clusters.

  • tau

    Tau = \frac{s(+) - s(-)}{((\frac{N_t(N_t-1)}{2}-t)\frac{N_t(N_t-1)}{2})^{0.5}}


    Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities. s(+) is the number of concordant comparisons and s(-) is the number of discordant comparisons. A comparison is named concordant (resp. discordant) if a within-cluster dissimilarity is strictly less (resp. strictly greater) than a between-cluster dissimilarity.
    N_t is the total number of distances \frac{n(n-1)}{2} and t is the number of comparisons of two pairs of objects where both pairs represent within-cluster comparisons or both pairs are between-cluster comparisons.
    The maximum value of the index is used to indicate the optimal number of clusters.
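
As a rough illustration of the definitions above, several of the indices can be computed in base R from a dissimilarity matrix and a vector of cluster labels. This is a sketch on made-up toy data, not the implementation used by validation_kproto: it uses plain Euclidean distances from dist() instead of the mixed-type k-prototypes distance, and all variable names are chosen for this example only.

```r
# Toy data: 6 points on a line, two clusters (labels are illustrative)
pts <- c(0, 1, 2, 10, 11, 12)
cl  <- c(1, 1, 1, 2, 2, 2)

D    <- as.matrix(dist(pts))       # any dissimilarity matrix would do here
up   <- upper.tri(D)               # count each pair of objects once
same <- outer(cl, cl, "==")        # TRUE where both objects share a cluster

d_all     <- D[up]
d_within  <- D[up & same]
d_between <- D[up & !same]
n_w       <- length(d_within)      # number of within-cluster pairs

# cindex: S_w rescaled between the n_w smallest and n_w largest distances
S_w    <- sum(d_within)
S_min  <- sum(sort(d_all)[seq_len(n_w)])
S_max  <- sum(sort(d_all, decreasing = TRUE)[seq_len(n_w)])
cindex <- (S_w - S_min) / (S_max - S_min)

# mcclain: mean within-cluster over mean between-cluster distance
mcclain <- mean(d_within) / mean(d_between)

# dunn: smallest between-cluster distance over largest cluster diameter
dunn <- min(d_between) / max(d_within)

# gamma and gplus: compare every within distance with every between distance
cmp       <- outer(d_within, d_between, "-")
s_plus    <- sum(cmp < 0)          # concordant: within < between
s_minus   <- sum(cmp > 0)          # discordant: within > between
gamma_idx <- (s_plus - s_minus) / (s_plus + s_minus)
N_t       <- length(d_all)         # n(n-1)/2 distances in total
gplus     <- 2 * s_minus / (N_t * (N_t - 1))

# silhouette: average of (b(i) - a(i)) / max(a(i), b(i)) over all objects
sil_i <- sapply(seq_along(cl), function(i) {
  own    <- cl == cl[i]
  own[i] <- FALSE                  # exclude the object itself
  a <- mean(D[i, own])
  b <- min(tapply(D[i, cl != cl[i]], cl[cl != cl[i]], mean))
  (b - a) / max(a, b)
})
silhouette_idx <- mean(sil_i)
```

Because the two toy clusters are perfectly separated, cindex and gplus reach their best value 0 and gamma_idx reaches its best value 1, which matches the minimum/maximum rules stated for each index.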

Value

When computing the optimal number of clusters based on the chosen validation index for k-prototypes clustering, the output contains:

k_opt

optimal number of clusters (sampled in case of ambiguity)

index_opt

index value of the index optimal clustering

indices

calculated indices for k=2,...,k_{max}

kp_obj

if kp_obj == "optimal", the kproto object of the index-optimal clustering; if kp_obj == "all", all kproto objects that were calculated

When computing the index value for a given k-prototypes clustering, the output contains:

index

calculated index-value
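
How k_opt and index_opt relate to indices can be sketched as follows. The helper pick_k_opt is hypothetical and not part of clustMixType; it assumes the index values are stored in a vector named by k, applies the minimum/maximum rules listed under Details, and draws at random among tied candidates. The index values are made up for illustration.

```r
# Hypothetical helper (not part of clustMixType): choose k from a vector of
# index values named by k, following the min/max rules listed under Details.
pick_k_opt <- function(indices, method) {
  minimise   <- method %in% c("cindex", "gplus", "mcclain")
  best       <- if (minimise) min(indices) else max(indices)
  candidates <- as.integer(names(indices)[indices == best])
  # Ties: draw one candidate at random. Indexing with sample.int() avoids
  # sample()'s special behaviour when given a single number.
  k_opt <- candidates[sample.int(length(candidates), 1)]
  list(k_opt = k_opt, index_opt = best, indices = indices)
}

idx <- c("2" = 0.41, "3" = 0.58, "4" = 0.52)    # made-up index values
res_max <- pick_k_opt(idx, method = "silhouette")  # silhouette is maximised
res_min <- pick_k_opt(idx, method = "cindex")      # cindex is minimised
```

With these values the maximising rule selects k = 3 and the minimising rule selects k = 2.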

Author(s)

Rabea Aschenbruck

Examples

# load the package that provides kproto() and validation_kproto()
library(clustMixType)

# generate toy data with factors and numerics
n   <- 10
prb <- 0.99
muk <- 2.5 

x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1,x2,x3,x4)


# calculate the optimal number of clusters, the index values and the cluster partition with the silhouette index
val <- validation_kproto(method = "silhouette", data = x, k = 3:5, nstart = 5)


# apply k-prototypes
kpres <- kproto(x, 4, keep.data = TRUE)

# calculate the cindex value for the given cluster partition
cindex_value <- validation_kproto(method = "cindex", object = kpres)

clustMixType

k-Prototypes Clustering for Mixed Variable-Type Data

v0.2-11
GPL (>= 2)
Authors
Gero Szepannek [aut, cre], Rabea Aschenbruck [aut]
Initial release
2021-03-09
