purity

Purity and Entropy of a Clustering


Description

The functions purity and entropy respectively compute the purity and the entropy of a clustering given a priori known classes.

Purity and entropy measure the ability of a clustering method to recover known classes (e.g. when the true class label of each sample is known). Both are applicable even when the number of clusters differs from the number of known classes. Kim and Park (2007) used these measures to evaluate the performance of their alternating least-squares NMF algorithm.

Usage

purity(x, y, ...)

entropy(x, y, ...)

## S4 method for signature 'NMFfitXn,ANY'
purity(x, y, method = "best", ...)

## S4 method for signature 'NMFfitXn,ANY'
entropy(x, y, method = "best", ...)

Arguments

x

an object that can be interpreted as a factor or can generate such an object, e.g. via a suitable method predict, which gives the cluster membership for each sample.

y

a factor or an object coerced into a factor that gives the true class labels for each sample. It may be missing if x is a contingency table.

...

extra arguments to allow extension, and usually passed to the next method.

method

a character string that specifies how the value is computed. It may be either 'best' or 'mean', to compute respectively the best or the mean purity (or entropy) across the fits stored in x.

Details

Suppose we are given l categories, while the clustering method generates k clusters.

The purity of the clustering with respect to the known categories is given by:

Purity = \frac{1}{n} \sum_{q=1}^{k} \max_{1 \leq j \leq l} n_q^j,

where:

  • n is the total number of samples;

  • n_q^j is the number of samples in cluster q that belong to original class j (1 ≤ j ≤ l).

The purity is therefore a real number in [0,1]. The larger the purity, the better the clustering performance.
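
The purity formula above can be checked by hand on a small contingency table. The following is a minimal base-R sketch, not the package's implementation; the helper name purity_from_table and the toy labels are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data: 8 samples, 2 clusters, 3 known classes
clusters <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))
classes  <- factor(c("a", "a", "a", "b", "b", "b", "c", "c"))

# Contingency table: rows = clusters (q), columns = known classes (j)
tab <- table(clusters, classes)

# Purity: for each cluster take the largest class count,
# sum these maxima over clusters, and divide by the number of samples
purity_from_table <- function(tab) {
  sum(apply(tab, 1, max)) / sum(tab)
}

purity_from_table(tab)  # (3 + 2) / 8 = 0.625
```

Here cluster 1 contains three samples of class "a" and cluster 2 is split evenly between "b" and "c", so the per-cluster maxima are 3 and 2.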

The entropy of the clustering with respect to the known categories is given by:

Entropy = - \frac{1}{n \log_2 l} \sum_{q=1}^{k} \sum_{j=1}^{l} n_q^j \log_2 \frac{n_q^j}{n_q},

where:

  • n is the total number of samples;

  • n_q is the total number of samples in cluster q (1 ≤ q ≤ k);

  • n_q^j is the number of samples in cluster q that belong to original class j (1 ≤ j ≤ l).

The smaller the entropy, the better the clustering performance.
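
The entropy formula above can be sketched in base R in the same way. This is an illustrative sketch rather than the package's implementation: the helper name entropy_from_table and the toy labels are hypothetical, and the sketch assumes the usual convention that empty cells (a class absent from a cluster) contribute 0 to the sum.

```r
# Hypothetical toy data: 8 samples, 2 clusters, 3 known classes
clusters <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))
classes  <- factor(c("a", "a", "a", "b", "b", "b", "c", "c"))

# Contingency table: rows = clusters (q), columns = known classes (j)
tab <- table(clusters, classes)

entropy_from_table <- function(tab) {
  counts <- unclass(tab)          # plain integer matrix of cell counts
  n <- sum(counts)                # total number of samples
  l <- ncol(counts)               # number of known classes
  p <- counts / rowSums(counts)   # each cell divided by its cluster size
  # Assumed convention: cells with a zero count contribute 0
  terms <- ifelse(counts > 0, counts * log2(p), 0)
  -sum(terms) / (n * log2(l))
}

entropy_from_table(tab)
```

With this normalisation a clustering whose clusters each contain a single class has entropy 0, which can be checked with entropy_from_table(table(classes, classes)).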

Value

purity: the purity (i.e. a single numeric value)

entropy: the entropy (i.e. a single numeric value)

Methods

entropy

signature(x = "table", y = "missing"): Computes the entropy directly from the contingency table x.

This is the workhorse method that is eventually called by all other methods.

entropy

signature(x = "factor", y = "ANY"): Computes the entropy on the contingency table of x and y, where y is coerced into a factor if necessary.

entropy

signature(x = "ANY", y = "ANY"): Default method that should work for results of clustering algorithms that have a suitable predict method returning the cluster membership vector: the entropy is computed between predict(x) and y.

entropy

signature(x = "NMFfitXn", y = "ANY"): Computes the best or mean entropy across all NMF fits stored in x.

purity

signature(x = "table", y = "missing"): Computes the purity directly from the contingency table x.

purity

signature(x = "factor", y = "ANY"): Computes the purity on the contingency table of x and y, where y is coerced into a factor if necessary.

purity

signature(x = "ANY", y = "ANY"): Default method that should work for results of clustering algorithms that have a suitable predict method returning the cluster membership vector: the purity is computed between predict(x) and y.

purity

signature(x = "NMFfitXn", y = "ANY"): Computes the best or mean purity across all NMF fits stored in x.

References

Kim H and Park H (2007). "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis." Bioinformatics (Oxford, England), 23(12), pp. 1495-502. ISSN 1460-2059, <URL: http://dx.doi.org/10.1093/bioinformatics/btm134>, <URL: http://www.ncbi.nlm.nih.gov/pubmed/17483501>.

See Also

Other assess: sparseness

Examples

library(NMF)

# generate a synthetic dataset with known classes: 50 features, 18 samples (5+5+8)
n <- 50
counts <- c(5, 5, 8)
V <- syntheticNMF(n, counts)
cl <- unlist(mapply(rep, 1:3, counts))

# perform default NMF with rank=2
x2 <- nmf(V, 2)
purity(x2, cl)
entropy(x2, cl)

# perform default NMF with rank=3
x3 <- nmf(V, 3)
purity(x3, cl)
entropy(x3, cl)

NMF

Algorithms and Framework for Nonnegative Matrix Factorization (NMF)

v0.23.0
GPL (>= 2)
Authors
Renaud Gaujoux, Cathal Seoighe
Initial release
2020-07-30
