find.cluster: cluster identification using successive K-means
These functions implement the clustering procedure used in Discriminant Analysis
of Principal Components (DAPC, Jombart et al. 2010). This procedure consists in
running successive K-means with an increasing number of clusters (k
),
after transforming data using a principal component analysis (PCA). For each
model, a statistical measure of goodness of fit (by default, BIC) is computed,
which allows to choose the optimal k
. See details
for a
description of how to select the optimal k
and
vignette("adegenet-dapc")
for a tutorial.
Optionally, hierarchical clustering can be sought by providing a prior
clustering of individuals (argument clust
). In such case, clusters will
be sought within each prior group.
find.clusters
is a generic function with methods for the
following types of objects:
## S3 method for class 'data.frame' find.clusters(x, clust = NULL, n.pca = NULL, n.clust = NULL, method = c("kmeans", "ward"), stat = c("BIC","AIC", "WSS"), choose.n.clust = TRUE, criterion = c("diffNgroup", "min","goesup", "smoothNgoesup", "goodfit"), max.n.clust = round(nrow(x)/10), n.iter = 1e5, n.start = 10, center = TRUE, scale = TRUE, pca.select = c("nbEig","percVar"), perc.pca = NULL, ..., dudi = NULL) ## S3 method for class 'matrix' find.clusters(x, ...) ## S3 method for class 'genind' find.clusters(x, clust = NULL, n.pca = NULL, n.clust = NULL, method = c("kmeans", "ward"), stat = c("BIC","AIC", "WSS"), choose.n.clust = TRUE, criterion = c("diffNgroup", "min","goesup", "smoothNgoesup", "goodfit"), max.n.clust = round(nrow(x@tab)/10), n.iter = 1e5, n.start = 10, scale = FALSE, truenames = TRUE, ...) ## S3 method for class 'genlight' find.clusters(x, clust = NULL, n.pca = NULL, n.clust = NULL, method = c("kmeans", "ward"), stat = c("BIC", "AIC", "WSS"), choose.n.clust = TRUE, criterion = c("diffNgroup", "min","goesup","smoothNgoesup", "goodfit"), max.n.clust = round(nInd(x)/10), n.iter = 1e5,n.start = 10, scale = FALSE, pca.select = c("nbEig","percVar"), perc.pca = NULL,glPca=NULL, ...)
x |
|
clust |
an optional |
n.pca |
an |
n.clust |
an optinal |
method |
a |
stat |
a |
choose.n.clust |
a |
criterion |
a |
max.n.clust |
an |
n.iter |
an |
n.start |
an |
center |
a |
scale |
a |
pca.select |
a |
perc.pca |
a |
truenames |
a |
... |
further arguments to be passed to other functions. For
|
dudi |
optionally, a multivariate analysis with the class |
glPca |
an optional |
=== ON THE SELECTION OF K ===
(where K is the 'optimal' number of clusters)
So far, the analysis of data simulated under various population genetics models (see reference) suggested an ad hoc rule for the selection of the optimal number of clusters. First important result is that BIC seems more efficient than AIC and WSS to select the appropriate number of clusters (see example). The rule of thumb consists in increasing K until it no longer leads to an appreciable improvement of fit (i.e., to a decrease of BIC). In the most simple models (island models), BIC decreases until it reaches the optimal K, and then increases. In these cases, our rule amounts to choosing the lowest K. In other models such as stepping stones, the decrease of BIC often continues after the optimal K, but is much less steep.
An alternative approach is the automatic selection based on a fixed
criterion. Note that, in any case, it is highly recommended to look at
the graph of the BIC for different numbers of clusters as displayed
during the interactive cluster selection.
To use automated selection, set choose.n.clust
to FALSE and specify
the criterion
you want to use, from the following values:
- "diffNgroup": differences between successive values of the summary
statistics (by default, BIC) are splitted into two groups using a
Ward's clustering method (see ?hclust
), to differentiate sharp
decrease from mild decreases or increases. The retained K is the one
before the first group switch. Appears to work well for
island/hierarchical models, and decently for isolation by distance
models, albeit with some unstability. Can be impacted by an initial,
very sharp decrease of the test statistics. IF UNSURE ABOUT THE
CRITERION TO USE, USE THIS ONE.
- "min": the model with the minimum summary statistics (as specified
by stat
argument, BIC by default) is retained. Is likely to
work for simple island model, using BIC. It is likely to fail in
models relating to stepping stones, where the BIC always decreases
(albeit by a small amount) as K increases. In general, this approach
tends to over-estimate the number of clusters.
- "goesup": the selected model is the K after which increasing the number of clusters leads to increasing the summary statistics. Suffers from inaccuracy, since i) a steep decrease might follow a small 'bump' of increase of the statistics, and ii) increase might never happen, or happen after negligible decreases. Is likely to work only for clear-cut island models.
- "smoothNgoesup": a variant of "goesup", in which the summary statistics is first smoothed using a lowess approach. Is meant to be more accurate than "goesup" as it is less prone to stopping to small 'bumps' in the decrease of the statistics.
- "goodfit": another criterion seeking a good fit with a minimum number of clusters. This approach does not rely on differences between successive statistics, but on absolute fit. It selects the model with the smallest K so that the overall fit is above a given threshold.
The class find.clusters
is a list with the following
components:
Kstat |
a |
stat |
a |
grp |
a |
size |
an |
Thibaut Jombart t.jombart@imperial.ac.uk
Jombart T, Devillard S and Balloux F (2010) Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics 11:94. doi:10.1186/1471-2156-11-94
- dapc
: implements the DAPC.
- scatter.dapc
: graphics for DAPC.
- dapcIllus
: dataset illustrating the DAPC and find.clusters
.
- eHGDP
: dataset illustrating the DAPC and find.clusters
.
- kmeans
: implementation of K-means in the stat package.
- dudi.pca
: implementation of PCA in the ade4 package.
## Not run: ## THIS ONE TAKES A FEW MINUTES TO RUN ## data(eHGDP) ## here, n.clust is specified, so that only on K value is used grp <- find.clusters(eHGDP, max.n=30, n.pca=200, scale=FALSE, n.clust=4) # takes about 2 minutes names(grp) grp$Kstat grp$stat ## to try different values of k (interactive) grp <- find.clusters(eHGDP, max.n=50, n.pca=200, scale=FALSE) ## and then, to plot BIC values: plot(grp$Kstat, type="b", col="blue") ## ANOTHER SIMPLE EXAMPLE ## data(sim2pop) # this actually contains 2 pop ## DETECTION WITH BIC (clear result) foo.BIC <- find.clusters(sim2pop, n.pca=100, choose=FALSE) plot(foo.BIC$Kstat, type="o", xlab="number of clusters (K)", ylab="BIC", col="blue", main="Detection based on BIC") points(2, foo.BIC$Kstat[2], pch="x", cex=3) mtext(3, tex="'X' indicates the actual number of clusters") ## DETECTION WITH AIC (less clear-cut) foo.AIC <- find.clusters(sim2pop, n.pca=100, choose=FALSE, stat="AIC") plot(foo.AIC$Kstat, type="o", xlab="number of clusters (K)", ylab="AIC", col="purple", main="Detection based on AIC") points(2, foo.AIC$Kstat[2], pch="x", cex=3) mtext(3, tex="'X' indicates the actual number of clusters") ## DETECTION WITH WSS (less clear-cut) foo.WSS <- find.clusters(sim2pop, n.pca=100, choose=FALSE, stat="WSS") plot(foo.WSS$Kstat, type="o", xlab="number of clusters (K)", ylab="WSS (residual variance)", col="red", main="Detection based on WSS") points(2, foo.WSS$Kstat[2], pch="x", cex=3) mtext(3, tex="'X' indicates the actual number of clusters") ## TOY EXAMPLE FOR GENLIGHT OBJECTS ## x <- glSim(100,500,500) x plot(x) grp <- find.clusters(x, n.pca = 100, choose = FALSE, stat = "BIC") plot(grp$Kstat, type = "o", xlab = "number of clusters (K)", ylab = "BIC", main = "find.clusters on a genlight object\n(two groups)") ## End(Not run)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.