DBSCAN density reachability and connectivity clustering
Generates a density based clustering of arbitrary shape as introduced in Ester et al. (1996).
dbscan(data, eps, MinPts = 5, scale = FALSE, method = c("hybrid", "raw", "dist"), seeds = TRUE, showplot = FALSE, countmode = NULL) ## S3 method for class 'dbscan' print(x, ...) ## S3 method for class 'dbscan' plot(x, data, ...) ## S3 method for class 'dbscan' predict(object, data, newdata = NULL, predict.max=1000, ...)
data |
data matrix, data.frame, dissimilarity matrix or
|
eps |
Reachability distance, see Ester et al. (1996). |
MinPts |
Reachability minimum no. of points, see Ester et al. (1996). |
scale |
scale the data if |
method |
"dist" treats data as distance matrix (relatively fast but memory expensive), "raw" treats data as raw data and avoids calculating a distance matrix (saves memory but may be slow), "hybrid" expects also raw data, but calculates partial distance matrices (very fast with moderate memory requirements). |
seeds |
FALSE to not include the |
showplot |
0 = no plot, 1 = plot per iteration, 2 = plot per subiteration. |
countmode |
NULL or vector of point numbers at which to report progress. |
x |
object of class |
object |
object of class |
newdata |
matrix or data.frame with raw data to predict. |
predict.max |
max. batch size for predictions. |
... |
Further arguments transferred to plot methods. |
Clusters require a minimum no of points (MinPts) within a maximum distance (eps) around one of its members (the seed). Any point within eps around any point which satisfies the seed condition is a cluster member (recursively). Some points may not belong to any clusters (noise).
We have clustered a 100.000 x 2 dataset in 40 minutes on a Pentium M 1600 MHz.
print.dbscan
shows a statistic of the number of points
belonging to the clusters that are seeds and border points.
plot.dbscan
distinguishes between seed and border points by
plot symbol.
predict.dbscan
gives out a vector of predicted clusters for the
points in newdata
.
dbscan
gives out
an object of class 'dbscan' which is a LIST with components
cluster |
integer vector coding cluster membership with noise observations (singletons) coded as 0 |
isseed |
logical vector indicating whether a point is a seed (not border, not noise) |
eps |
parameter eps |
MinPts |
parameter MinPts |
this is a simplified version of the original algorithm (no K-D-trees used), thus we have o(n^2) instead of o(n*log(n))
Jens Oehlschlaegel, based on a draft by Christian Hennig.
Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
set.seed(665544) n <- 600 x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n, sd=0.2)) par(bg="grey40") ds <- dbscan(x, 0.2) # run with showplot=1 to see how dbscan works. ds plot(ds, x) x2 <- matrix(0,nrow=4,ncol=2) x2[1,] <- c(5,2) x2[2,] <- c(8,3) x2[3,] <- c(4,4) x2[4,] <- c(9,9) predict(ds, x, x2) n <- 600 x <- cbind((1:3)+rnorm(n, sd=0.2), (1:3)+rnorm(n, sd=0.2)) # Not run, but results from my machine are 0.105 - 0.068 - 0.255: # system.time(ds <- dbscan(x, 0.3, countmode=NULL, method="raw"))[3] # system.time(dsb <- dbscan(x, 0.3, countmode=NULL, method="hybrid"))[3] # system.time(dsc <- dbscan(dist(x), 0.3, countmode=NULL, # method="dist"))[3]
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.