caret: maxDissim – R documentation

Maximum Dissimilarity Sampling

Description

Functions to create a sub-sample by maximizing the dissimilarity between new samples and the existing subset.

Usage

maxDissim(
  a,
  b,
  n = 2,
  obj = minDiss,
  useNames = FALSE,
  randomFrac = 1,
  verbose = FALSE,
  ...
)

minDiss(u)

sumDiss(u)

Arguments

`a`	a matrix or data frame of samples to start
`b`	a matrix or data frame of samples to sample from
`n`	the size of the sub-sample
`obj`	an objective function to measure overall dissimilarity
`useNames`	a logical: should the function return the row names (as opposed to the row index)?
`randomFrac`	a number in (0, 1] that can be used to sub-sample from the remaining candidate values
`verbose`	a logical; should each step be printed?
`...`	optional arguments to pass to dist
`u`	a vector of dissimilarities

Details

Given an initial set of m samples (the rows of a) and a larger pool of n samples (the rows of b; here n is the pool size, not the argument n), this function iteratively adds points to the smaller set by finding which of the n samples is most dissimilar to the set selected so far. The argument obj measures the overall dissimilarity between that set and a candidate point. For example, maximizing the minimum or the sum of the m dissimilarities are two common approaches.

This algorithm tends to select points on the edge of the data mainstream and will reliably select outliers. To select more samples towards the interior of the data set, set randomFrac to be small (see the examples below).
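The selection loop described above can be sketched in base R. This is only an illustration under stated assumptions, not the caret implementation: the helper name greedy_max_dissim is hypothetical, Euclidean distance is assumed, and the randomFrac sub-sampling step is left out.

```r
# Hypothetical helper, not part of caret: a base-R sketch of greedy
# maximum dissimilarity selection, assuming Euclidean distance.
greedy_max_dissim <- function(a, b, n = 2, obj = min) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  stopifnot(ncol(a) == ncol(b), n <= nrow(b))

  free <- seq_len(nrow(b))   # candidate rows of b not yet selected
  chosen <- integer(0)       # selected rows of b, in order of selection
  current <- a               # the growing set: a plus the selected rows

  for (i in seq_len(n)) {
    # Score each candidate by applying obj to its distances to the current set
    scores <- vapply(free, function(j) {
      d <- sqrt(colSums((t(current) - b[j, ])^2))
      obj(d)
    }, numeric(1))

    best <- free[which.max(scores)]
    chosen <- c(chosen, best)
    current <- rbind(current, b[best, , drop = FALSE])
    free <- setdiff(free, best)
  }
  chosen
}

# One starting point at the origin and four candidates
a <- matrix(c(0, 0), nrow = 1)
b <- rbind(c(1, 0), c(5, 0), c(0, 3), c(-4, 0))
res <- greedy_max_dissim(a, b, n = 2)
print(res)   # rows 2 and 4 of b, in that order
```

Row 2 is picked first because it is farthest from the origin. Row 4 follows because its smallest distance to the two points now in the set (4) exceeds that of rows 1 and 3 (1 and 3). Passing obj = sum gives the same answer for this toy data, although the two objectives generally differ.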

Value

a vector of integers or row names (depending on useNames) corresponding to the rows of b that comprise the sub-sample.
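As an illustration of the two return forms, the sketch below calls maxDissim with and without useNames. It assumes the caret package is installed, and it also checks for the proxy package on the assumption that the distance computation relies on it; the object names pool, base and candidates are made up for this example.

```r
# Runs only if caret (and, by assumption, proxy) is available.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("proxy", quietly = TRUE)) {
  set.seed(1)
  pool <- matrix(rnorm(40), ncol = 2,
                 dimnames = list(paste0("s", 1:20), c("x1", "x2")))
  base <- pool[1:3, ]            # samples to start from
  candidates <- pool[-(1:3), ]   # samples to choose from

  idx <- caret::maxDissim(base, candidates, n = 4)
  nms <- caret::maxDissim(base, candidates, n = 4, useNames = TRUE)

  print(idx)   # row indices into candidates
  print(nms)   # row names of candidates
}
```

With the default randomFrac = 1 no random sub-sampling takes place, so rownames(candidates)[idx] should match nms.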

Author(s)

Max Kuhn max.kuhn@pfizer.com

References

Willett, P. (1999), "Dissimilarity-Based Algorithms for Selecting Structurally Diverse Sets of Compounds," Journal of Computational Biology, 6, 447-457.

Examples

library(caret)

example <- function(pct = 1, obj = minDiss, ...)
{
  tmp <- matrix(rnorm(200 * 2), nrow = 200)

  ## start with 15 data points
  start <- sample(seq_len(nrow(tmp)), 15)
  base <- tmp[start, ]
  pool <- tmp[-start, ]

  ## select 9 for addition; newSamp holds row indices into pool
  newSamp <- maxDissim(
                       base, pool,
                       n = 9,
                       randomFrac = pct, obj = obj, ...)

  ## map the pool indices back to rows of tmp
  poolRows <- seq_len(nrow(tmp))[-start]
  allSamp <- c(start, poolRows[newSamp])

  ## unselected points in grey, starting points as filled dots
  plot(
       tmp[-allSamp, ],
       xlim = extendrange(tmp[, 1]), ylim = extendrange(tmp[, 2]),
       col = "darkgrey",
       xlab = "variable 1", ylab = "variable 2")
  points(base, pch = 16, cex = .7)

  ## label each added point with the order in which it was selected
  for(i in seq_along(newSamp))
    points(
           pool[newSamp[i], 1],
           pool[newSamp[i], 2],
           pch = paste(i), col = "darkred")
}

par(mfrow=c(2,2))

set.seed(414)
example(1, minDiss)
title("No Random Sampling, Min Score")

set.seed(414)
example(.1, minDiss)
title("10 Pct Random Sampling, Min Score")

set.seed(414)
example(1, sumDiss)
title("No Random Sampling, Sum Score")

set.seed(414)
example(.1, sumDiss)
title("10 Pct Random Sampling, Sum Score")

caret

Classification and Regression Training

v6.0-86

GPL (>= 2)

Authors

Max Kuhn [aut, cre], Jed Wing [ctb], Steve Weston [ctb], Andre Williams [ctb], Chris Keefer [ctb], Allan Engelhardt [ctb], Tony Cooper [ctb], Zachary Mayer [ctb], Brenton Kenkel [ctb], R Core Team [ctb], Michael Benesty [ctb], Reynald Lescarbeau [ctb], Andrew Ziem [ctb], Luca Scrucca [ctb], Yuan Tang [ctb], Can Candan [ctb], Tyler Hunt [ctb]
