Bootstrap Resampling
Generate R
bootstrap replicates of a statistic applied to data. Both
parametric and nonparametric resampling are possible. For the nonparametric
bootstrap, possible resampling methods are the ordinary bootstrap, the
balanced bootstrap, antithetic resampling, and permutation.
For nonparametric multi-sample problems stratified resampling is used:
this is specified by including a vector of strata in the call to boot.
Importance resampling weights may be specified.
boot(data, statistic, R, sim = "ordinary", stype = c("i", "f", "w"), strata = rep(1,n), L = NULL, m = 0, weights = NULL, ran.gen = function(d, p) d, mle = NULL, simple = FALSE, ..., parallel = c("no", "multicore", "snow"), ncpus = getOption("boot.ncpus", 1L), cl = NULL)
data |
The data as a vector, matrix or data frame. If it is a matrix or data frame then each row is considered as one multivariate observation. |
statistic |
A function which when applied to data returns a vector containing
the statistic(s) of interest. When |
R |
The number of bootstrap replicates. Usually this will be a single
positive integer. For importance resampling, some resamples may use
one set of weights and others use a different set of weights. In
this case |
sim |
A character string indicating the type of simulation required.
Possible values are |
stype |
A character string indicating what the second argument of |
strata |
An integer vector or factor specifying the strata for multi-sample
problems. This may be specified for any simulation, but is ignored
when |
L |
Vector of influence values evaluated at the observations. This is
used only when |
m |
The number of predictions which are to be made at each bootstrap
replicate. This is most useful for (generalized) linear models.
This can only be used when |
weights |
Vector or matrix of importance weights. If a vector then it should
have as many elements as there are observations in |
ran.gen |
This function is used only when |
mle |
The second argument to be passed to |
simple |
logical, only allowed to be |
... |
Other named arguments for |
parallel |
The type of parallel operation to be used (if any). If missing, the
default is taken from the option |
ncpus |
integer: number of processes to be used in parallel operation: typically one would chose this to the number of available CPUs. |
cl |
An optional parallel or snow cluster for use if
|
The statistic to be bootstrapped can be as simple or complicated as
desired as long as its arguments correspond to the dataset and (for
a nonparametric bootstrap) a vector of indices, frequencies or
weights. statistic
is treated as a black box by the
boot
function and is not checked to ensure that these
conditions are met.
The first order balanced bootstrap is described in Davison, Hinkley and Schechtman (1986). The antithetic bootstrap is described by Hall (1989) and is experimental, particularly when used with strata. The other non-parametric simulation types are the ordinary bootstrap (possibly with unequal probabilities), and permutation which returns random permutations of cases. All of these methods work independently within strata if that argument is supplied.
For the parametric bootstrap it is necessary for the user to specify
how the resampling is to be conducted. The best way of
accomplishing this is to specify the function ran.gen
which
will return a simulated data set from the observed data set and a
set of parameter estimates specified in mle
.
The returned value is an object of class "boot"
, containing the
following components:
t0 |
The observed value of |
t |
A matrix with |
R |
The value of |
data |
The |
seed |
The value of |
statistic |
The function |
sim |
Simulation type used. |
stype |
Statistic type as passed to |
call |
The original call to |
strata |
The strata used. This is the vector passed to |
weights |
The importance sampling weights as passed to |
pred.i |
If predictions are required ( |
L |
The influence values used when |
ran.gen |
The random generator function used if |
mle |
The parameter estimates passed to |
There are c
, plot
and print
methods for this class.
When parallel = "multicore"
is used (not available on Windows),
each worker process inherits the environment of the current session,
including the workspace and the loaded namespaces and attached
packages (but not the random number seed: see below).
More work is needed when parallel = "snow"
is used: the worker
processes are newly created R processes, and statistic
needs
to arrange to set up the environment it needs: often a good way to do
that is to make use of lexical scoping since when statistic
is
sent to the worker processes its enclosing environment is also sent.
(E.g. see the example for jack.after.boot
where
ancillary functions are nested inside the statistic
function.)
parallel = "snow"
is primarily intended to be used on
multi-core Windows machine where parallel = "multicore"
is not
available.
For most of the boot
methods the resampling is done in the
master process, but not if simple = TRUE
nor sim =
"parametric"
. In those cases (or where statistic
itself uses
random numbers), more care is needed if the results need to be
reproducible. Resampling is done in the worker processes by
censboot(sim = "wierd")
and by most of the schemes in
tsboot
(the exceptions being sim == "fixed"
and
sim == "geom"
with the default ran.gen
).
Where random-number generation is done in the worker processes, the
default behaviour is that each worker chooses a separate seed,
non-reproducibly. However, with parallel = "multicore"
or
parallel = "snow"
using the default cluster, a second approach
is used if RNGkind("L'Ecuyer-CMRG")
has been selected.
In that approach each worker gets a different subsequence of the RNG
stream based on the seed at the time the worker is spawned and so the
results will be reproducible if ncpus
is unchanged, and for
parallel = "multicore"
if parallel::mc.reset.stream()
is
called: see the examples for mclapply
.
Note that loading the parallel namespace may change the random seed, so for maximum reproducibility this should be done before calling this function.
There are many references explaining the bootstrap and its variations. Among them are :
Booth, J.G., Hall, P. and Wood, A.T.A. (1993) Balanced importance resampling for the bootstrap. Annals of Statistics, 21, 286–298.
Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and Their Application. Cambridge University Press.
Davison, A.C., Hinkley, D.V. and Schechtman, E. (1986) Efficient bootstrap simulation. Biometrika, 73, 555–566.
Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap. Chapman & Hall.
Gleason, J.R. (1988) Algorithms for balanced bootstrap simulations. American Statistician, 42, 263–266.
Hall, P. (1989) Antithetic resampling for the bootstrap. Biometrika, 73, 713–724.
Hinkley, D.V. (1988) Bootstrap methods (with Discussion). Journal of the Royal Statistical Society, B, 50, 312–337, 355–370.
Hinkley, D.V. and Shi, S. (1989) Importance sampling and the nested bootstrap. Biometrika, 76, 435–446.
Johns M.V. (1988) Importance sampling for bootstrap confidence intervals. Journal of the American Statistical Association, 83, 709–714.
Noreen, E.W. (1989) Computer Intensive Methods for Testing Hypotheses. John Wiley & Sons.
# Usual bootstrap of the ratio of means using the city data ratio <- function(d, w) sum(d$x * w)/sum(d$u * w) boot(city, ratio, R = 999, stype = "w") # Stratified resampling for the difference of means. In this # example we will look at the difference of means between the final # two series in the gravity data. diff.means <- function(d, f) { n <- nrow(d) gp1 <- 1:table(as.numeric(d$series))[1] m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1]) m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1]) ss1 <- sum(d[gp1,1]^2 * f[gp1]) - (m1 * m1 * sum(f[gp1])) ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) - (m2 * m2 * sum(f[-gp1])) c(m1 - m2, (ss1 + ss2)/(sum(f) - 2)) } grav1 <- gravity[as.numeric(gravity[,2]) >= 7,] boot(grav1, diff.means, R = 999, stype = "f", strata = grav1[,2]) # In this example we show the use of boot in a prediction from # regression based on the nuclear data. This example is taken # from Example 6.8 of Davison and Hinkley (1997). Notice also # that two extra arguments to 'statistic' are passed through boot. nuke <- nuclear[, c(1, 2, 5, 7, 8, 10, 11)] nuke.lm <- glm(log(cost) ~ date+log(cap)+ne+ct+log(cum.n)+pt, data = nuke) nuke.diag <- glm.diag(nuke.lm) nuke.res <- nuke.diag$res * nuke.diag$sd nuke.res <- nuke.res - mean(nuke.res) # We set up a new data frame with the data, the standardized # residuals and the fitted values for use in the bootstrap. nuke.data <- data.frame(nuke, resid = nuke.res, fit = fitted(nuke.lm)) # Now we want a prediction of plant number 32 but at date 73.00 new.data <- data.frame(cost = 1, date = 73.00, cap = 886, ne = 0, ct = 0, cum.n = 11, pt = 1) new.fit <- predict(nuke.lm, new.data) nuke.fun <- function(dat, inds, i.pred, fit.pred, x.pred) { lm.b <- glm(fit+resid[inds] ~ date+log(cap)+ne+ct+log(cum.n)+pt, data = dat) pred.b <- predict(lm.b, x.pred) c(coef(lm.b), pred.b - (fit.pred + dat$resid[i.pred])) } nuke.boot <- boot(nuke.data, nuke.fun, R = 999, m = 1, fit.pred = new.fit, x.pred = new.data) # The bootstrap prediction squared error would then be found by mean(nuke.boot$t[, 8]^2) # Basic bootstrap prediction limits would be new.fit - sort(nuke.boot$t[, 8])[c(975, 25)] # Finally a parametric bootstrap. For this example we shall look # at the air-conditioning data. In this example our aim is to test # the hypothesis that the true value of the index is 1 (i.e. that # the data come from an exponential distribution) against the # alternative that the data come from a gamma distribution with # index not equal to 1. air.fun <- function(data) { ybar <- mean(data$hours) para <- c(log(ybar), mean(log(data$hours))) ll <- function(k) { if (k <= 0) 1e200 else lgamma(k)-k*(log(k)-1-para[1]+para[2]) } khat <- nlm(ll, ybar^2/var(data$hours))$estimate c(ybar, khat) } air.rg <- function(data, mle) { # Function to generate random exponential variates. # mle will contain the mean of the original data out <- data out$hours <- rexp(nrow(out), 1/mle) out } air.boot <- boot(aircondit, air.fun, R = 999, sim = "parametric", ran.gen = air.rg, mle = mean(aircondit$hours)) # The bootstrap p-value can then be approximated by sum(abs(air.boot$t[,2]-1) > abs(air.boot$t0[2]-1))/(1+air.boot$R)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.