recipes: step_sample – R documentation


step_sample

Sample rows using dplyr

Description

step_sample creates a specification of a recipe step that will sample rows using dplyr::sample_n() or dplyr::sample_frac().

Usage

step_sample(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  size = NULL,
  replace = FALSE,
  skip = TRUE,
  id = rand_id("sample")
)

## S3 method for class 'step_sample'
tidy(x, ...)

Arguments

`recipe`	A recipe object. The step will be added to the sequence of operations for this recipe.
`...`	Argument ignored; included for consistency with other step specification functions. For the `tidy` method, these are not currently used.
`role`	Not used by this step since no new variables are created.
`trained`	A logical to indicate if the quantities for preprocessing have been estimated.
`size`	An integer or fraction. If the value is within (0, 1), `dplyr::sample_frac()` is applied to the data. If an integer value of 1 or greater is used, `dplyr::sample_n()` is applied. The default of `NULL` uses `dplyr::sample_n()` with the size of the training set (or the number of rows in `new_data`, if that is smaller).
`replace`	A logical. Should rows be sampled with replacement? The default, `FALSE`, samples without replacement.
`skip`	A logical. Should the step be skipped when the recipe is baked by `bake.recipe()`? While all operations are baked when `prep.recipe()` is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using `skip = FALSE`.
`id`	A character string that is unique to this step to identify it.
`x`	A `step_sample` object.
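
The rule for `size` described above can be sketched with dplyr alone. The helper `sample_rows()` below is hypothetical and is not part of recipes; it only mirrors the documented choice between `dplyr::sample_frac()` and `dplyr::sample_n()`, and the step's real internals may differ.

```r
library(dplyr)

# Hypothetical helper (not a recipes function): mimics the documented
# dispatch on `size`, assuming `size` is NULL or a single positive number.
sample_rows <- function(data, size = NULL, replace = FALSE) {
  if (is.null(size)) {
    size <- nrow(data)  # documented default: the size of the data set
  }
  if (size > 0 && size < 1) {
    # Values within (0, 1) are treated as a fraction of the rows
    dplyr::sample_frac(data, size = size, replace = replace)
  } else {
    # Values of 1 or greater are treated as a row count
    dplyr::sample_n(data, size = size, replace = replace)
  }
}

set.seed(1)
n_count    <- nrow(sample_rows(mtcars, size = 5))                   # 5 rows
n_fraction <- nrow(sample_rows(mtcars, size = 0.5))                 # half of 32 rows
n_boot     <- nrow(sample_rows(mtcars, size = 40, replace = TRUE))  # 40 rows
```

Asking for more rows than the data contains, as in the last call, only works when `replace = TRUE`.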

Value

An updated version of recipe with the new step added to the sequence of existing steps (if any). For the tidy method, a tibble with columns size, replace, and id.
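
As a minimal sketch of the `tidy` method, the following adds the step to a recipe and inspects it. The `id` value `"sample_demo"` is an arbitrary label chosen for this illustration.

```r
library(recipes)

# Add the sampling step with an explicit size and id
rec <- recipe(~ ., data = mtcars) %>%
  step_sample(size = 10, replace = TRUE, id = "sample_demo")

# `number = 1` selects the first (and only) step in the recipe
step_info <- tidy(rec, number = 1)
step_info
```

The result should be a one-row tibble holding the step's `size`, `replace`, and `id` values.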

Row Filtering

This step can entirely remove observations (rows of data), which can have unintended and/or problematic consequences when applying the step to new data later via bake.recipe(). Consider whether skip = TRUE or skip = FALSE is more appropriate in any given use case. In most instances that affect the rows of the data being predicted, this step probably should not be applied at all; instead, execute operations like this outside and before starting a preprocessing recipe().
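
One way to follow that advice is to sample the rows first and build the recipe on the result. This is only a sketch: the object names, the seed, the sample size of 20, and the use of `step_center()` as a stand-in preprocessing step are assumptions made for the illustration.

```r
library(recipes)
library(dplyr)

set.seed(123)

# Sample rows before any recipe is defined
train_sample <- mtcars %>% sample_n(size = 20)

# The recipe itself contains no row-removing steps
rec <- recipe(mpg ~ ., data = train_sample) %>%
  step_center(all_predictors()) %>%
  prep(training = train_sample)

# Every row of the new data is preserved when baking
baked <- bake(rec, new_data = mtcars)
nrow(baked)
```

Because the sampling happens outside the recipe, `bake()` never changes the number of rows in the data being predicted.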

Examples

library(recipes)
library(dplyr)  # for slice(), used in the last example

# Uses `sample_n`
recipe( ~ ., data = mtcars) %>%
  step_sample(size = 1) %>%
  prep(training = mtcars) %>%
  bake(new_data = NULL) %>%
  nrow()

# Uses `sample_frac`
recipe( ~ ., data = mtcars) %>%
  step_sample(size = 0.9999) %>%
  prep(training = mtcars) %>%
  bake(new_data = NULL) %>%
  nrow()

# Uses `sample_n` and returns _at maximum_ 20 samples.
smaller_cars <-
  recipe( ~ ., data = mtcars) %>%
  step_sample() %>%
  prep(training = mtcars %>% slice(1:20))

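# Returns the retained, sampled training set (20 rows)
bake(smaller_cars, new_data = NULL) %>% nrow()
# With the default `skip = TRUE`, the step is not applied to new data
bake(smaller_cars, new_data = mtcars %>% slice(21:32)) %>% nrow()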

recipes

Preprocessing Tools to Create Design Matrices

v0.1.16

MIT + file LICENSE

Authors

Max Kuhn [aut, cre], Hadley Wickham [aut], RStudio [cph]

Initial release