Sample rows using dplyr
step_sample creates a specification of a recipe step that will sample rows using dplyr::sample_n() or dplyr::sample_frac().
step_sample(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  size = NULL,
  replace = FALSE,
  skip = TRUE,
  id = rand_id("sample")
)

## S3 method for class 'step_sample'
tidy(x, ...)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe.
... | Argument ignored; included for consistency with other step specification functions. For the tidy method, these are not currently used.
role | Not used by this step since no new variables are created.
trained | A logical to indicate if the quantities for preprocessing have been estimated.
size | An integer or fraction. If the value is within (0, 1), dplyr::sample_frac() is applied to the data. If an integer value of 1 or greater is used, dplyr::sample_n() is applied. The default of NULL uses dplyr::sample_n() with the size of the training set (or smaller for smaller new data).
replace | Sample with or without replacement?
skip | A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
id | A character string that is unique to this step to identify it.
x | A step_sample object.
An updated version of recipe with the new step added to the sequence of existing steps (if any). For the tidy method, a tibble with columns size, replace, and id.
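As a minimal sketch of the tidy method described above, the following inspects the step before the recipe is prepped. The size value of 10 and the id "sample_rows" are arbitrary choices for illustration, not defaults of the package.

```r
library(recipes)

# Illustrative values: size = 10 and id = "sample_rows" are assumptions
rec <- recipe(~ ., data = mtcars) %>%
  step_sample(size = 10, id = "sample_rows")

# tidy() on the first (and only) step returns a tibble with the
# columns size, replace, and id
out <- tidy(rec, number = 1)
out
```

Supplying an explicit id makes the step easier to locate when a recipe contains several steps.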
This step can entirely remove observations (rows of data), which can have unintended and/or problematic consequences when applying the step to new data later via bake.recipe(). Consider whether skip = TRUE or skip = FALSE is more appropriate in any given use case. In most instances that affect the rows of the data being predicted, this step probably should not be applied at all; instead, execute operations like this outside and before starting a preprocessing recipe().
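One way to follow the advice above is to draw the sample before the recipe is created, so that the recipe itself never drops rows. The sketch below is only an illustration: the seed, the sample size of 20, the mpg outcome, and the use of step_normalize() are all assumptions chosen for the example.

```r
library(recipes)

# Hypothetical seed and sample size, used only to make the sketch reproducible
set.seed(123)
train_rows <- dplyr::slice_sample(mtcars, n = 20)

# The recipe contains no row-removing step; sampling already happened above
rec <- recipe(mpg ~ ., data = train_rows) %>%
  step_normalize(all_predictors()) %>%
  prep()

# New data passes through with every row intact
n_new <- bake(rec, new_data = mtcars) %>% nrow()
n_new
```

Because the sampling happens outside the recipe, there is no need to reason about skip when baking new data.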
library(recipes)

# Uses `sample_n`
recipe( ~ ., data = mtcars) %>%
  step_sample(size = 1) %>%
  prep(training = mtcars) %>%
  bake(new_data = NULL) %>%
  nrow()

# Uses `sample_frac`
recipe( ~ ., data = mtcars) %>%
  step_sample(size = 0.9999) %>%
  prep(training = mtcars) %>%
  bake(new_data = NULL) %>%
  nrow()

# Uses `sample_n` and returns _at maximum_ 20 samples.
smaller_cars <-
  recipe( ~ ., data = mtcars) %>%
  step_sample() %>%
  prep(training = mtcars %>% slice(1:20))

# Rows retained from the 20-row training set
bake(smaller_cars, new_data = NULL) %>% nrow()

# New data with fewer rows than the training set
bake(smaller_cars, new_data = mtcars %>% slice(21:32)) %>% nrow()