rsample: vfold_cv – R documentation

Become an expert in R — Interactive courses, Cheat Sheets, certificates and more!

vfold_cv

V-Fold Cross-Validation

Description

V-fold cross-validation randomly splits the data into V groups of roughly equal size (called "folds"). A resample of the analysis data consisted of V-1 of the folds while the assessment set contains the final fold. In basic V-fold cross-validation (i.e. no repeats), the number of resamples is equal to V.

Usage

vfold_cv(data, v = 10, repeats = 1, strata = NULL, breaks = 4, pool = 0.1, ...)

Arguments

`data`	A data frame.
`v`	The number of partitions of the data set.
`repeats`	The number of times to repeat the V-fold partitioning.
`strata`	A variable that is used to conduct stratified sampling to create the folds. This could be a single character value or a variable name that corresponds to a variable that exists in the data frame.
`breaks`	A single number giving the number of bins desired to stratify a numeric stratification variable.
`pool`	A proportion of data used to determine if a particular group is too small and should be pooled into another group. We do not recommend decreasing this argument below its default of 0.1 because of the dangers of stratifying groups that are too small.
`...`	Not currently used.

Details

The strata argument causes the random sampling to be conducted within the stratification variable. This can help ensure that the number of data points in the analysis data is equivalent to the proportions in the original data set. (Strata below 10% of the total are pooled together by default.) When more than one repeat is requested, the basic V-fold cross-validation is conducted each time. For example, if three repeats are used with v = 10, there are a total of 30 splits which as three groups of 10 that are generated separately.

Value

A tibble with classes vfold_cv, rset, tbl_df, tbl, and data.frame. The results include a column for the data split objects and one or more identification variables. For a single repeat, there will be one column called id that has a character string with the fold identifier. For repeats, id is the repeat number and an additional column called id2 that contains the fold information (within repeat).

Examples

vfold_cv(mtcars, v = 10)
vfold_cv(mtcars, v = 10, repeats = 2)

library(purrr)
data(wa_churn, package = "modeldata")

set.seed(13)
folds1 <- vfold_cv(wa_churn, v = 5)
map_dbl(folds1$splits,
        function(x) {
          dat <- as.data.frame(x)$churn
          mean(dat == "Yes")
        })

set.seed(13)
folds2 <- vfold_cv(wa_churn, strata = churn, v = 5)
map_dbl(folds2$splits,
        function(x) {
          dat <- as.data.frame(x)$churn
          mean(dat == "Yes")
        })

set.seed(13)
folds3 <- vfold_cv(wa_churn, strata = tenure, breaks = 6, v = 5)
map_dbl(folds3$splits,
        function(x) {
          dat <- as.data.frame(x)$churn
          mean(dat == "Yes")
        })

rsample

General Resampling Infrastructure

v0.1.0

MIT + file LICENSE

Authors

Julia Silge [aut, cre] (<https://orcid.org/0000-0002-3671-836X>), Fanny Chow [aut], Max Kuhn [aut], Hadley Wickham [aut], RStudio [cph]

Initial release