Ensure that data contains required column names
validate - asserts the following:
The column names of data must contain all original_names.
check - returns the following:
ok: A logical. Does the check pass?
missing_names: A character vector. The missing column names.
validate_column_names(data, original_names)

check_column_names(data, original_names)
data | A data frame to check.
original_names | A character vector. The original column names.
A special error is thrown if the missing column is named ".outcome". This only happens when mold() is called using the xy-method and a vector y value is supplied rather than a data frame or matrix. In that case, y is coerced to a data frame and given the automatic column name ".outcome", which is the name forge() then looks for. If the user requests outcomes using forge(..., outcomes = TRUE) but the supplied new_data does not contain the required ".outcome" column, a special error is thrown telling them what to do. See the examples!
validate_column_names() returns data invisibly.
check_column_names() returns a named list of two components, ok and missing_names.
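The shape of the returned list can be illustrated in base R. This is only a minimal sketch of the idea described above, not hardhat's actual implementation; the function name sketch_check_column_names() is hypothetical, and the use of setdiff() is an assumption about how such a check could be written.

```r
# Hypothetical base-R sketch (not hardhat's implementation):
# find which required names are absent from the data's column names.
sketch_check_column_names <- function(data, original_names) {
  missing_names <- setdiff(original_names, colnames(data))

  list(
    ok = length(missing_names) == 0L, # TRUE when nothing is missing
    missing_names = missing_names     # character vector of absent columns
  )
}

# All required columns present
sketch_check_column_names(mtcars, colnames(mtcars))

# Two columns removed, so `ok` is FALSE
sketch_check_column_names(mtcars[, -c(3, 4)], colnames(mtcars))
```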
hardhat provides validation functions at two levels.
check_*(): check a condition, and return a list. The list always contains at least one element, ok, a logical that specifies if the check passed. Each check also has check-specific elements in the returned list that can be used to construct meaningful error messages.
validate_*(): check a condition, and error if it does not pass. These functions call their corresponding check function, and then provide a default error message. If you, as a developer, want a different error message, then call the check_*() function yourself, and provide your own validation function.
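As an illustration of the second level, a developer-defined validation function might wrap check_column_names() as follows. This is a sketch under stated assumptions: my_validate_column_names() and its error wording are hypothetical and not part of hardhat; only check_column_names() and its ok and missing_names elements come from the documentation above.

```r
library(hardhat)

# Hypothetical helper (not part of hardhat): reuse the check,
# but raise an error with a custom message.
my_validate_column_names <- function(data, original_names) {
  result <- check_column_names(data, original_names)

  if (!result$ok) {
    stop(
      "`new_data` is missing required columns: ",
      paste(result$missing_names, collapse = ", "),
      call. = FALSE
    )
  }

  # Mirror validate_column_names() by returning the data invisibly
  invisible(data)
}

original_names <- colnames(mtcars)
bad_test <- mtcars[, -c(3, 4)]

# Errors with the custom message, since two columns were dropped
try(my_validate_column_names(bad_test, original_names))
```

Keeping the check and the error separate means the same check result can drive different messages in different packages without duplicating the comparison logic.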
Other validation functions: validate_no_formula_duplication(), validate_outcomes_are_binary(), validate_outcomes_are_factors(), validate_outcomes_are_numeric(), validate_outcomes_are_univariate(), validate_prediction_size(), validate_predictors_are_numeric()
library(hardhat)

# ---------------------------------------------------------------------------

original_names <- colnames(mtcars)

test <- mtcars
bad_test <- test[, -c(3, 4)]

# All good
check_column_names(test, original_names)

# Missing 2 columns
check_column_names(bad_test, original_names)

# Will error
try(validate_column_names(bad_test, original_names))

# ---------------------------------------------------------------------------
# Special error when `.outcome` is missing

train <- iris[1:100, ]
test <- iris[101:150, ]

train_x <- subset(train, select = -Species)
train_y <- train$Species

# Here, y is a vector
processed <- mold(train_x, train_y)

# So the default column name is `".outcome"`
processed$outcomes

# It doesn't affect forge() normally
forge(test, processed$blueprint)

# But if the outcome is requested, and `".outcome"`
# is not present in `new_data`, an error is thrown
# with very specific instructions
try(forge(test, processed$blueprint, outcomes = TRUE))

# To get this to work, just create an .outcome column in new_data
test$.outcome <- test$Species

forge(test, processed$blueprint, outcomes = TRUE)