General Interface for Decision Tree Models
decision_tree() is a way to generate a specification of a model before fitting and allows the model to be created using different packages in R or via Spark. The main arguments for the model are:
cost_complexity: The cost/complexity parameter (a.k.a. Cp) used by CART models (rpart only).
tree_depth: The maximum depth of a tree (rpart and spark only).
min_n: The minimum number of data points in a node that are required for the node to be split further.
These arguments are converted to their specific names at the time that the model is fit. Other options and arguments can be set using set_engine(). If left to their defaults here (NULL), the values are taken from the underlying model functions. If parameters need to be modified, update() can be used in lieu of recreating the object from scratch.
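As a minimal sketch of the split between main arguments and engine options, the specification below sets min_n directly and passes an engine-specific option through set_engine(). The parms argument belongs to rpart::rpart() and is chosen here only for illustration; it is not something this page prescribes.

```r
library(parsnip)

# Main arguments are set in decision_tree(); arguments left as NULL
# fall back to the defaults of the underlying model function.
spec <- decision_tree(mode = "classification", min_n = 10)

# Engine-specific options are passed through set_engine().
# `parms` is an rpart::rpart() argument, used here as an illustrative choice.
spec <- set_engine(spec, "rpart", parms = list(split = "information"))

spec
```

Nothing is fit at this point; the object only records the choices that will be handed to the engine later.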
decision_tree(
  mode = "unknown",
  cost_complexity = NULL,
  tree_depth = NULL,
  min_n = NULL
)

## S3 method for class 'decision_tree'
update(
  object,
  parameters = NULL,
  cost_complexity = NULL,
  tree_depth = NULL,
  min_n = NULL,
  fresh = FALSE,
  ...
)
mode | A single character string for the type of model. Possible values for this model are "unknown", "regression", or "classification". |
cost_complexity | A positive number for the cost/complexity parameter (a.k.a. Cp). |
tree_depth | An integer for the maximum depth of the tree. |
min_n | An integer for the minimum number of data points in a node that are required for the node to be split further. |
object | A decision tree model specification. |
parameters | A 1-row tibble or named list with main parameters to update. If the individual arguments are used, these will supersede the values in parameters. |
fresh | A logical for whether the arguments should be modified in place or replaced wholesale. |
... | Not used for update(). |
The model can be created with the fit() function using the following engines:
R: "rpart" (the default) or "C5.0" (classification only)
Spark: "spark"
Note that, for rpart models, both cost_complexity and tree_depth can be specified, but the package will give precedence to cost_complexity. Also, with tree_depth values greater than 30, rpart will give nonsense results on 32-bit machines.
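A short sketch of fitting with the default engine may help here. It assumes the rpart package is installed and uses the built-in mtcars data purely as an example; the particular values for tree_depth and min_n are arbitrary.

```r
library(parsnip)

# Specification for a regression tree; the argument values are illustrative.
tree_spec <- decision_tree(mode = "regression", tree_depth = 3, min_n = 10)
tree_spec <- set_engine(tree_spec, "rpart")

# fit() uses the formula interface and hands the data to rpart.
tree_fit <- fit(tree_spec, mpg ~ ., data = mtcars)

# Predictions come back as a tibble with one row per row of new_data.
preds <- predict(tree_fit, new_data = head(mtcars))
preds
```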
Engines may have pre-set default arguments when executing the model fit call. For this type of model, the templates of the fit calls are shown below:
decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression") %>%
  translate()

## Decision Tree Model Specification (regression)
##
## Computational engine: rpart
##
## Model fit template:
## rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg())

decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification") %>%
  translate()

## Decision Tree Model Specification (classification)
##
## Computational engine: rpart
##
## Model fit template:
## rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
Note that rpart::rpart() does not require factor predictors to be converted to indicator variables. fit() does not affect the encoding of the predictor values (i.e. factors stay factors) for this model.
decision_tree() %>%
  set_engine("C5.0") %>%
  set_mode("classification") %>%
  translate()

## Decision Tree Model Specification (classification)
##
## Computational engine: C5.0
##
## Model fit template:
## parsnip::C5.0_train(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     trials = 1)
Note that C50::C5.0() does not require factor predictors to be converted to indicator variables. fit() does not affect the encoding of the predictor values (i.e. factors stay factors) for this model.
decision_tree() %>%
  set_engine("spark") %>%
  set_mode("regression") %>%
  translate()

## Decision Tree Model Specification (regression)
##
## Computational engine: spark
##
## Model fit template:
## sparklyr::ml_decision_tree_regressor(x = missing_arg(), formula = missing_arg(),
##     seed = sample.int(10^5, 1))

decision_tree() %>%
  set_engine("spark") %>%
  set_mode("classification") %>%
  translate()

## Decision Tree Model Specification (classification)
##
## Computational engine: spark
##
## Model fit template:
## sparklyr::ml_decision_tree_classifier(x = missing_arg(), formula = missing_arg(),
##     seed = sample.int(10^5, 1))
fit() does not affect the encoding of the predictor values (i.e. factors stay factors) for this model.
The standardized parameter names in parsnip can be mapped to their original names in each engine that has main parameters. Each engine typically has a different default value (shown in parentheses) for each parameter.
parsnip | rpart | C5.0 | spark |
tree_depth | maxdepth (30) | NA | max_depth (5) |
min_n | minsplit (20) | minCases (2) | min_instances_per_node (1) |
cost_complexity | cp (0.01) | NA | NA |
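The mapping in the table can be inspected with translate(). The sketch below sets all three main arguments for the rpart engine; the values are arbitrary, and the lookup through translated$method$fit$args reflects how the translated object is assumed to be structured rather than a documented accessor.

```r
library(parsnip)

# Set all three standardized parameters; the values are illustrative only.
spec <- decision_tree(
  mode = "regression",
  tree_depth = 4,
  min_n = 10,
  cost_complexity = 0.001
)
spec <- set_engine(spec, "rpart")

# translate() rewrites the standardized names into the engine's own names.
translated <- translate(spec)
translated

# Assumed internal structure: the argument names of the fit template,
# which should include the rpart names listed in the table above.
engine_arg_names <- names(translated$method$fit$args)
engine_arg_names
```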
For models created using the spark engine, there are several differences to consider. First, only the formula interface via fit() is available; using fit_xy() will generate an error. Second, the predictions will always be in a spark table format. The names will be the same as documented but without the dots. Third, there is no equivalent to factor columns in spark tables, so class predictions are returned as character columns. Fourth, to retain the model object for a new R session (via save()), the model$fit element of the parsnip object should be serialized via ml_save(object$fit) and separately saved to disk. In a new session, the object can be reloaded and reattached to the parsnip object.
show_engines("decision_tree")

decision_tree(mode = "classification", tree_depth = 5)

# Parameters can be represented by a placeholder:
decision_tree(mode = "regression", cost_complexity = varying())

model <- decision_tree(cost_complexity = 10, min_n = 3)
model
update(model, cost_complexity = 1)
update(model, cost_complexity = 1, fresh = TRUE)