General Interface for Boosted Trees
boost_tree() is a way to generate a specification of a model before fitting and allows the model to be created using different packages in R or via Spark. The main arguments for the model are:
mtry: The number of predictors that will be randomly sampled at each split when creating the tree models.

trees: The number of trees contained in the ensemble.

min_n: The minimum number of data points in a node that is required for the node to be split further.

tree_depth: The maximum depth of the tree (i.e. number of splits).

learn_rate: The rate at which the boosting algorithm adapts from iteration-to-iteration.

loss_reduction: The reduction in the loss function required to split further.

sample_size: The amount of data exposed to the fitting routine.

stop_iter: The number of iterations without improvement before stopping.
These arguments are converted to their specific names at the time that the model is fit. Other options and arguments can be set using the set_engine() function. If left to their defaults here (NULL), the values are taken from the underlying model functions. If parameters need to be modified, update() can be used in lieu of recreating the object from scratch.
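For example, a minimal sketch of setting an engine-specific option and then updating a main argument (nthread here mirrors the xgboost fit template shown later):

library(parsnip)

# Main arguments use parsnip's standardized names; engine-specific
# options (here, xgboost's nthread) are passed through set_engine().
spec <- boost_tree(trees = 100, learn_rate = 0.05) %>%
  set_engine("xgboost", nthread = 2) %>%
  set_mode("regression")

# update() tweaks main arguments without rebuilding the specification.
spec <- update(spec, trees = 200)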
boost_tree(
  mode = "unknown",
  mtry = NULL,
  trees = NULL,
  min_n = NULL,
  tree_depth = NULL,
  learn_rate = NULL,
  loss_reduction = NULL,
  sample_size = NULL,
  stop_iter = NULL
)

## S3 method for class 'boost_tree'
update(
  object,
  parameters = NULL,
  mtry = NULL,
  trees = NULL,
  min_n = NULL,
  tree_depth = NULL,
  learn_rate = NULL,
  loss_reduction = NULL,
  sample_size = NULL,
  stop_iter = NULL,
  fresh = FALSE,
  ...
)
mode: A single character string for the type of model. Possible values for this model are "unknown", "regression", or "classification".

mtry: A number for the number (or proportion) of predictors that will be randomly sampled at each split when creating the tree models.

trees: An integer for the number of trees contained in the ensemble.

min_n: An integer for the minimum number of data points in a node that is required for the node to be split further.

tree_depth: An integer for the maximum depth of the tree (i.e. number of splits).

learn_rate: A number for the rate at which the boosting algorithm adapts from iteration-to-iteration.

loss_reduction: A number for the reduction in the loss function required to split further.

sample_size: A number for the number (or proportion) of data that is exposed to the fitting routine. For xgboost, the sampling is done at each iteration, while C5.0 samples once during training.

stop_iter: The number of iterations without improvement before stopping.

object: A boosted tree model specification.

parameters: A 1-row tibble or named list with main parameters to update. If the individual arguments are used, these will supersede the values in parameters.

fresh: A logical for whether the arguments should be modified in place or replaced wholesale.

...: Not used for update().
The data given to the function are not saved and are only used to determine the mode of the model. For boost_tree(), the possible modes are "regression" and "classification".
The model can be fit with the fit() function using the following engines:

R: "xgboost" (the default), "C5.0"
Spark: "spark"

For this model, other packages may add additional engines. Use show_engines() to see the current set of engines.
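As a quick illustration, a specification can be fit in a single chain. This is a minimal sketch using the built-in iris data with the default xgboost engine:

library(parsnip)

# Fit a classification boosted tree on the built-in iris data.
fit_xgb <- boost_tree(trees = 50) %>%
  set_engine("xgboost") %>%
  set_mode("classification") %>%
  fit(Species ~ ., data = iris)

predict(fit_xgb, new_data = iris[c(1, 51, 101), ])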
update() returns an updated model specification.
Engines may have pre-set default arguments when executing the model fit call. For this type of model, the templates of the fit calls are below:
boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  translate()
## Boosted Tree Model Specification (regression)
##
## Computational engine: xgboost
##
## Model fit template:
## parsnip::xgb_train(x = missing_arg(), y = missing_arg(), nthread = 1,
##     verbose = 0)
boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("classification") %>%
  translate()
## Boosted Tree Model Specification (classification)
##
## Computational engine: xgboost
##
## Model fit template:
## parsnip::xgb_train(x = missing_arg(), y = missing_arg(), nthread = 1,
##     verbose = 0)
Note that, for most engines to boost_tree(), the sample_size argument is in terms of the number of training set points. The xgboost package parameterizes this as the proportion of training set samples instead. When using the tune package, this conversion occurs automatically. If you would like to use a custom range when tuning sample_size, the dials::sample_prop() function can be used in that case. For example, using a parameter set:
mod <- boost_tree(sample_size = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# update the parameters using the `dials` function
mod_param <-
  mod %>%
  parameters() %>%
  update(sample_size = sample_prop(c(0.4, 0.9)))
For this engine, tuning over trees is very efficient since the same model object can be used to make predictions over multiple values of trees.
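For example, a sketch of this "submodel" pattern with multi_predict(), assuming a regression fit on the built-in mtcars data:

library(parsnip)

# Fit once with the largest number of trees of interest ...
fit_xgb <- boost_tree(trees = 50) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  fit(mpg ~ ., data = mtcars)

# ... then predict at several smaller ensemble sizes without refitting.
multi_predict(fit_xgb, new_data = mtcars[1:3, ], trees = c(10, 25, 50))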
Finally, note that xgboost models require that non-numeric predictors (e.g., factors) be converted to dummy variables or some other numeric representation. By default, when using fit() with xgboost, a one-hot encoding is used to convert factor predictors to indicator variables.
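As a small sketch of this behavior, using mtcars with one column converted to a factor:

library(parsnip)

df <- mtcars
df$cyl <- factor(df$cyl)  # a factor predictor

# fit() one-hot encodes cyl into indicator columns before
# handing the data to xgboost.
fit_dummies <- boost_tree(trees = 20) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  fit(mpg ~ cyl + disp, data = df)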
boost_tree() %>%
  set_engine("C5.0") %>%
  set_mode("classification") %>%
  translate()
## Boosted Tree Model Specification (classification)
##
## Computational engine: C5.0
##
## Model fit template:
## parsnip::C5.0_train(x = missing_arg(), y = missing_arg(), weights = missing_arg())
Note that C50::C5.0() does not require factor predictors to be converted to indicator variables. fit() does not affect the encoding of the predictor values (i.e. factors stay factors) for this model.
For this engine, tuning over trees is very efficient since the same model object can be used to make predictions over multiple values of trees.
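A minimal sketch of a C5.0 fit on the built-in iris data, also using multi_predict() to exploit the submodel trick:

library(parsnip)

# Factor predictors can be passed as-is; no dummy variables are created.
fit_c5 <- boost_tree(trees = 20) %>%
  set_engine("C5.0") %>%
  set_mode("classification") %>%
  fit(Species ~ ., data = iris)

# Predictions at several numbers of boosting iterations, one model object.
multi_predict(fit_c5, new_data = iris[c(1, 51, 101), ], trees = c(5, 10, 20))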
boost_tree() %>%
  set_engine("spark") %>%
  set_mode("regression") %>%
  translate()
## Boosted Tree Model Specification (regression)
##
## Computational engine: spark
##
## Model fit template:
## sparklyr::ml_gradient_boosted_trees(x = missing_arg(), formula = missing_arg(),
##     type = "regression", seed = sample.int(10^5, 1))
boost_tree() %>%
  set_engine("spark") %>%
  set_mode("classification") %>%
  translate()
## Boosted Tree Model Specification (classification)
##
## Computational engine: spark
##
## Model fit template:
## sparklyr::ml_gradient_boosted_trees(x = missing_arg(), formula = missing_arg(),
##     type = "classification", seed = sample.int(10^5, 1))
fit() does not affect the encoding of the predictor values (i.e. factors stay factors) for this model.
The standardized parameter names in parsnip can be mapped to their original names in each engine that has main parameters. Each engine typically has a different default value (shown in parentheses) for each parameter.
| parsnip        | xgboost              | C5.0         | spark                                |
|----------------|----------------------|--------------|--------------------------------------|
| tree_depth     | max_depth (6)        | NA           | max_depth (5)                        |
| trees          | nrounds (15)         | trials (15)  | max_iter (20)                        |
| learn_rate     | eta (0.3)            | NA           | step_size (0.1)                      |
| mtry           | colsample_bytree (1) | NA           | feature_subset_strategy (see below)  |
| min_n          | min_child_weight (1) | minCases (2) | min_instances_per_node (1)           |
| loss_reduction | gamma (0)            | NA           | min_info_gain (0)                    |
| sample_size    | subsample (1)        | sample (0)   | subsampling_rate (1)                 |
| stop_iter      | early_stop           | NA           | NA                                   |
For spark, the default mtry is the square root of the number of predictors for classification, and one-third of the predictors for regression.
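To see one of these mappings in action, translate() shows the engine-native name that a parsnip argument is converted to (a sketch; trees becomes nrounds for xgboost):

library(parsnip)

# The printed fit template will include nrounds = 100.
boost_tree(trees = 100) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  translate()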
For models created using the spark engine, there are several differences to consider. First, only the formula interface via fit() is available; using fit_xy() will generate an error. Second, the predictions will always be in a spark table format. The names will be the same as documented but without the dots. Third, there is no equivalent to factor columns in spark tables, so class predictions are returned as character columns. Fourth, to retain the model object for a new R session (via save()), the model$fit element of the parsnip object should be serialized via ml_save(object$fit) and separately saved to disk. In a new session, the object can be reloaded and reattached to the parsnip object.
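A heavily simplified sketch of that save/reload workflow (it assumes a local Spark installation; "gbt_model" is a hypothetical path):

library(sparklyr)
library(parsnip)

sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars)

spark_fit <- boost_tree(trees = 20) %>%
  set_engine("spark") %>%
  set_mode("regression") %>%
  fit(mpg ~ ., data = cars_tbl)

# Persist only the Spark-side model object.
ml_save(spark_fit$fit, "gbt_model")

# In a new session: reload and reattach to the parsnip object.
# spark_fit$fit <- ml_load(sc, "gbt_model")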
show_engines("boost_tree")

boost_tree(mode = "classification", trees = 20)

# Parameters can be represented by a placeholder:
boost_tree(mode = "regression", mtry = varying())

model <- boost_tree(mtry = 10, min_n = 3)
model

update(model, mtry = 1)
update(model, mtry = 1, fresh = TRUE)

param_values <- tibble::tibble(mtry = 10, tree_depth = 5)

model %>% update(param_values)
model %>% update(param_values, mtry = 3)

param_values$verbose <- 0
# Fails due to engine argument:
# model %>% update(param_values)