Binary and Multiway Splits
A class for representing multiway splits and functions for computing on splits.
partysplit(varid, breaks = NULL, index = NULL, right = TRUE,
    prob = NULL, info = NULL)
kidids_split(split, data, vmatch = 1:length(data), obs = NULL)
character_split(split, data = NULL,
    digits = getOption("digits") - 2)
varid_split(split)
breaks_split(split)
index_split(split)
right_split(split)
prob_split(split)
info_split(split)
varid | an integer specifying the variable to split in, i.e., a column number in data.
breaks | a numeric vector of split points.
index | an integer vector containing a contiguous sequence from one to the number of kid nodes. May contain NAs.
right | a logical, indicating if the intervals defined by breaks should be closed on the right (and open on the left) or vice versa.
prob | a numeric vector representing a probability distribution over kid nodes.
info | additional information.
split | an object of class partysplit.
data | a list or data.frame.
vmatch | a permutation of the variable numbers in data.
obs | a logical or integer vector indicating a subset of the observations in data.
digits | minimal number of significant digits.
A split is basically a function that maps data, more specifically a
partitioning variable, to a set of integers indicating the kid nodes to
send observations to. Objects of class partysplit describe such a
function and can be set up via the partysplit() constructor.
The variables are available in a list or data.frame (here called data)
and varid specifies the partitioning variable, i.e., the variable or
list element to split in.
The constructor partysplit() doesn't have access to the actual data,
i.e., it only stores a split and doesn't estimate one.
kidids_split(split, data) actually partitions the data
data[obs, varid_split(split)] and assigns an integer (giving the kid
node number) to each observation. If vmatch is given, the variable
vmatch[varid_split(split)] is used.
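The variable lookup described above can be sketched in base R, without partykit. The helper select_split_variable() and the toy data are hypothetical and only mimic the indexing data[obs, vmatch[varid]]; they are not partykit's implementation.

```r
## Hypothetical helper (not part of partykit): pick the partitioning
## variable the way the text describes, honouring vmatch and obs.
select_split_variable <- function(data, varid,
                                  vmatch = seq_along(data), obs = NULL) {
  x <- data[[vmatch[varid]]]        # variable actually used for the split
  if (!is.null(obs)) x <- x[obs]    # optional subset of observations
  x
}

## toy data, made up for this illustration
toy <- data.frame(a = c(1, 6, 3, 8), b = c(10, 20, 30, 40))

## default vmatch: varid 1 refers to column 1 ("a")
x_default <- select_split_variable(toy, varid = 1L)

## vmatch = c(2, 1) redirects varid 1 to column 2 ("b")
x_swapped <- select_split_variable(toy, varid = 1L, vmatch = c(2L, 1L))

## obs restricts the result to observations 2 and 4
x_subset <- select_split_variable(toy, varid = 1L, obs = c(2L, 4L))
```

Such a remapping may be useful when a split was set up for one column layout and is later applied to data whose columns come in a different order.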
character_split() returns a character representation of its split
argument. The remaining functions defined here are accessor functions
for partysplit objects.
The numeric vector breaks defines how the range of the partitioning
variable (after coercing to a numeric via as.numeric) is divided into
intervals (like in cut) and may be NULL. These intervals are
represented by the numbers one to length(breaks) + 1.
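The interval numbering can be illustrated with cut() itself. This is a base-R sketch with made-up values, assuming that right has the same meaning as the right argument of cut(); it is not taken from partykit's code.

```r
## Two breaks define length(breaks) + 1 = 3 intervals.
x <- c(2.5, 3.0, 3.2, 3.5, 4.0)
breaks <- c(3, 3.5)

## right = TRUE: intervals (-Inf, 3], (3, 3.5], (3.5, Inf)
ids_right <- cut(x, breaks = c(-Inf, breaks, Inf),
                 labels = FALSE, right = TRUE)

## right = FALSE: intervals [-Inf, 3), [3, 3.5), [3.5, Inf)
ids_left <- cut(x, breaks = c(-Inf, breaks, Inf),
                labels = FALSE, right = FALSE)

ids_right   # 1 1 2 2 3
ids_left    # 1 2 2 3 3
```

Only observations that fall exactly on a break (here 3.0 and 3.5) change interval when right is flipped.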
index assigns these length(breaks) + 1 intervals to one of at least two
kid nodes. Thus, index is a vector of integers where each element
corresponds to one element in a list kids containing partynode objects,
see partynode for details. The vector index may contain NAs; in that
case, the corresponding values of the splitting variable are treated as
missing (for example, factor levels that are not present in the
learning sample).
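The role of index amounts to a lookup from interval number to kid node number. The following base-R sketch uses made-up interval numbers and three illustrative index vectors; it shows the idea only and is not partykit's implementation.

```r
## interval numbers 1..3, e.g. as produced by the breaks c(3, 3.5)
interval <- c(1L, 1L, 2L, 2L, 3L)

## reverse the order: interval 1 -> kid 3, interval 3 -> kid 1
index_rev <- 3:1
kids_rev <- index_rev[interval]     # 3 3 2 2 1

## merge intervals 1 and 2 into kid 1, send interval 3 to kid 2
index_bin <- c(1L, 1L, 2L)
kids_bin <- index_bin[interval]     # 1 1 2 2 2

## an NA entry: observations in interval 2 are treated as missing
index_na <- c(1L, NA, 2L)
kids_na <- index_na[interval]       # 1 1 NA NA 2
```

In each case the kid numbers still form a contiguous sequence starting at one, which is what the index argument requires.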
Either breaks or index must be given. When breaks is NULL, it is
assumed that the partitioning variable itself has storage mode integer
(e.g., is a factor).
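For a factor, the integer codes of the levels can play the role of the interval numbers directly. The sketch below uses a made-up factor and plain base R to show why no breaks are needed in that case; it is an illustration, not partykit's code.

```r
## a factor is stored as integer codes plus a levels attribute
f <- factor(c("a", "c", "b", "a"), levels = c("a", "b", "c"))
mode_f <- storage.mode(f)           # "integer"

## level codes: a = 1, b = 2, c = 3
codes <- as.integer(f)              # 1 3 2 1

## binary split: levels a and b go to kid 1, level c goes to kid 2
index <- c(1L, 1L, 2L)
kids <- index[codes]                # 1 2 1 1
```

Here index has one entry per factor level, so its length equals nlevels(f).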
prob defines a probability distribution over all kid nodes which is
used for random splitting when a deterministic split isn't possible
(due to missing values, for example).
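The text does not spell out how the random draw is made. One plausible reading, assuming that undetermined observations are assigned by sampling kid numbers with weights prob, is sketched below in base R with made-up values. This is an illustration of the idea, not partykit's code.

```r
set.seed(1)                         # only to make the sketch reproducible

## kid assignments where some observations could not be split
kids <- c(1L, NA, 2L, NA, NA)

## assumed probability distribution over the two kid nodes
prob <- c(0.25, 0.75)

## fill the undetermined entries by a weighted random draw
undetermined <- is.na(kids)
kids[undetermined] <- sample(seq_along(prob), size = sum(undetermined),
                             replace = TRUE, prob = prob)
```

Observations that already had a kid number keep it; only the NA entries are drawn.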
info takes arbitrary user-specified information.
The constructor partysplit() returns an object of class partysplit:
varid | an integer specifying the variable to split in, i.e., a column number in data,
breaks | a numeric vector of split points,
index | an integer vector containing a contiguous sequence from one to the number of kid nodes,
right | a logical, indicating if the intervals defined by breaks should be closed on the right (and open on the left) or vice versa,
prob | a numeric vector representing a probability distribution over kid nodes,
info | additional information.
kidids_split() returns an integer vector describing the partition of
the observations into kid nodes.
character_split() gives a character representation of the split and the
remaining functions return the corresponding slots of partysplit
objects.
Hothorn T, Zeileis A (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research, 16, 3905–3909.
library("partykit")
data("iris", package = "datasets")

## binary split in numeric variable `Sepal.Length'
sl5 <- partysplit(which(names(iris) == "Sepal.Length"),
    breaks = 5)
character_split(sl5, data = iris)
table(kidids_split(sl5, data = iris), iris$Sepal.Length <= 5)

## multiway split in numeric variable `Sepal.Width',
## higher values go to the first kid, smallest values
## to the last kid
sw23 <- partysplit(which(names(iris) == "Sepal.Width"),
    breaks = c(3, 3.5), index = 3:1)
character_split(sw23, data = iris)
## cross-tabulate against the same break points used in the split
table(kidids_split(sw23, data = iris),
    cut(iris$Sepal.Width, breaks = c(-Inf, 3, 3.5, Inf)))

## binary split in factor `Species'
sp <- partysplit(which(names(iris) == "Species"),
    index = c(1L, 1L, 2L))
character_split(sp, data = iris)
table(kidids_split(sp, data = iris), iris$Species)

## multiway split in factor `Species'
sp <- partysplit(which(names(iris) == "Species"), index = 1:3)
character_split(sp, data = iris)
table(kidids_split(sp, data = iris), iris$Species)

## multiway split in numeric variable `Sepal.Width'
sp <- partysplit(which(names(iris) == "Sepal.Width"),
    breaks = quantile(iris$Sepal.Width))
character_split(sp, data = iris)