Turn a categorical variable into a set of binary indicators
Given a categorical variable, this function creates a set of indicator variables for the various possible sets of levels.
binarizeCategoricalVariable(
   x,
   levelOrder = NULL,
   ignore = NULL,
   minCount = 3,
   val1 = 0, val2 = 1,
   includePairwise = TRUE,
   includeLevelVsAll = FALSE,
   dropFirstLevelVsAll = FALSE,
   dropUninformative = TRUE,
   namePrefix = "",
   levelSep = NULL,
   nameForAll = "all",
   levelSep.pairwise = if (length(levelSep)==0) ".vs." else levelSep,
   levelSep.vsAll = if (length(levelSep)==0)
                       (if (nameForAll=="") "" else ".vs.") else levelSep,
   checkNames = FALSE,
   includeLevelInformation = TRUE)
x | A vector with categorical values.
levelOrder | Optional specification of the levels (unique values) of
ignore | Optional specification of levels of
minCount | Levels of
val1 | Value for the lower level in binary comparisons.
val2 | Value for the higher level in binary comparisons.
includePairwise | Logical: should pairwise binary indicators be included? For each pair of levels, the indicator is
includeLevelVsAll | Logical: should binary indicators for each level be included? The indicator is
dropFirstLevelVsAll | Logical: should the column representing first level vs. all be dropped? This makes the resulting matrix of indicators usable for regression models.
dropUninformative | Logical: should uninformative (constant) columns be dropped?
namePrefix | Prefix to be used in column names of the output.
nameForAll | When naming columns that represent a level vs. all others,
levelSep | Separator for levels to be used in column names of the output. If
levelSep.pairwise | Separator for levels to be used in column names for pairwise indicators in the output.
levelSep.vsAll | Separator for levels to be used in column names for level vs. all indicators in the output.
checkNames | Logical: should the names of the output be made into syntactically correct R language names?
includeLevelInformation | Logical: should information about which levels are represented by which columns be included in the attributes of the output?
The function creates two types of indicators. The first is one level (unique value) of x vs. all others, i.e., for a given level, the indicator is val2 (usually 1) for all elements of x that equal the level, and val1 (usually 0) otherwise. Column names for these indicators are the concatenation of namePrefix, the level, levelSep.vsAll and nameForAll. The level vs. all indicators are created for all levels that have at least minCount samples, are present in levelOrder (if it is non-NULL) and are not included in ignore.
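The level vs. all encoding described above can be sketched in base R. This is only an illustration of the rule, not the function's actual implementation: the helper name levelVsAllSketch is hypothetical, and the sketch ignores levelOrder, ignore, and the dropping of columns.

```r
# Hypothetical helper illustrating level vs. all indicators;
# not the implementation used by binarizeCategoricalVariable.
levelVsAllSketch = function(x, minCount = 3, val1 = 0, val2 = 1,
                            namePrefix = "", levelSep = ".vs.",
                            nameForAll = "all")
{
  counts = table(x)
  # Keep only levels observed at least minCount times
  keep = names(counts)[counts >= minCount]
  # Start with val1 everywhere, one column per retained level
  out = matrix(val1, nrow = length(x), ncol = length(keep),
               dimnames = list(NULL,
                    paste0(namePrefix, keep, levelSep, nameForAll)))
  # Set val2 for the elements of x that equal the level
  for (i in seq_along(keep))
    out[which(x == keep[i]), i] = val2
  out
}

x = c("A", "A", "A", "B", "B", "B", "C")
ind = levelVsAllSketch(x)
colnames(ind)   # level "C" has fewer than minCount samples and gets no column
```

In this toy input, level "C" occurs only once, so it contributes no column, but its element still receives val1 in the columns for "A" and "B".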
The second type of indicator encodes binary comparisons. For each pair of levels (both with at least minCount samples), the indicator is val2 (usually 1) for the higher level and val1 (usually 0) for the lower level. The level order is given by levelOrder (which defaults to the sorted levels of x), assumed to be sorted in increasing order. Pairwise indicators are created for all levels that have at least minCount samples, are included in levelOrder and are not included in ignore.
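A comparable base R sketch of the pairwise encoding follows. It rests on two assumptions that the text above does not state: elements of x belonging to neither level of a pair are set to NA, and columns are named higher level, separator, lower level. The helper name pairwiseSketch is hypothetical, and the ignore argument is omitted.

```r
# Hypothetical helper illustrating pairwise indicators;
# not the implementation used by binarizeCategoricalVariable.
pairwiseSketch = function(x, levelOrder = sort(unique(x)), minCount = 3,
                          val1 = 0, val2 = 1, levelSep = ".vs.")
{
  counts = table(x)
  # Retain levels, in levelOrder, that have at least minCount samples
  keep = levelOrder[levelOrder %in% names(counts)[counts >= minCount]]
  if (length(keep) < 2) return(NULL)
  # Each column of pairs holds the lower level (row 1) and higher level (row 2)
  pairs = combn(keep, 2)
  # Assumption: elements in neither level of the pair stay NA
  out = matrix(NA_real_, nrow = length(x), ncol = ncol(pairs))
  for (p in seq_len(ncol(pairs))) {
    out[which(x == pairs[1, p]), p] = val1
    out[which(x == pairs[2, p]), p] = val2
  }
  # Assumption: column names read "higher.vs.lower"
  colnames(out) = paste0(pairs[2, ], levelSep, pairs[1, ])
  out
}

x = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
pw = pairwiseSketch(x)
colnames(pw)
```

With three retained levels the sketch produces three columns, one per pair, taken in the order given by levelOrder.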
A matrix containing the indicator variables, one in each column. When includeLevelInformation is TRUE, the attribute includedLevels is a table with one column per output column and two rows, giving the two levels (unique values of x) represented by the column.
Peter Langfelder
Variations and wrappers for this function: binarizeCategoricalColumns for binarizing several columns of a matrix or data frame.
library(WGCNA)  # binarizeCategoricalVariable is provided by the WGCNA package
set.seed(2);
x = sample(c("A", "B", "C"), 15, replace = TRUE);
out = binarizeCategoricalVariable(x, includePairwise = TRUE, includeLevelVsAll = TRUE);
data.frame(x, out);
attr(out, "includedLevels")
# A different naming for level vs. all columns
binarizeCategoricalVariable(x, includeLevelVsAll = TRUE, nameForAll = "");