Hot deck imputation
Hot-deck imputation methods include random and sequential hot deck, k-nearest neighbours imputation and predictive mean matching.
impute_rhd(
  dat,
  formula,
  pool = c("complete", "univariate", "multivariate"),
  prob,
  backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")),
  ...
)

impute_shd(
  dat,
  formula,
  pool = c("complete", "univariate", "multivariate"),
  order = c("locf", "nocb"),
  backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")),
  ...
)

impute_pmm(
  dat,
  formula,
  predictor = impute_lm,
  pool = c("complete", "univariate", "multivariate"),
  ...
)

impute_knn(
  dat,
  formula,
  pool = c("complete", "univariate", "multivariate"),
  k = 5,
  backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")),
  ...
)
dat |
|
formula |
|
pool |
|
prob |
|
backend |
|
... |
further arguments passed to
|
order |
|
predictor |
|
k |
|
Formulas are of the form
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]
The left-hand-side of the formula object lists the variable or variables to be imputed. The interpretation of the independent variables on the right-hand-side depends on the imputation method.
impute_rhd
Variables in MODEL_SPECIFICATION
and/or
GROUPING_VARIABLES
are used to split the data set into groups prior to
imputation. Use ~ 1
to specify that no grouping is to be applied.
impute_shd
Variables in MODEL_SPECIFICATION
are used to
sort the data. When multiple variables are specified, each variable after
the first serves as tie-breaker for the previous one.
impute_knn
The predictors are used to determine Gower's distance
between records (see gower_topn
). This may include the
variables to be imputed.
impute_pmm
Predictive mean matching. The
MODEL_SPECIFICATION
is passed through to the predictor
function.
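As a hedged illustration of how the same formula syntax is read by each method, the sketch below calls the four functions on a copy of iris with a few values blanked out. The data preparation and the choice of variables are assumptions made for this example; the calls are skipped when simputation is not installed.

```r
# Toy data (illustrative only): iris with three missing Sepal.Length values.
dat <- iris
dat[1:3, "Sepal.Length"] <- NA

if (requireNamespace("simputation", quietly = TRUE)) {
  library(simputation)

  # impute_rhd: right-hand-side variables define the groups to sample within.
  out_rhd <- impute_rhd(dat, Sepal.Length ~ Species)

  # impute_shd: right-hand-side variables sort the data;
  # Petal.Width breaks ties in Petal.Length.
  out_shd <- impute_shd(dat, Sepal.Length ~ Petal.Length + Petal.Width)

  # impute_knn: right-hand-side variables enter the distance computation.
  out_knn <- impute_knn(dat, Sepal.Length ~ Petal.Length + Petal.Width, k = 5)

  # impute_pmm: the model specification is handed to the predictor (impute_lm).
  out_pmm <- impute_pmm(dat, Sepal.Length ~ Petal.Length + Species)
}
```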
If grouping variables are specified, the data set is split according to the values of those variables, and model estimation and imputation occur independently for each group.
Grouping using dplyr::group_by
is also supported. If groups are
defined in both the formula and using dplyr::group_by
, the data is
grouped by the union of grouping variables. Any missing value in one of the
grouping variables results in an error.
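The two ways of specifying groups can be sketched as follows. This is a minimal example under the assumption that Species (which has no missing values here) is the grouping variable; the dplyr variant only runs when dplyr is installed.

```r
# Toy data (illustrative only): one missing Sepal.Length per species.
dat <- iris
dat[c(1, 60, 120), "Sepal.Length"] <- NA

if (requireNamespace("simputation", quietly = TRUE)) {
  library(simputation)

  # Groups given in the formula, after the '|'.
  by_formula <- impute_shd(dat, Sepal.Length ~ Petal.Length | Species)

  # Groups given through dplyr::group_by on the input data.
  if (requireNamespace("dplyr", quietly = TRUE)) {
    by_dplyr <- impute_shd(dplyr::group_by(dat, Species),
                           Sepal.Length ~ Petal.Length)
  }
}
```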
Random hot deck imputation with impute_rhd
can be applied to
numeric, categorical or mixed data. A missing value is copied from a sampled
record. Optionally, samples are taken within a group, or with non-uniform
sampling probabilities. See Andridge and Little (2010) for an overview
of hot deck imputation methods.
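The idea of random hot deck imputation within groups can be sketched in base R as below. This is not the simputation implementation: rhd_sketch is a hypothetical helper written for this example, it handles a single variable, and it samples donors with uniform probability only.

```r
# Hypothetical helper: random hot deck for one vector, sampled within groups.
rhd_sketch <- function(x, group = rep(1L, length(x))) {
  for (g in unique(group)) {
    in_group <- group == g
    donors <- x[in_group & !is.na(x)]          # observed values in this group
    recipients <- which(in_group & is.na(x))   # positions to fill
    if (length(donors) > 0 && length(recipients) > 0) {
      # Sample donor positions (with replacement) and copy their values.
      pick <- sample.int(length(donors), length(recipients), replace = TRUE)
      x[recipients] <- donors[pick]
    }
  }
  x
}

set.seed(1)
x <- c(1, NA, 3, 10, NA, 12)
g <- c("a", "a", "a", "b", "b", "b")
imputed <- rhd_sketch(x, g)
```

Each missing value is replaced by an observed value from its own group, so imputed[2] is drawn from group "a" and imputed[5] from group "b".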
Sequential hot deck imputation with impute_shd
can be applied
to numeric, categorical, or mixed data. The dataset is sorted using the
‘predictor variables’. Missing values or combinations thereof are copied
from the previous record where the value(s) are available in the case
of LOCF and from the next record in the case of NOCB.
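A base R sketch of the sort-then-copy idea, for a single variable, is given below. shd_sketch and its direction argument are hypothetical names used for illustration only and do not reproduce the package's handling of pools or of missing values in the sorting variables.

```r
# Hypothetical helper: sequential hot deck for one vector.
# direction = "locf" copies from the previous record in sort order,
# direction = "nocb" copies from the next record in sort order.
shd_sketch <- function(x, sort_by, direction = c("locf", "nocb")) {
  direction <- match.arg(direction)
  idx <- order(sort_by)
  # Copying from the next record is the same as copying from the
  # previous record after reversing the sort order.
  if (direction == "nocb") idx <- rev(idx)
  xs <- x[idx]
  for (i in seq_along(xs)[-1]) {
    if (is.na(xs[i])) xs[i] <- xs[i - 1]
  }
  x[idx] <- xs   # write values back in the original record order
  x
}

x <- c(5, NA, 7, NA)
s <- c(2, 1, 4, 3)
locf <- shd_sketch(x, s, "locf")
nocb <- shd_sketch(x, s, "nocb")
```

In this sketch the record that comes first in sort order has no previous record, so under "locf" its missing value stays missing.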
Predictive mean matching with impute_pmm
can be applied to
numeric data. Missing values or combinations thereof are first imputed using
a predictive model. Next, these predictions are replaced with observed
(combinations of) values nearest to the prediction. The nearest value is the
observed value with the smallest absolute deviation from the prediction.
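The two steps described above (predict, then replace the prediction by the nearest observed value) can be sketched in base R as follows. pmm_sketch is a hypothetical helper that assumes a single numeric target and uses lm as the predictive model; it is an illustration, not the simputation code.

```r
# Hypothetical helper: predictive mean matching for one numeric variable.
pmm_sketch <- function(dat, formula) {
  target <- all.vars(formula)[1]        # left-hand-side variable
  fit <- lm(formula, data = dat)        # rows with a missing target are dropped
  pred <- predict(fit, newdata = dat)   # predictions for every row
  is_missing <- is.na(dat[[target]])
  observed <- dat[[target]][!is_missing]
  for (i in which(is_missing & !is.na(pred))) {
    # Replace the prediction by the observed value closest to it.
    dat[[target]][i] <- observed[which.min(abs(observed - pred[i]))]
  }
  dat
}

dat <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, NA, 8.2, 9.8))
out <- pmm_sketch(dat, y ~ x)
```

Here the linear model predicts 6.0 for the third record, and the observed value with the smallest absolute deviation from that prediction is 3.9, so out$y[3] is 3.9.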
K-nearest neighbour imputation with impute_knn
can be applied
to numeric, categorical, or mixed data. For each record containing missing
values, the k most similar completed records are determined based on
Gower's (1971) similarity coefficient. From these records the actual donor is
sampled.
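A simplified base R sketch of this procedure is shown below. knn_sketch is a hypothetical helper: it computes a Gower-style distance (absolute difference divided by the range for numeric variables, a 0/1 mismatch for other variables, averaged over the predictors), assumes no predictor is constant, and samples one donor from the k nearest complete records. The package itself relies on gower_topn.

```r
# Hypothetical helper: k-nearest-neighbour hot deck for one target variable.
knn_sketch <- function(dat, target, predictors, k = 5) {
  donors <- which(complete.cases(dat[, c(target, predictors), drop = FALSE]))
  if (length(donors) == 0) return(dat)
  for (i in which(is.na(dat[[target]]))) {
    # Per-variable distance contributions between record i and each donor.
    per_var <- vapply(predictors, function(v) {
      col <- dat[[v]]
      if (is.numeric(col)) {
        abs(col[donors] - col[i]) / diff(range(col, na.rm = TRUE))
      } else {
        as.numeric(as.character(col[donors]) != as.character(col[i]))
      }
    }, numeric(length(donors)))
    d <- rowMeans(matrix(per_var, nrow = length(donors)))
    nearest <- donors[order(d)][seq_len(min(k, length(donors)))]
    # Sample the actual donor among the k nearest records.
    dat[[target]][i] <- dat[[target]][nearest[sample.int(length(nearest), 1)]]
  }
  dat
}

set.seed(1)
dat <- data.frame(x = c(1, 2, 3, 10, 11),
                  grp = c("a", "a", "a", "b", "b"),
                  y = c(10, NA, 30, 100, 110))
out <- knn_sketch(dat, "y", c("x", "grp"), k = 2)
```

With k = 2 the two nearest donors for the second record are records 1 and 3, so the imputed value is either 10 or 30.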
The VIM package has efficient implementations of several popular imputation methods. In particular, its random and sequential hot deck implementation is faster and more memory-efficient than that of the current package. Moreover, VIM offers more fine-grained control over the imputation process than simputation.
If you have this package installed, it can be used by setting
backend="VIM"
for functions supporting this option. Alternatively, one
can set options(simputation.hdbackend="VIM")
so it becomes the
default.
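Both ways of selecting the backend might look as follows. This is a sketch that assumes simputation and VIM are installed (the calls are skipped otherwise); the data and variable choices are illustrative.

```r
# Toy data (illustrative only).
dat <- iris
dat[1:3, "Sepal.Length"] <- NA

if (requireNamespace("simputation", quietly = TRUE) &&
    requireNamespace("VIM", quietly = TRUE)) {
  library(simputation)

  # Select the VIM backend for a single call.
  out_call <- impute_shd(dat, Sepal.Length ~ Petal.Length, backend = "VIM")

  # Make VIM the default backend, then restore the previous setting.
  old <- options(simputation.hdbackend = "VIM")
  out_default <- impute_shd(dat, Sepal.Length ~ Petal.Length)
  options(old)
}
```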
Simputation will map the simputation call to a function in the VIM package. In particular:
impute_rhd
is mapped to VIM::hotdeck
where imputed
variables are passed to the variable
argument and the union of
predictor and grouping variables is passed to domain_var
.
Extra arguments in ...
are passed to VIM::hotdeck
as well.
Argument pool
is ignored.
impute_shd
is mapped to VIM::hotdeck
where
imputed variables are passed to the variable
argument, predictor
variables to ord_var
and grouping variables to domain_var
.
Extra arguments in ...
are passed to VIM::hotdeck
as well.
Arguments pool
and order
are ignored. In VIM
the donor pool
is determined on a per-variable basis, equivalent to setting pool="univariate"
with the simputation backend. VIM is LOCF-based. Differences between
simputation and VIM
are likely to occur when the sorting variables contain missing values.
impute_knn
is mapped to VIM::kNN
where imputed variables
are passed to variable
, predictor variables are passed to dist_var
and grouping variables are ignored with a message.
Extra arguments in ...
are passed to VIM::kNN
as well.
Argument pool
is ignored.
Note that simputation adheres strictly to Gower's original
definition of the distance measure, while VIM uses a generalized variant
that can take ordered factors into account.
By default, VIM's imputation functions add indicator variables to the
original data to trace what values have been imputed. This is switched off by
default for consistency with the rest of the simputation package, but it may
be turned on again by setting imp_var=TRUE
.
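As a hedged example, the indicator variables could be requested like this when the VIM backend is in use. The sketch assumes both simputation and VIM are installed and only inspects which columns were added, without assuming their names.

```r
# Toy data (illustrative only).
dat <- iris
dat[1:3, "Sepal.Length"] <- NA

if (requireNamespace("simputation", quietly = TRUE) &&
    requireNamespace("VIM", quietly = TRUE)) {
  library(simputation)

  # imp_var = TRUE is passed on to the VIM function through '...'.
  out <- impute_rhd(dat, Sepal.Length ~ Species,
                    backend = "VIM", imp_var = TRUE)

  # Columns present in the result but not in the input.
  print(setdiff(names(out), names(dat)))
}
```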
Andridge, R.R. and Little, R.J., 2010. A review of hot deck imputation for survey non-response. International Statistical Review, 78(1), pp.40-64.
Gower, J.C., 1971. A general coefficient of similarity and some of its properties. Biometrics, 27(4), pp.857-871.
Other imputation:
impute_cart()
,
impute_lm()
,
impute()