k-Nearest Neighbour Imputation
k-Nearest Neighbour Imputation based on a variation of the Gower Distance for numerical, categorical, ordered and semi-continous variables.
kNN( data, variable = colnames(data), metric = NULL, k = 5, dist_var = colnames(data), weights = NULL, numFun = median, catFun = maxCat, makeNA = NULL, NAcond = NULL, impNA = TRUE, donorcond = NULL, mixed = vector(), mixed.constant = NULL, trace = FALSE, imp_var = TRUE, imp_suffix = "imp", addRF = FALSE, onlyRF = FALSE, addRandom = FALSE, useImputedDist = TRUE, weightDist = FALSE )
data |
data.frame or matrix |
variable |
variables where missing values should be imputed |
metric |
metric to be used for calculating the distances between |
k |
number of Nearest Neighbours used |
dist_var |
names or variables to be used for distance calculation |
weights |
weights for the variables for distance calculation.
If |
numFun |
function for aggregating the k Nearest Neighbours in the case of a numerical variable |
catFun |
function for aggregating the k Nearest Neighbours in the case of a categorical variable |
makeNA |
list of length equal to the number of variables, with values, that should be converted to NA for each variable |
NAcond |
list of length equal to the number of variables, with a condition for imputing a NA |
impNA |
TRUE/FALSE whether NA should be imputed |
donorcond |
condition for the donors e.g. list(">5"), must be NULL or a list of same length as variable |
mixed |
names of mixed variables |
mixed.constant |
vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability |
trace |
TRUE/FALSE if additional information about the imputation process should be printed |
imp_var |
TRUE/FALSE if a TRUE/FALSE variables for each imputed variable should be created show the imputation status |
imp_suffix |
suffix for the TRUE/FALSE variables showing the imputation status |
addRF |
TRUE/FALSE each variable will be modelled using random forest regression ( |
onlyRF |
TRUE/FALSE if TRUE only additional distance variables created from random forest regression will be used as distance variables. |
addRandom |
TRUE/FALSE if an additional random variable should be added for distance calculation |
useImputedDist |
TRUE/FALSE if an imputed value should be used for distance calculation for imputing another variable. Be aware that this results in a dependency on the ordering of the variables. |
weightDist |
TRUE/FALSE if the distances of the k nearest neighbours should be used as weights in the aggregation step |
the imputed data set.
Alexander Kowarik, Statistik Austria
A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.
Other imputation methods:
hotdeck()
,
irmi()
,
matchImpute()
,
rangerImpute()
,
regressionImp()
data(sleep) kNN(sleep) library(laeken) kNN(sleep, numFun = weightedMean, weightDist=TRUE)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.