Estimate best number of Components for missing value estimation
Perform cross validation to estimate the optimal number of components for missing value estimation. Cross validation is done for the complete subset of a variable.
kEstimate(Matrix, method = "ppca", evalPcs = 1:3, segs = 3, nruncv = 5, em = "q2", allVariables = FALSE, verbose = interactive(), ...)
Matrix |
|
method |
|
evalPcs |
|
segs |
|
nruncv |
|
em |
|
allVariables |
|
verbose |
|
... |
Further arguments to |
The assumption hereby is that variables that are highly correlated in a distinct region (here the non-missing observations) are also correlated in another (here the missing observations). This also implies that the complete subset must be large enough to be representative. For each incomplete variable, the available values are divided into a user defined number of cv-segments. The segments have equal size, but are chosen from a random equal distribution. The non-missing values of the variable are covered completely. PPCA, BPCA, SVDimpute, Nipals PCA, llsImpute an NLPCA may be used for imputation.
The whole cross validation is repeated several times so, depending on the parameters, the calculations can take very long time. As error measure the NRMSEP (see Feten et. al, 2005) or the Q2 distance is used. The NRMSEP basically normalises the RMSD between original data and estimate by the variable-wise variance. The reason for this is that a higher variance will generally lead to a higher estimation error. If the number of samples is small, the variable - wise variance may become an unstable criterion and the Q2 distance should be used instead. Also if variance normalisation was applied previously.
The method proceeds variable - wise, the NRMSEP / Q2 distance is
calculated for each incomplete variable and averaged
afterwards. This allows to easily see for wich set of variables
missing value imputation makes senes and for wich set no
imputation or something like mean-imputation should be used. Use
kEstimateFast
or Q2
if you are not interested in
variable wise CV performance estimates.
Run time may be very high on large data sets. Especially when used with complex methods like BPCA or Nipals PCA. For PPCA, BPCA, Nipals PCA and NLPCA the estimation method is called (v\_miss * segs * nruncv) times as the error for all numbers of principal components can be calculated at once. For LLSimpute and SVDimpute this is not possible, and the method is called (v\_miss * segs * nruncv * length(evalPcs)) times. This should still be fast for LLSimpute because the method allows to choose to only do the estimation for one particular variable. This saves a lot of iterations. Here, v\_miss is the number of variables showing missing values.
As cross validation is done variable-wise, in this function Q2 is defined on single variables, not on the entire data set. This is Q2 is calculated as as sum(x - xe)^2 \ sum(x^2), where x is the currently used variable and xe it's estimate. The values are then averaged over all variables. The NRMSEP is already defined variable-wise. For a single variable it is then sqrt(sum(x - xe)^2 \ (n * var(x))), where x is the variable and xe it's estimate, n is the length of x. The variable wise estimation errors are returned in parameter variableWiseError.
A list with:
bestNPcs |
number of PCs or k for which the minimal average NRMSEP or the maximal Q2 was obtained. |
eError |
an array of of size length(evalPcs). Contains the average error of the cross validation runs for each number of components. |
variableWiseError |
Matrix of size
|
evalPcs |
The evaluated numbers of components or number of neighbours (the same as the evalPcs input parameter). |
variableIx |
Index of the incomplete variables. This can be used to map the variable wise error to the original data. |
Wolfram Stacklies
kEstimateFast, Q2, pca, nni
.
## Load a sample metabolite dataset with 5\% missing values (metaboliteData) data(metaboliteData) # Do cross validation with ppca for component 2:4 esti <- kEstimate(metaboliteData, method = "ppca", evalPcs = 2:4, nruncv=1, em="nrmsep") # Plot the average NRMSEP barplot(drop(esti$eError), xlab = "Components",ylab = "NRMSEP (1 iterations)") # The best result was obtained for this number of PCs: print(esti$bestNPcs) # Now have a look at the variable wise estimation error barplot(drop(esti$variableWiseError[, which(esti$evalPcs == esti$bestNPcs)]), xlab = "Incomplete variable Index", ylab = "NRMSEP")
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.