insertMVs

Generates missing data in a complete data matrix.


Description

This function generates missing data in a complete data matrix. Both random and left-censored missing data can be generated. The percentage of all missing data is controlled by mean.THR. The percentage of missing data which are left-censored is controlled by MNAR.rate.

Usage

insertMVs(original, mean.THR, sd.THR, MNAR.rate)

Arguments

original

Original complete data matrix of peptide/protein expression.

mean.THR

Mean of the threshold distribution, which controls the total missing data rate. mean.THR should be set so that the initial thresholding yields a number of NAs equal to the desired total missing value rate. Example: to generate 30% missing data, set mean.THR = quantile(exprsData, probs = 0.3), where exprsData is the complete data matrix.

sd.THR

Standard deviation of the threshold distribution which controls the total missing data rate. sd.THR is usually set to a small value (e.g. 0.1).
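
The role of mean.THR and sd.THR can be illustrated with a minimal sketch of the thresholding step alone. It assumes a simulated stand-in matrix (exprs, mean_thr, sd_thr and thr are illustrative names, not part of the package) and draws one threshold per cell, as the function listing on this page does.

```r
set.seed(1)

# Hypothetical stand-in for a complete expression matrix: 1000 features x 12 samples.
exprs <- matrix(rnorm(1000 * 12, mean = 20, sd = 1), nrow = 1000, ncol = 12)

# Threshold parameters chosen as the argument descriptions suggest.
mean_thr <- quantile(exprs, probs = 0.30)
sd_thr <- 0.1

# One threshold per cell, drawn from a normal distribution N(mean_thr, sd_thr).
thr <- matrix(rnorm(length(exprs), mean_thr, sd_thr), nrow(exprs), ncol(exprs))

# Fraction of cells falling below their threshold; expected to be close to 0.30.
frac_below <- mean(exprs < thr)
print(frac_below)
```

With a small sd.THR the thresholds stay close to the chosen quantile, so the fraction of cells below the threshold should track the requested rate closely.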

MNAR.rate

Percentage (on a 0-100 scale) of missing values (MVs) that are missing not at random (MNAR). Of the values that fall below the initial threshold (NA_rate_TOTAL), a share equal to MNAR.rate is kept as missing; the other values below the threshold are restored to their original values. Next, the remaining (100 - MNAR.rate)% of NA_rate_TOTAL is generated at random positions among the values that are still observed.
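
The split controlled by MNAR.rate can be worked through with plain arithmetic. The sketch below assumes a hypothetical count of 1800 cells below the threshold (15% of a 1000 x 12 matrix) and mirrors the rounding used in the function listing on this page: round() for the MNAR count and floor() for the random count. The variable names are illustrative only.

```r
mnar_rate <- 50    # percent, as passed to MNAR.rate
n_below <- 1800    # hypothetical number of cells below the threshold

# Cells kept as left-censored (MNAR) missing values.
no_mnar <- round(mnar_rate / 100 * n_below)

# Cells set to missing at random positions among the observed values.
no_mcar <- floor((100 - mnar_rate) / 100 * n_below)

print(c(MNAR = no_mnar, random = no_mcar, total = no_mnar + no_mcar))
```

Because one count is rounded and the other is floored, the total can fall one short of the number of cells below the threshold when the products are not whole numbers.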

Value

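A list with three elements, in the order given below. The returned list is unnamed, so its elements are accessed by position (e.g. [[2]] for the matrix with missing data):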

original

Original complete data matrix

original.mvs

Data matrix derived from the original by generating missing data

pNaNs

The proportion of missing data generated in the original complete dataset, expressed as a fraction between 0 and 1
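
Since the function listing on this page returns list(original, originalNaNs_adjusted, pNaNs) without names, one convenient pattern is to attach the documented names after the call. The sketch below uses a small hand-built stand-in list (complete, with_mvs and res are hypothetical) instead of a real insertMVs() result, so it runs without the package.

```r
# Hypothetical stand-in with the same shape as the value returned by insertMVs().
complete <- matrix(1:6, nrow = 3)
with_mvs <- complete
with_mvs[2, 1] <- NA
res <- list(complete, with_mvs, mean(is.na(with_mvs)))

# Attach the documented names so elements can be accessed with `$`.
names(res) <- c("original", "original.mvs", "pNaNs")

# pNaNs is stored as a fraction; multiply by 100 for a percentage.
pct_missing <- 100 * res$pNaNs
print(pct_missing)
```

The same names() assignment should apply to an actual result, assuming the return value keeps the three-element order shown in the listing.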

Author(s)

Cosmin Lazar

Examples

# generate expression data matrix
exprsDataObj = generate.ExpressionData(nSamples1 = 6, nSamples2 = 6,
                          meanSamples = 0, sdSamples = 0.2,
                          nFeatures = 1000, nFeaturesUp = 50, nFeaturesDown = 50,
                          meanDynRange = 20, sdDynRange = 1,
                          meanDiffAbund = 1, sdDiffAbund = 0.2)
exprsData = exprsDataObj[[1]]
  
# insert 15% missing data with 50% missing not at random

m.THR = quantile(exprsData, probs = 0.15)
sd.THR = 0.1
MNAR.rate = 50
exprsData.MD.obj = insertMVs(exprsData,m.THR,sd.THR,MNAR.rate)
exprsData.MD = exprsData.MD.obj[[2]]

## Not run: 
hist(exprsData[,1])
hist(exprsData.MD[,1])
# hist(exprsData.imputed[,1])  # requires an imputed matrix, which this example does not create

## End(Not run)

## The function is currently defined as
function (original, mean.THR, sd.THR, MNAR.rate) 
{
    originalNaNs = original
    nProt = nrow(original)
    nSamples = ncol(original)
    thr = matrix(rnorm(nSamples * nProt, mean.THR, sd.THR), nProt, 
        nSamples)
    indices.MNAR = which(original < thr)
    no.MNAR = round(MNAR.rate/100 * length(indices.MNAR))
    temp = matrix(original, 1, nSamples * nProt)
    temp[sample(indices.MNAR, no.MNAR)] = NaN
    indices.MCAR = which(!is.na(temp))
    no.MCAR = floor((100 - MNAR.rate)/100 * length(indices.MNAR))
    print(no.MCAR + no.MNAR)
    temp[sample(indices.MCAR, no.MCAR)] = NaN
    originalNaNs = matrix(temp, nProt, nSamples)
    originalNaNs_adjusted = originalNaNs
    noNaNs_Var = rowSums(is.na(originalNaNs))
    allNaNs_Vars = which(noNaNs_Var == nSamples)
    sampleIndexToReplace = sample(1:nSamples, length(allNaNs_Vars), 
        replace = T)
    for (i in 0:length(sampleIndexToReplace)) {
        originalNaNs_adjusted[allNaNs_Vars[i], sampleIndexToReplace[i]] = original[allNaNs_Vars[i], 
            sampleIndexToReplace[i]]
    }
    pNaNs = length(which(is.na(originalNaNs_adjusted)))/(nSamples * 
        nProt)
    print(pNaNs)
    return(list(original, originalNaNs_adjusted, pNaNs))
  }
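
For readers who want to experiment without installing the package, the procedure can be approximated in base R. The sketch below is not the package code: insert_mvs_sketch is a hypothetical name, and it deliberately differs from the listing by using NA instead of NaN, returning a named list, omitting the print() calls, and sampling positions with sample.int() so that a single candidate index is not expanded to 1:x by sample().

```r
# Sketch of the same procedure; `insert_mvs_sketch` is a hypothetical name.
insert_mvs_sketch <- function(original, mean.THR, sd.THR, MNAR.rate) {
  n_prot <- nrow(original)
  n_samples <- ncol(original)

  # Per-cell thresholds; cells below their threshold are left-censoring candidates.
  thr <- matrix(rnorm(n_samples * n_prot, mean.THR, sd.THR), n_prot, n_samples)
  idx_below <- which(original < thr)

  with_mvs <- original
  n_mnar <- round(MNAR.rate / 100 * length(idx_below))
  n_mcar <- floor((100 - MNAR.rate) / 100 * length(idx_below))

  # Keep a share of the candidates as left-censored missing values.
  with_mvs[idx_below[sample.int(length(idx_below), n_mnar)]] <- NA

  # Random missingness is drawn from every cell that is still observed.
  idx_obs <- which(!is.na(with_mvs))
  with_mvs[idx_obs[sample.int(length(idx_obs), n_mcar)]] <- NA

  # Restore one original value in each row that ended up entirely missing.
  for (i in which(rowSums(is.na(with_mvs)) == n_samples)) {
    j <- sample.int(n_samples, 1)
    with_mvs[i, j] <- original[i, j]
  }

  list(original = original, original.mvs = with_mvs,
       pNaNs = mean(is.na(with_mvs)))
}

set.seed(42)
x <- matrix(rnorm(1000 * 12, mean = 20, sd = 1), nrow = 1000)
res <- insert_mvs_sketch(x, quantile(x, probs = 0.15), 0.1, 50)
print(res$pNaNs)
```

Looping over which(...) directly avoids the 0:length(...) idiom in the listing, since an empty index vector simply skips the loop.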

imputeLCMD

A collection of methods for left-censored missing data imputation

Version: 2.0
License: GPL (>= 2)
Authors: Cosmin Lazar
Initial release: 2015-01-18
