Imputation of left-censored missing data using stochastic minimal value approach.
Performs the imputation of left-censored missing data by random draws from a Gaussian distribution centered on a minimal value. Consider a peptide/protein expression data matrix with n columns corresponding to biological samples and p rows corresponding to peptides/proteins. For each sample (column), the mean of the Gaussian distribution is set to a minimal value observed in that sample, estimated as the q-th quantile (e.g. q = 0.01) of the observed values in that sample. The standard deviation is estimated as the median of the peptide/protein-wise standard deviations. When estimating the standard deviation, only peptides/proteins with more than 50% recorded values are considered.
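The two estimates described above can be sketched in base R on a toy matrix. This is an illustrative sketch only: the matrix `x` and the names `min_per_sample`, `frac_observed`, `x_kept`, `feature_sd` and `sigma` are assumptions made for the example, not objects provided by the package.

```r
# Toy log-intensity matrix: 8 features (rows) x 4 samples (columns).
set.seed(1)
x <- matrix(rnorm(32, mean = 20, sd = 1), nrow = 8, ncol = 4)
x[c(1, 2), 1] <- NA   # features 1 and 2 are missing in sample 1
x[3, 1:3] <- NA       # feature 3 is observed in only 1 of 4 samples

q <- 0.01

# Per-sample minimal value: the q-th quantile of the observed values.
min_per_sample <- apply(x, 2, quantile, probs = q, na.rm = TRUE)

# Keep features with more than 50% recorded values.
frac_observed <- rowMeans(!is.na(x))
x_kept <- x[frac_observed > 0.5, , drop = FALSE]

# Feature-wise standard deviations. sd() is called without na.rm, so a
# feature with any missing value yields NA and is dropped by the median.
feature_sd <- apply(x_kept, 1, sd)
sigma <- median(feature_sd, na.rm = TRUE)
```

Here `min_per_sample` holds one value per column and `sigma` is a single scalar shared by all samples.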
impute.MinProb(dataSet.mvs, q = 0.01, tune.sigma = 1)
dataSet.mvs | A data matrix containing left-censored missing data.
q | A scalar used to determine a low expression value to be used for missing data imputation.
tune.sigma | A scalar used to control the standard deviation of the Gaussian distribution used for random draws. If the sd is overestimated, then
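As a rough illustration of how tune.sigma acts as a multiplier on the estimated standard deviation, the sketch below compares draws at two settings. The values `min_value` and `sigma_hat` are made-up numbers chosen for the example, not estimates from real data.

```r
set.seed(7)
min_value <- 15   # assumed minimal value for one sample (hypothetical)
sigma_hat <- 0.8  # assumed estimated standard deviation (hypothetical)

# tune.sigma multiplies the estimated standard deviation before drawing.
draws_default <- rnorm(1000, mean = min_value, sd = sigma_hat * 1)
draws_narrow  <- rnorm(1000, mean = min_value, sd = sigma_hat * 0.5)

spread_default <- sd(draws_default)
spread_narrow  <- sd(draws_narrow)
```

A value below 1 narrows the spread of the imputed values around the minimal value, and a value above 1 widens it.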
A complete expression data matrix with missing values imputed.
Cosmin Lazar
# generate expression data matrix
exprsDataObj = generate.ExpressionData(nSamples1 = 6, nSamples2 = 6,
                                       meanSamples = 0, sdSamples = 0.2,
                                       nFeatures = 1000, nFeaturesUp = 50, nFeaturesDown = 50,
                                       meanDynRange = 20, sdDynRange = 1,
                                       meanDiffAbund = 1, sdDiffAbund = 0.2)
exprsData = exprsDataObj[[1]]

# insert 15% missing data with 50% missing not at random (MNAR.rate = 50)
m.THR = quantile(exprsData, probs = 0.15)
sd.THR = 0.1
MNAR.rate = 50
exprsData.MD.obj = insertMVs(exprsData,m.THR,sd.THR,MNAR.rate)
exprsData.MD = exprsData.MD.obj[[2]]

# perform missing data imputation
exprsData.imputed = impute.MinProb(exprsData.MD,0.01,1)

## Not run:
hist(exprsData[,1])
hist(exprsData.MD[,1])
hist(exprsData.imputed[,1])
## End(Not run)

## The function is currently defined as
function (dataSet.mvs, q = 0.01, tune.sigma = 1)
{
    nSamples = dim(dataSet.mvs)[2]
    nFeatures = dim(dataSet.mvs)[1]
    dataSet.imputed = dataSet.mvs
    min.samples = apply(dataSet.imputed, 2, quantile, prob = q, na.rm = T)
    count.NAs = apply(!is.na(dataSet.mvs), 1, sum)
    count.NAs = count.NAs/nSamples
    dataSet.filtered = dataSet.mvs[which(count.NAs > 0.5), ]
    protSD = apply(dataSet.filtered, 1, sd)
    sd.temp = median(protSD, na.rm = T) * tune.sigma
    print(sd.temp)
    for (i in 1:(nSamples)) {
        dataSet.to.impute.temp = rnorm(nFeatures, mean = min.samples[i], sd = sd.temp)
        dataSet.imputed[which(is.na(dataSet.mvs[, i])), i] = dataSet.to.impute.temp[which(is.na(dataSet.mvs[,i]))]
    }
    return(dataSet.imputed)
}