Iterative filtering of samples and genes with too many missing entries across multiple data sets
This function checks data for missing entries and zero variance across multiple data sets and returns a list of samples and genes that pass criteria on the maximum number of missing values. If weights are given, entries whose relative weight is below a threshold will be considered missing. The filtering is iterated until convergence.
goodSamplesGenesMS( multiExpr, multiWeights = NULL, minFraction = 1/2, minNSamples = ..minNSamples, minNGenes = ..minNGenes, tol = NULL, minRelativeWeight = 0.1, verbose = 2, indent = 0)
multiExpr |
expression data in the multi-set format (see |
multiWeights |
optional observation weights in the same format (and dimensions) as |
minFraction |
minimum fraction of non-missing samples for a gene to be considered good. |
minNSamples |
minimum number of non-missing samples for a gene to be considered good. |
minNGenes |
minimum number of good genes for the data set to be considered fit for analysis. If the actual number of good genes falls below this threshold, an error will be issued. |
tol |
an optional 'small' number to compare the variance against. For each set in |
minRelativeWeight |
observations whose relative weight is below this threshold will be considered missing. Here relative weight is weight divided by the maximum weight in the column (gene). |
verbose |
integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose. |
indent |
indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces. |
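The relative-weight rule described for minRelativeWeight can be illustrated in base R. This is a minimal sketch on made-up toy data, not the package's internal code; the object names (expr, weights, relWeights, exprMasked) are hypothetical.

```r
# Hypothetical toy data: 4 samples (rows) x 3 genes (columns).
expr <- matrix(c(1.2, 0.8, 1.5, 1.1,
                 2.0, 2.3, 1.9, 2.1,
                 0.5, 0.7, 0.6, 0.4), nrow = 4)
weights <- matrix(c(1.0, 0.05, 0.8, 0.9,
                    0.5, 0.5,  0.5, 0.02,
                    2.0, 1.0,  0.1, 0.5), nrow = 4)
minRelativeWeight <- 0.1

# Relative weight: each weight divided by the maximum weight in its column (gene).
colMax <- apply(weights, 2, max, na.rm = TRUE)
relWeights <- sweep(weights, 2, colMax, "/")

# Observations whose relative weight is below the threshold are treated as missing.
exprMasked <- expr
exprMasked[relWeights < minRelativeWeight] <- NA

# Number of entries per gene that now count as missing.
nMissingPerGene <- colSums(is.na(exprMasked))
```

Because the weights are scaled column by column, the same absolute weight can count as present for one gene and as missing for another.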
This function iteratively identifies samples and genes with too many missing entries, and genes with
zero variance; iterations are necessary
since excluding samples effectively changes criteria on genes and vice versa. The process is
repeated until the lists of good samples and genes are stable. If weights are given, entries whose relative
weight (i.e., weight divided by maximum weight in the column or gene)
is below a threshold will be considered missing.
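The alternation between gene and sample filtering can be sketched in base R as follows. This is a simplified illustration, not the WGCNA source: the function name iterativeFilter is hypothetical, weights and the tol argument are ignored, and the sample criterion (at least minFraction of the good genes non-missing) is an assumption made for the example.

```r
# Simplified sketch of iterated filtering; NOT the WGCNA implementation.
# multiExpr is assumed to be a list of sets, each a list with a `data` matrix
# (samples in rows, genes in columns), all sets having the same genes.
iterativeFilter <- function(multiExpr, minFraction = 1/2,
                            minNSamples = 4, minNGenes = 4) {
  goodGenes <- rep(TRUE, ncol(multiExpr[[1]]$data))
  goodSamples <- lapply(multiExpr, function(set) rep(TRUE, nrow(set$data)))
  repeat {
    oldGenes <- goodGenes
    oldSamples <- goodSamples
    # Gene step: within the currently good samples of every set, a gene needs
    # enough non-missing values and a non-zero variance.
    for (s in seq_along(multiExpr)) {
      x <- multiExpr[[s]]$data[goodSamples[[s]], , drop = FALSE]
      nPresent <- colSums(!is.na(x))
      v <- apply(x, 2, var, na.rm = TRUE)
      ok <- nPresent >= minNSamples & nPresent >= minFraction * nrow(x) &
        !is.na(v) & v > 0
      goodGenes <- goodGenes & ok
    }
    if (sum(goodGenes) < minNGenes) stop("Too few genes pass the filter.")
    # Sample step (assumed criterion): enough of the good genes are non-missing.
    for (s in seq_along(multiExpr)) {
      x <- multiExpr[[s]]$data[, goodGenes, drop = FALSE]
      nPresent <- rowSums(!is.na(x))
      goodSamples[[s]] <- goodSamples[[s]] & nPresent >= minFraction * ncol(x)
    }
    # Stop once neither list changed during a full pass.
    if (identical(goodGenes, oldGenes) && identical(goodSamples, oldSamples)) break
  }
  list(goodSamples = goodSamples, goodGenes = goodGenes)
}

# Toy usage with two made-up sets of 8 samples and 6 genes.
set.seed(42)
e1 <- matrix(rnorm(8 * 6), nrow = 8)
e2 <- matrix(rnorm(8 * 6), nrow = 8)
e1[, 2] <- NA      # gene 2 is entirely missing in set 1
e2[3, 1:5] <- NA   # sample 3 of set 2 is mostly missing
res <- iterativeFilter(list(list(data = e1), list(data = e2)))
```

Both lists can only shrink from one pass to the next, so the loop terminates; removing a sample changes the counts used in the gene step, which is why a single pass is not enough in general.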
The constants ..minNSamples and ..minNGenes are both set to the value 4.
A list with the following components:
goodSamples |
A list with one component per given set. Each component is a logical vector with
one entry per sample in the corresponding set that is |
goodGenes |
A logical vector with one entry per gene that is |
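A possible way to apply the returned components is sketched below. The simulated data and the object names (expr1, expr2, gsg, cleaned) are made up for illustration, and the multi-set input is assumed to be a list with one element per set, each holding a data component with samples in rows and genes in columns. The block runs only if the WGCNA package is installed.

```r
# Illustrative usage; the simulated data and object names are hypothetical.
if (requireNamespace("WGCNA", quietly = TRUE)) {
  library(WGCNA)

  set.seed(1)
  expr1 <- matrix(rnorm(10 * 20), nrow = 10, ncol = 20)  # 10 samples, 20 genes
  expr2 <- matrix(rnorm(8 * 20), nrow = 8, ncol = 20)    # 8 samples, same 20 genes
  expr1[, 3] <- NA                                       # gene 3 missing throughout set 1

  # Assumed multi-set format: one list element per set with a `data` component.
  multiExpr <- list(set1 = list(data = expr1), set2 = list(data = expr2))

  gsg <- goodSamplesGenesMS(multiExpr, verbose = 0)

  # Keep the good samples of each set and the genes that are good across sets.
  cleaned <- lapply(seq_along(multiExpr), function(s) {
    list(data = multiExpr[[s]]$data[gsg$goodSamples[[s]], gsg$goodGenes, drop = FALSE])
  })
}
```

Since goodGenes is a single vector shared by all sets, every cleaned set keeps the same genes, while the sample selection differs from set to set.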
Peter Langfelder
goodGenes, goodSamples, goodSamplesGenes for cleaning individual sets separately; goodSamplesMS, goodGenesMS for additional cleaning of multiple data sets together.