WGCNA: goodSamplesGenesMS – R documentation

goodSamplesGenesMS

Iterative filtering of samples and genes with too many missing entries across multiple data sets

Description

This function checks data for missing entries and zero variance across multiple data sets and returns a list of samples and genes that pass criteria on the maximum number of missing values. If weights are given, entries whose relative weight is below a threshold will be considered missing. The filtering is iterated until convergence.

Usage

goodSamplesGenesMS(
  multiExpr,
  multiWeights = NULL,
  minFraction = 1/2,
  minNSamples = ..minNSamples,
  minNGenes = ..minNGenes,
  tol = NULL,
  minRelativeWeight = 0.1,
  verbose = 2, indent = 0)

Arguments

`multiExpr`	expression data in the multi-set format (see `checkSets`). A vector of lists, one per set. Each set must contain a component `data` that contains the expression data, with rows corresponding to samples and columns to genes or probes.
`multiWeights`	optional observation weights in the same format (and dimensions) as `multiExpr`.
`minFraction`	minimum fraction of non-missing samples for a gene to be considered good.
`minNSamples`	minimum number of non-missing samples for a gene to be considered good.
`minNGenes`	minimum number of good genes for the data set to be considered fit for analysis. If the actual number of good genes falls below this threshold, an error will be issued.
`tol`	an optional 'small' number to compare the variance against. For each set in `multiExpr`, the default value is `1e-10 * max(abs(multiExpr[[set]]$data), na.rm = TRUE)`. The reason for comparing the variance to this number, rather than to zero, is that the fast way of computing variance used by this function sometimes causes small numerical round-off errors which make the variance of constant vectors slightly non-zero; comparing the variance to `tol` rather than to zero prevents such genes from being retained as 'good genes'.
`minRelativeWeight`	observations whose relative weight is below this threshold will be considered missing. Here relative weight is weight divided by the maximum weight in the column (gene).
`verbose`	integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
`indent`	indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.
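
The input format, the default `tol`, and the relative-weight rule described above can be illustrated with a minimal base R sketch. The toy matrices, the weight matrix `w1`, and all variable names are hypothetical and chosen only for illustration; the sketch does not call WGCNA and is not its internal code.

```r
# Two toy data sets sharing the same 5 genes (columns); rows are samples.
set.seed(1)
genes <- paste0("gene", 1:5)
expr1 <- matrix(rnorm(6 * 5), nrow = 6,
                dimnames = list(paste0("s1_", 1:6), genes))
expr2 <- matrix(rnorm(4 * 5), nrow = 4,
                dimnames = list(paste0("s2_", 1:4), genes))

# Multi-set format: one list per set, each with a `data` component.
multiExpr <- list(set1 = list(data = expr1), set2 = list(data = expr2))

# Default tolerance per set, following the formula given for `tol`.
tolPerSet <- sapply(multiExpr,
                    function(set) 1e-10 * max(abs(set$data), na.rm = TRUE))

# Hypothetical weights for set 1: all ones except one low-weight observation.
w1 <- matrix(1, nrow = 6, ncol = 5)
w1[2, 3] <- 0.05

# Relative weight = weight divided by the maximum weight in the column (gene).
relW1 <- sweep(w1, 2, apply(w1, 2, max, na.rm = TRUE), "/")

# Entries whose relative weight is below the threshold are treated as missing.
minRelativeWeight <- 0.1
masked1 <- expr1
masked1[relW1 < minRelativeWeight] <- NA
```

Here only the entry in row 2, column 3 of set 1 falls below the threshold, so it is the only one treated as missing.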

Details

This function iteratively identifies samples and genes with too many missing entries, and genes with zero variance; iterations are necessary since excluding samples effectively changes criteria on genes and vice versa. The process is repeated until the lists of good samples and genes are stable. If weights are given, entries whose relative weight (i.e., weight divided by maximum weight in the column or gene) is below a threshold will be considered missing. The constants ..minNSamples and ..minNGenes are both set to the value 4.
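
The alternation between gene and sample filtering can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the WGCNA implementation: the function `iterateFilter` and the helper `safeVar` are hypothetical names, the input is a plain list of matrices rather than the multi-set format, a single fixed `tol` is used for all sets, and the exact thresholds applied to samples are an assumption.

```r
# Variance of the non-missing values; NA if fewer than two are present.
safeVar <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) < 2) NA_real_ else var(v)
}

# Simplified illustration only; not the WGCNA source code.
iterateFilter <- function(multiData, minFraction = 1/2, minNSamples = 4,
                          minNGenes = 4, tol = 1e-10) {
  goodGenes <- rep(TRUE, ncol(multiData[[1]]))
  goodSamples <- lapply(multiData, function(x) rep(TRUE, nrow(x)))
  repeat {
    oldGenes <- goodGenes
    oldSamples <- goodSamples
    # Genes must have enough present values and non-trivial variance in every set.
    for (s in seq_along(multiData)) {
      x <- multiData[[s]][goodSamples[[s]], , drop = FALSE]
      nPresent <- colSums(!is.na(x))
      v <- vapply(seq_len(ncol(x)), function(j) safeVar(x[, j]), numeric(1))
      ok <- nPresent >= max(minNSamples, minFraction * nrow(x)) &
        !is.na(v) & v > tol
      goodGenes <- goodGenes & ok
    }
    if (sum(goodGenes) < minNGenes) stop("Too few good genes.")
    # Samples must have enough present values among the currently good genes.
    for (s in seq_along(multiData)) {
      x <- multiData[[s]][, goodGenes, drop = FALSE]
      nPresent <- rowSums(!is.na(x))
      goodSamples[[s]] <- goodSamples[[s]] &
        nPresent >= max(minNGenes, minFraction * ncol(x))
    }
    # Stop once neither list changed during this pass.
    if (identical(oldGenes, goodGenes) && identical(oldSamples, goodSamples)) break
  }
  list(goodSamples = goodSamples, goodGenes = goodGenes)
}

set.seed(42)
x1 <- matrix(rnorm(8 * 6), nrow = 8)
x2 <- matrix(rnorm(8 * 6), nrow = 8)
x1[, 2] <- 3       # constant gene in set 1: zero variance
x2[1:6, 5] <- NA   # gene 5 is mostly missing in set 2
res <- iterateFilter(list(x1, x2))
```

In this toy input, genes 2 and 5 are flagged as bad and all samples are kept. Because a gene is tested in every set, one bad set is enough to exclude it everywhere, which is why the result has a single `goodGenes` vector but one `goodSamples` vector per set.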

Value

A list with the following components:

`goodSamples`	A list with one component per given set. Each component is a logical vector with one entry per sample in the corresponding set that is `TRUE` if the sample is considered good and `FALSE` otherwise.
`goodGenes`	A logical vector with one entry per gene that is `TRUE` if the gene is considered good and `FALSE` otherwise.
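
A typical use of the returned list is to subset each data set to its good samples and the shared good genes. The sketch below is a hedged example on hypothetical toy data: it calls `goodSamplesGenesMS` if the WGCNA package is installed and otherwise falls back to a hand-built list with the structure documented above, so only the `goodSamples` and `goodGenes` components are assumed.

```r
# Toy multi-set input: 8 shared genes, 6 and 5 samples (hypothetical data).
set.seed(7)
expr1 <- matrix(rnorm(6 * 8), nrow = 6)
expr2 <- matrix(rnorm(5 * 8), nrow = 5)
expr1[1:5, 3] <- NA   # gene 3 has a single observed value in set 1
multiExpr <- list(set1 = list(data = expr1), set2 = list(data = expr2))

gsg <- if (requireNamespace("WGCNA", quietly = TRUE)) {
  WGCNA::goodSamplesGenesMS(multiExpr, verbose = 0)
} else {
  # Stand-in with the documented structure, used only if WGCNA is unavailable.
  list(goodSamples = list(rep(TRUE, 6), rep(TRUE, 5)),
       goodGenes = c(TRUE, TRUE, FALSE, rep(TRUE, 5)))
}

# Keep good samples (rows) per set and good genes (columns) across all sets.
filtered <- lapply(seq_along(multiExpr), function(s) {
  list(data = multiExpr[[s]]$data[gsg$goodSamples[[s]], gsg$goodGenes,
                                  drop = FALSE])
})
names(filtered) <- names(multiExpr)
```

Using `drop = FALSE` keeps each `data` component a matrix even if only one sample or gene survives the filtering.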

Author(s)

Peter Langfelder