Hierarchical consensus network construction and module identification
Hierarchical consensus network construction and module identification across multiple data sets.
hierarchicalConsensusModules(
   multiExpr,
   multiWeights = NULL,
   multiExpr.imputed = NULL,

   # Data checking options
   checkMissingData = TRUE,

   # Blocking options
   blocks = NULL,
   maxBlockSize = 5000,
   blockSizePenaltyPower = 5,
   nPreclusteringCenters = NULL,
   randomSeed = 12345,

   # Network construction options.
   networkOptions,

   # Save individual TOMs?
   saveIndividualTOMs = TRUE,
   individualTOMFileNames = "individualTOM-Set%s-Block%b.RData",
   keepIndividualTOMs = FALSE,

   # Consensus calculation options
   consensusTree = NULL,

   # Return options
   saveConsensusTOM = TRUE,
   consensusTOMFilePattern = "consensusTOM-%a-Block%b.RData",

   # Keep the consensus?
   keepConsensusTOM = saveConsensusTOM,

   # Internal handling of TOMs
   useDiskCache = NULL,
   chunkSize = NULL,
   cacheBase = ".blockConsModsCache",
   cacheDir = ".",

   # Alternative consensus TOM input from a previous calculation
   consensusTOMInfo = NULL,

   # Basic tree cut options
   deepSplit = 2,
   detectCutHeight = 0.995,
   minModuleSize = 20,
   checkMinModuleSize = TRUE,

   # Advanced tree cut options
   maxCoreScatter = NULL,
   minGap = NULL,
   maxAbsCoreScatter = NULL,
   minAbsGap = NULL,
   minSplitHeight = NULL,
   minAbsSplitHeight = NULL,
   useBranchEigennodeDissim = FALSE,
   minBranchEigennodeDissim = mergeCutHeight,
   stabilityLabels = NULL,
   stabilityCriterion = c("Individual fraction", "Common fraction"),
   minStabilityDissim = NULL,
   pamStage = TRUE,
   pamRespectsDendro = TRUE,
   iteratePruningAndMerging = FALSE,
   minCoreKME = 0.5,
   minCoreKMESize = minModuleSize/3,
   minKMEtoStay = 0.2,

   # Module eigengene calculation options
   impute = TRUE,
   trapErrors = FALSE,
   excludeGrey = FALSE,

   # Module merging options
   calibrateMergingSimilarities = FALSE,
   mergeCutHeight = 0.15,

   # General options
   collectGarbage = TRUE,
   verbose = 2,
   indent = 0,
   ...)
multiExpr |
Expression data in the multi-set format (see |
multiWeights |
optional observation weights in the same format (and dimensions) as |
multiExpr.imputed |
If |
checkMissingData |
Logical: should data be checked for excessive numbers of missing entries in genes and samples, and for genes with zero variance? See details. |
blocks |
Optional specification of blocks in which hierarchical clustering and module detection
should be performed. If given, must be a numeric vector with one entry per gene
of |
maxBlockSize |
Integer giving maximum block size for module detection. Ignored if |
blockSizePenaltyPower |
Number specifying how strongly blocks should be penalized for exceeding the
maximum size. Set to a large number or |
nPreclusteringCenters |
Number of centers to be used in the preclustering. Defaults to smaller of
|
randomSeed |
Integer to be used as seed for the random number generator before the function
starts. If a current seed exists, it is saved and restored upon exit. If |
networkOptions |
A single list of class |
saveIndividualTOMs |
Logical: should individual TOMs be saved to disk ( |
individualTOMFileNames |
Character string giving the file names to save individual TOMs into. The
following tags should be used to make the file names unique for each set and block: |
keepIndividualTOMs |
Logical: should individual TOMs be retained after the calculation is finished? |
consensusTree |
A list specifying the consensus calculation. See details. |
saveConsensusTOM |
Logical: should the consensus TOM be saved to disk? |
consensusTOMFilePattern |
Character string giving the file names to save consensus TOMs into. The
following tags should be used to make the file names unique for each set and block: |
keepConsensusTOM |
Logical: should consensus TOM be retained after the calculation ends? Depending on |
useDiskCache |
Logical: should disk cache be used for consensus calculations? The disk cache can be used to store chunks of
calibrated data that are small enough to fit one chunk from each set into memory (blocks may be small enough
to fit one block of one set into memory, but not small enough to fit one block from all sets in a consensus
calculation into memory at the same time). Using disk cache is slower but lessens the memory footprint of
the calculation.
As a general guide, if individual data are split into blocks, we
recommend setting this argument to |
chunkSize |
Integer giving the chunk size. If left |
cacheDir |
Directory in which to save cache files. The files are deleted on normal exit but persist if the function terminates abnormally. |
cacheBase |
Base for the file names of cache files. |
consensusTOMInfo |
If the consensus TOM has been pre-calculated using function |
deepSplit |
Numeric value between 0 and 4. Provides a simplified control over how sensitive
module detection should be to module splitting, with 0 least and 4 most sensitive. See
|
detectCutHeight |
Dendrogram cut height for module detection. See
|
minModuleSize |
Minimum module size for module detection. See
|
checkMinModuleSize |
logical: should sanity checks be performed on |
maxCoreScatter |
maximum scatter of the core for a branch to be a cluster, given as the fraction
of |
minGap |
minimum cluster gap given as the fraction of the difference between |
maxAbsCoreScatter |
maximum scatter of the core for a branch to be a cluster given as absolute
heights. If given, overrides |
minAbsGap |
minimum cluster gap given as absolute height difference. If given, overrides
|
minSplitHeight |
Minimum split height given as the fraction of the difference between
|
minAbsSplitHeight |
Minimum split height given as an absolute height.
Branches merging below this height will automatically be merged. If not given (default), will be determined
from |
useBranchEigennodeDissim |
Logical: should branch eigennode (eigengene) dissimilarity be considered when merging branches in Dynamic Tree Cut? |
minBranchEigennodeDissim |
Minimum consensus branch eigennode (eigengene) dissimilarity for
branches to be considered separate. The branch eigennode dissimilarity in individual sets
is simply 1-correlation of the
eigennodes; the consensus is defined as quantile with probability |
stabilityLabels |
Optional matrix of cluster labels that are to be used for calculating branch
dissimilarity based on split stability. The number of rows must equal the number of genes in
|
stabilityCriterion |
One of |
minStabilityDissim |
Minimum stability dissimilarity criterion for two branches to be considered
separate. Should be a number between 0 (essentially no dissimilarity required) and 1 (perfect dissimilarity
or distinguishability based on |
pamStage |
logical. If TRUE, the second (PAM-like) stage of module detection will be performed.
See |
pamRespectsDendro |
Logical, only used when |
iteratePruningAndMerging |
Logical: should pruning of low-KME genes and module merging be iterated?
For backward compatibility, the default is |
minCoreKME |
a number between 0 and 1. If a detected module does not have at least
|
minCoreKMESize |
see |
minKMEtoStay |
genes whose eigengene connectivity to their module eigengene is lower than
|
impute |
logical: should imputation be used for module eigengene calculation? See
|
trapErrors |
logical: should errors in calculations be trapped? |
excludeGrey |
logical: should the returned module eigengenes exclude the eigengene of the "module" that contains unassigned genes? |
calibrateMergingSimilarities |
Logical: should module eigengene similarities be calibrated before calculating the consensus? Although calibration is in principle desirable, the calibration methods currently available assume large data and do not work very well on eigengene similarities. |
mergeCutHeight |
Dendrogram cut height for module merging. |
collectGarbage |
Logical: should garbage be collected after some of the memory-intensive steps? |
verbose |
integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose. |
indent |
indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces. |
... |
Other arguments. Currently ignored. |
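As a rough illustration of the multi-set format expected by multiExpr, the sketch below builds a toy input in base R. It assumes the usual WGCNA convention that each set is a list with a component named data holding samples in rows and genes in columns; the set names, dimensions and random values are made up for illustration.

```r
set.seed(1)
nGenes <- 50

# Build one hypothetical data set: samples in rows, genes in columns.
makeSet <- function(nSamples) {
  x <- matrix(rnorm(nSamples * nGenes), nrow = nSamples, ncol = nGenes)
  colnames(x) <- paste0("Gene", seq_len(nGenes))
  list(data = x)
}

# Two sets with different numbers of samples but the same genes
# (columns) in the same order.
multiExpr <- list(SetA = makeSet(20), SetB = makeSet(30))

# Basic consistency checks: every set must describe the same genes.
nGenesPerSet <- sapply(multiExpr, function(set) ncol(set$data))
sameGenes <- all(sapply(multiExpr, function(set)
  identical(colnames(set$data), colnames(multiExpr[[1]]$data))))
```

The sets may differ in the number of samples, but the gene dimension has to match across sets because the consensus is computed gene by gene.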
This function calculates a consensus network with a flexible, possibly hierarchical consensus specification, identifies (consensus) modules in the network, and calculates their eigengenes. "Blockwise" calculation is available for large data sets for which a full network (TOM or adjacency matrix) would not fit into available RAM.
The input can be either several numerical data sets (expression etc) in the argument multiExpr together with all necessary network construction options, or a pre-calculated network, typically the result of a call to hierarchicalConsensusTOM.
Steps in the network construction include the following: (1) optional filtering of variables (genes) and observations (samples) that contain too many missing values or have zero variance; (2) optional pre-clustering to split data into blocks of manageable size; (3) calculation of adjacencies and optionally of TOMs in each individual data set; (4) calculation of consensus network from the individual networks; (5) hierarchical clustering and module identification; (6) trimming of modules by removing genes with low correlation with the eigengene of the module; and (7) merging of modules whose eigengenes are strongly correlated.
Steps 1-4 (up to and including the calculation of consensus network from the individual networks) are handled by the function hierarchicalConsensusTOM.
Variables (genes) are clustered using average-linkage hierarchical clustering and modules are identified in the resulting dendrogram by the Dynamic Hybrid tree cut.
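The clustering step can be sketched in base R as follows. This is only an illustration on a made-up similarity matrix: the constant-height cut with cutree is a simple stand-in for the Dynamic Hybrid tree cut (which lives in a separate package and is more elaborate), and the values of cutHeight and minModuleSize are chosen for the toy data rather than taken from the defaults.

```r
set.seed(42)

# Toy similarity: two groups of 10 "genes" that are similar within groups.
n <- 20
group <- rep(1:2, each = n / 2)
sim <- matrix(0.1, n, n)
sim[outer(group, group, "==")] <- 0.8

# Add a small symmetric perturbation so merge heights are not all tied.
noise <- matrix(runif(n * n, 0, 0.05), n, n)
sim <- sim + (noise + t(noise)) / 2
diag(sim) <- 1

# Dissimilarity = 1 - similarity, clustered with average linkage.
dissim <- as.dist(1 - sim)
tree <- hclust(dissim, method = "average")

# Stand-in for the Dynamic Hybrid cut: a constant-height cut in which
# branches smaller than minModuleSize are labeled 0 (unassigned).
cutHeight <- 0.5
minModuleSize <- 5
raw <- cutree(tree, h = cutHeight)
sizes <- table(raw)
clusterSize <- as.vector(sizes[as.character(raw)])
labels <- ifelse(clusterSize >= minModuleSize, raw, 0)
```

In the toy data the two planted groups are recovered as modules 1 and 2; in real data the shape of the branches, not only the cut height, determines the modules.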
Found modules are trimmed of genes whose consensus module membership kME (that is, correlation with module eigengene) is less than minKMEtoStay. Modules in which fewer than minCoreKMESize genes have consensus KME higher than minCoreKME are disbanded, i.e., their constituent genes are pronounced unassigned.
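A minimal sketch of the trimming and disbanding rules, assuming the consensus kME of every gene to its own module is already available as a plain numeric vector. The helper trimModules, the labels and the kME values are hypothetical and only illustrate the thresholds; they are not the package's internal implementation.

```r
# Hypothetical module labels and consensus kME values for 12 genes.
moduleLabels <- c(rep(1, 7), rep(2, 5))
consKME <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.3, 0.1,  # module 1
             0.45, 0.4, 0.35, 0.25, 0.15)         # module 2

minKMEtoStay   <- 0.2
minCoreKME     <- 0.5
minCoreKMESize <- 2

trimModules <- function(labels, kME, minKMEtoStay, minCoreKME, minCoreKMESize) {
  out <- labels
  for (mod in setdiff(unique(labels), 0)) {
    inMod <- labels == mod
    # Number of "core" genes with kME above minCoreKME.
    nCore <- sum(kME[inMod] > minCoreKME)
    if (nCore < minCoreKMESize) {
      # Too few core genes: disband the whole module.
      out[inMod] <- 0
    } else {
      # Otherwise drop only the genes with kME below minKMEtoStay.
      out[inMod & kME < minKMEtoStay] <- 0
    }
  }
  out
}

trimmed <- trimModules(moduleLabels, consKME,
                       minKMEtoStay, minCoreKME, minCoreKMESize)
```

Here module 1 keeps its six genes with kME of at least 0.2 and loses the gene with kME 0.1, while module 2 has no gene above minCoreKME and is disbanded entirely.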
After all blocks have been processed, the function checks whether there are genes whose KME in the module to which they are assigned is lower than their KME to another module. If the p-values of the higher correlations are smaller than those of the native module by the factor reassignThresholdPS (in every set), the gene is re-assigned to the closer module.
In the last step, modules whose eigengenes are highly correlated are merged. This is achieved by clustering module eigengenes using the dissimilarity given by one minus their correlation, cutting the dendrogram at the height mergeCutHeight and merging all modules on each branch. The process is iterated until no modules are merged. See mergeCloseModules for more details on module merging.
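One pass of the merging step can be sketched in base R on a single data set. The eigengenes below are simulated, the module labels are hypothetical, and the sketch ignores the consensus across sets as well as the iteration; it only shows the dissimilarity, the tree and the cut at mergeCutHeight.

```r
set.seed(7)
nSamples <- 40

# Hypothetical eigengenes: modules 1 and 2 follow nearly the same
# profile, module 3 is independent of both.
base1 <- rnorm(nSamples)
MEs <- cbind(ME1 = base1,
             ME2 = base1 + rnorm(nSamples, sd = 0.1),
             ME3 = rnorm(nSamples))

mergeCutHeight <- 0.15

# Dissimilarity = 1 - correlation, average-linkage tree, constant-height cut.
MEDiss <- as.dist(1 - cor(MEs))
METree <- hclust(MEDiss, method = "average")
branch <- cutree(METree, h = mergeCutHeight)

# Map every old module label to the smallest label on its branch.
oldLabels <- c(1, 2, 3)
newLabels <- ave(oldLabels, branch, FUN = min)
mergeMap <- setNames(newLabels, oldLabels)
```

With mergeCutHeight = 0.15, modules are merged when their eigengene correlation exceeds 0.85, so modules 1 and 2 collapse into one module while module 3 stays separate. A full implementation would recompute eigengenes of the merged modules and repeat until nothing changes.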
The module trimming and merging process is optionally iterated. Iterations are recommended but are (for now) not the default for backward compatibility.
List with the following components:
labels |
A numeric vector with one component per variable (gene), giving the module label of each variable (gene). Label 0 is reserved for unassigned variables; module labels are sequential and smaller numbers are used for larger modules. |
unmergedLabels |
A numeric vector with one component per variable (gene), giving the unmerged module label of each variable (gene), i.e., module labels before the call to module merging. |
colors |
A character vector with one component per variable (gene),
giving the module colors. The labels are mapped to colors using |
unmergedColors |
A character vector with one component per variable (gene), giving the unmerged module colors. |
multiMEs |
Module eigengenes corresponding to the modules returned in |
dendrograms |
A list with one component for each block of genes. Each component is the hierarchical clustering dendrogram obtained by clustering the consensus gene dissimilarity in the corresponding block. |
consensusTOMInfo |
A list detailing various aspects of the consensus TOM. See
|
blockInfo |
A list with information about blocks as well as the variables and observations (genes and samples) retained after filtering out those with zero variance and too many missing values. |
moduleIdentificationArguments |
A list with the module identification arguments supplied to this
function. Contains
|
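The labeling convention used for labels (0 for unassigned variables, sequential labels, smaller numbers for larger modules) can be illustrated with a small base R helper. The function relabelBySize and the input vector are hypothetical and are not part of the package.

```r
# Hypothetical raw labels; 0 marks unassigned genes.
raw <- c(5, 5, 9, 9, 9, 9, 0, 2, 2, 2, 0)

relabelBySize <- function(labels) {
  assigned <- labels[labels != 0]
  # Module sizes, largest first.
  sizes <- sort(table(assigned), decreasing = TRUE)
  # Old label (as character) -> new sequential label.
  map <- setNames(seq_along(sizes), names(sizes))
  out <- rep(0, length(labels))
  out[labels != 0] <- map[as.character(assigned)]
  out
}

relabeled <- relabelBySize(raw)
```

In this example the largest module (old label 9, four genes) becomes module 1, the next largest (old label 2) becomes module 2, and unassigned genes keep label 0.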
If the input datasets have large numbers of genes, choose maxBlockSize carefully, as it
significantly affects the memory footprint (and whether the function will fail with a memory allocation
error). From a theoretical point of view it is advantageous to use blocks as large as possible; on the
other hand, using smaller blocks is substantially faster and often the only way to work with large
numbers of genes. As a rough guide, when 4GB of memory are available, blocks should be no larger than 8,000
genes; with 8GB one can handle some 13,000 genes; with 16GB around 20,000; and with 32GB around 30,000.
Depending on the operating system and its setup, these numbers may vary substantially.
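To see why block size drives memory use, the following back-of-the-envelope calculation gives the size of a single dense matrix of doubles (8 bytes per entry) for the block sizes quoted above. It is an assumption of this sketch, not a statement from the package, that several such matrices (for example one per set plus the consensus) are held at the same time, which would explain why the recommended block sizes sit well below the naive limit.

```r
# Memory needed to hold one dense n-by-n matrix of doubles (8 bytes each),
# expressed in gigabytes (1 GB = 1e9 bytes).
matrixGB <- function(nGenes) nGenes^2 * 8 / 1e9

blockSizes <- c(8000, 13000, 20000, 30000)
footprint <- data.frame(blockSize = blockSizes,
                        oneMatrixGB = matrixGB(blockSizes))
print(footprint)
```

A single 8,000-gene matrix takes about 0.5 GB and a 30,000-gene matrix about 7.2 GB; memory grows with the square of the block size, so doubling the block size quadruples the footprint.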
Peter Langfelder
Non-hierarchical consensus networks are described in Langfelder P, Horvath S (2007), Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54.
More in-depth discussion of selected topics can be found at http://www.peterlangfelder.com/ , and an FAQ at https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/faq.html .
hierarchicalConsensusTOM for calculation of hierarchical consensus networks (adjacency and TOM), and a more detailed description of the calculation;
hclust and cutreeHybrid for hierarchical clustering and the Dynamic Tree Cut branch cutting method;
mergeCloseModules for module merging;
blockwiseModules for an analogous analysis on a single data set.