Function to estimate a test statistics joint null distribution for t-statistics via the vector influence curve
For a broad class of testing problems, such as the test of single-parameter null hypotheses using t-statistics, a proper, asymptotically valid test statistics joint null distribution is the multivariate Gaussian distribution with mean vector zero and covariance matrix equal to the correlation matrix of the vector influence curve for the estimator of the parameter of interest. The function corr.null estimates the correlation matrix of the vector influence curve for such parameters and returns samples from the corresponding normal distribution. Arguments to the function allow for refinements in calculating the resulting null distribution estimate.
corr.null(X, W = NULL, Y = NULL, Z = NULL, test = "t.twosamp.unequalvar", alternative = "two-sided", use = "pairwise", B = 1000, MVN.method = "mvrnorm", penalty = 1e-06, ic.quant.trans = FALSE, marg.null = NULL, marg.par = NULL, perm.mat = NULL)
X |
A matrix, data.frame or ExpressionSet containing the raw data. In the case of an ExpressionSet, |
W |
A matrix containing non-negative weights to be used in computing the test statistics. Must be same dimension as |
Y |
A vector, factor, or |
Z |
A vector, factor, or matrix containing covariate data to be used in linear regression models. Each variable should be in one column, so that |
test |
Character string specifying the test statistics to use, by default 't.twosamp.unequalvar'. See details (below) for a list of tests. |
alternative |
Character string indicating the alternative hypotheses, by default 'two.sided'. For one-sided tests, use 'less' or 'greater' for null hypotheses of 'greater than or equal' (i.e. alternative is 'less') and 'less than or equal', respectively. |
use |
Similar to the options in |
B |
The number of samples to be drawn from the normal distribution. Default is 1000. |
MVN.method |
Character string of either of 'mvrnorm' or 'Cholesky' designating how correlated normal test statistics are to be generated. Selecting 'mvrnorm' uses the function of the same name found in the |
penalty |
If |
ic.quant.trans |
A logical indicating whether or not a marginal quantile transformation using a t-distribution or user-supplied marginal distribution (stored in |
marg.null |
If |
marg.par |
If |
perm.mat |
If |
This function is called internally when the argument nulldist='ic' is evaluated in the main user-level functions MTP or EBMTP. Formatting of the data objects X, W, Y, and especially Z occurs at the beginning of execution of the main user-level functions.
Based on the value of test, the appropriate correlation matrix of the vector influence curve is calculated. Once the correlation matrix is obtained, one may sample vectors of null test statistics directly from a multivariate normal distribution rather than relying on permutation-based or bootstrap-based resampling. Because the Gaussian distribution is continuous, we expect this choice of null distribution to suffer less from discreteness than either the permutation or the bootstrap distribution. Additionally, in large-scale settings, the use of null distributions derived from the vector influence function typically reduces the computational bottlenecks associated with resampling methods.
Because the influence curve null distributions have been implemented for parametric, standardized t-statistics, the options robust and standardize are not allowed. Influence curve null distributions are available for the following values of test: 't.onesamp', 't.pair', 't.twosamp.equalvar', 't.twosamp.unequalvar', 'lm.XvsZ', 'lm.YvsXZ', 't.cor', and 'z.cor'.
In the simpler cases involving one-sample and two-sample tests of means, the correlation matrices are obtained via calls to cor. For two-sample tests, the correlation matrix corresponds to the following transformation of the group-specific covariance matrices: cov(X(group1))/n1 + cov(X(group2))/n2, where n1 and n2 are the sample sizes of each group. When weights are present, the internal function IC.CorXW.NA is called to calculate weighted estimates of the (group) covariance matrices from each subject's estimated vector influence curve. The calculations are similar in spirit to those in cov.wt, but they are done in a way that allows for handling NA elements in the estimated vector influence curve IC_n. The correlation matrix corresponding to IC_n * (IC_n)^t is calculated.
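The unweighted two-sample transformation described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's internal code: the data, the group labels, and the variable names (X1, X2, sigma, sigma_corr) are all hypothetical, and weights are ignored.

```r
# Sketch only: unweighted two-sample case with made-up data.
set.seed(1)
X <- matrix(rnorm(5 * 12), nrow = 5, ncol = 12)  # 5 hypotheses (rows), 12 samples (columns)
Y <- rep(c(0, 1), each = 6)                      # hypothetical group labels

X1 <- X[, Y == 0, drop = FALSE]
X2 <- X[, Y == 1, drop = FALSE]
n1 <- ncol(X1)
n2 <- ncol(X2)

# cov() treats columns as variables, so transpose to put hypotheses in columns.
# "pairwise.complete.obs" tolerates NA entries, should any be present.
sigma <- cov(t(X1), use = "pairwise.complete.obs") / n1 +
  cov(t(X2), use = "pairwise.complete.obs") / n2

# Rescale the combined covariance matrix to a correlation matrix.
sigma_corr <- cov2cor(sigma)
```

The resulting sigma_corr has one row and one column per hypothesis and a unit diagonal, which is the form needed for sampling null test statistics.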
For linear regression models, corr.null calculates the vector influence curve associated with each subject/sample. The vector has length equal to the number of hypotheses. The internal function IC.Cor.NA is used to calculate IC_n * (IC_n)^t in a manner that allows for NA-handling when the influence curve may contain missing elements. For linear regression models of the form E[Y|X], IC_n takes the form (E[((X^t)X)^(-1)] (X^t)_i Y_i) - Y_i-hat. Influence curves for correlation parameters are more complicated, and the user is referred to the references below.
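To make the NA-tolerant calculation of IC_n * (IC_n)^t concrete, here is a minimal base R sketch. It uses the simplest influence curve, that of a mean (each observation minus its row mean), rather than the regression case. The helper ic_cov_na is hypothetical and is not the package's IC.Cor.NA; it simply averages each product over the subjects observed for both hypotheses.

```r
# Sketch only: NA-tolerant cross-product of an estimated influence curve.
set.seed(2)
X <- matrix(rnorm(4 * 20), nrow = 4, ncol = 20)  # 4 hypotheses, 20 subjects
X[1, 3] <- NA
X[2, 7] <- NA

# Estimated influence curve for a mean: observation minus the row mean.
IC <- X - rowMeans(X, na.rm = TRUE)

# Hypothetical helper: average IC_j * IC_k over subjects observed for both j and k.
ic_cov_na <- function(IC) {
  w <- (!is.na(IC)) * 1        # 1 where observed, 0 where missing
  IC0 <- IC
  IC0[is.na(IC0)] <- 0         # missing entries contribute nothing to the sums
  n_jk <- w %*% t(w)           # pairwise counts of jointly observed subjects
  (IC0 %*% t(IC0)) / n_jk
}

sigma <- ic_cov_na(IC)
sigma_corr <- cov2cor(sigma)
```

With complete data this reduces to the ordinary correlation matrix of the rows of X, which is a useful sanity check on the helper.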
Once the correlation matrix sigma' corresponding to the variance-covariance matrix of the vector influence curve, sigma = IC_n * (IC_n)^t, is obtained, one may sample from N(0, sigma') to obtain null test statistics.
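Sampling from N(0, sigma') can be illustrated with a Cholesky factorization in base R. This is a sketch, not the function's actual code path. In particular, it assumes that the penalty is a small constant added to the diagonal so that chol succeeds; the variable names (sigma_corr, R, null_stats) and the simulated data are hypothetical.

```r
# Sketch only: draw B null test statistic vectors from N(0, sigma').
set.seed(3)
X <- matrix(rnorm(6 * 30), nrow = 6, ncol = 30)   # 6 hypotheses, 30 samples
sigma_corr <- cor(t(X), use = "pairwise.complete.obs")

B <- 2000          # number of null vectors to draw
penalty <- 1e-6    # assumed role: small diagonal inflation for numerical stability
m <- nrow(sigma_corr)

# chol() returns an upper-triangular R with t(R) %*% R equal to its argument.
R <- chol(sigma_corr + penalty * diag(m))

# Independent standard normals, one column per draw.
Zmat <- matrix(rnorm(m * B), nrow = m, ncol = B)

# Each column of null_stats has covariance t(R) %*% R.
null_stats <- t(R) %*% Zmat
```

The result has one row per hypothesis and one column per draw, matching the orientation of the matrix that corr.null is documented to return.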
If ic.quant.trans=TRUE, the matrix of null test statistics can be quantile transformed to produce a matrix which accounts for the joint dependencies between test statistics (down columns), but which has marginal t-distributions (across rows). If marg.null and marg.par are not specified (=NULL), the following default t-distributions are applied:
t.onesamp: df=n-1;
t.pair: df=n-1, where n is the number of unique samples, i.e., the number of observed differences between paired samples;
t.twosamp.equalvar: df=n-2;
t.twosamp.unequalvar: df=n-1; N.B., this is not recommended, since the effective degrees of freedom are unknown. With sufficiently large n, a normal approximation should yield similar results.
lm.XvsZ: df=n-p, where p is the number of variables in the regression equation;
lm.YvsXZ: df=n-p, where p is the number of variables in the regression equation;
t.cor: df=n-2;
z.cor: N.B., also not recommended. Fisher's z-statistics are already normally distributed. Marginal transformation to a t-distribution makes little sense.
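The idea of a marginal quantile transformation can be sketched with ranks in base R. This is a generic illustration and may differ from what corr.null does internally. The helper quant_trans_t is hypothetical, and the sample size n is made up so that the t.onesamp default of df=n-1 can be shown.

```r
# Sketch only: rank-based marginal quantile transformation to a t-distribution.
set.seed(4)
m <- 3
B <- 1000
null_stats <- matrix(rnorm(m * B), nrow = m, ncol = B)  # stand-in for Gaussian null statistics

n <- 10        # hypothetical sample size
df <- n - 1    # default degrees of freedom listed above for 't.onesamp'

# Hypothetical helper: within each row (hypothesis), replace every value by the
# t quantile at its rank. Ranks are kept, so the ordering of draws in each row,
# and hence the rank-based dependence between rows, is unchanged.
quant_trans_t <- function(Z, df) {
  B <- ncol(Z)
  t(apply(Z, 1, function(z) qt((rank(z) - 0.5) / B, df = df)))
}

null_t <- quant_trans_t(null_stats, df)
```

Because only the marginal values change while the within-row ranks stay fixed, the transformed matrix keeps the dependence structure across hypotheses while each row follows the chosen t-distribution.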
A matrix of null test statistics with dimension the number of hypotheses (typically nrow(X)) by the number of desired samples (B).
Houston N. Gilbert
K.S. Pollard and Mark J. van der Laan, "Resampling-based Multiple Testing: Asymptotic Control of Type I Error and Applications to Gene Expression Data" (June 24, 2003). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 121. http://www.bepress.com/ucbbiostat/paper121
S. Dudoit and M.J. van der Laan. Multiple Testing Procedures and Applications to Genomics. Springer Series in Statistics. Springer, New York, 2008.
H.N. Gilbert, M.J. van der Laan, and S. Dudoit, "Joint Multiple Testing Procedures for Inferring Genetic Networks from Lower-Order Conditional Independence Graphs" (2009). In preparation.
set.seed(99)
# 10 hypotheses (rows) by 50 samples (columns)
data <- matrix(rnorm(10*50), nrow=10, ncol=50)
nulldistn.mvrnorm <- corr.null(data, test="t.onesamp", alternative="greater", B=5000)
nulldistn.chol <- corr.null(data, test="t.onesamp", MVN.method="Cholesky", penalty=1e-9)
nulldistn.t <- corr.null(data, test="t.onesamp", ic.quant.trans=TRUE)
dim(nulldistn.mvrnorm)