Generate a random data.frame with tunable characteristics
This function generates a random data.frame with a missingness mechanism that is used to impose a missingness pattern. The primary purpose of this function is for use in simulations.
rdata.frame(N = 1000,
            restrictions = c("none", "MARish", "triangular", "stratified", "MCAR"),
            last_CPC = NA_real_, strong = FALSE, pr_miss = .25,
            Sigma = NULL, alpha = NULL, experiment = FALSE,
            treatment_cor = c(rep(0, n_full - 1), rep(NA, 2 * n_partial)),
            n_full = 1, n_partial = 1, n_cat = NULL, eta = 1, df = Inf,
            types = "continuous", estimate_CPCs = TRUE)
N |
integer indicating the number of observations |
restrictions |
character string indicating what restrictions to impose on the missing data mechanism, see the Details section |
last_CPC |
a numeric scalar between -1 and 1 exclusive or
|
strong |
Integer among 0, 1, and 2 indicating how strong to
make the instruments with multiple partially observed variables,
in which case the missingness indicators for each partially observed variable
can be used as instruments when predicting missingness on other partially
observed variables. Only applies when |
pr_miss |
numeric scalar on the (0,1) interval or vector
of length |
Sigma |
Either |
alpha |
Either |
experiment |
logical indicating whether to simulate a randomized experiment |
treatment_cor |
Numeric vector of appropriate length indicating the
correlations between the treatment variable and the other variables, which
is only relevant if |
n_full |
integer indicating the number of fully observed variables |
n_partial |
integer indicating the number of partially observed variables |
n_cat |
Either |
eta |
Positive numeric scalar which serves as a hyperparameter in the data-generating process. The default value of 1 implies that the correlation matrix among the variables is jointly uniformly distributed, using essentially the same logic as in the clusterGeneration package |
df |
positive numeric scalar indicating the degrees of freedom for the
(possibly skewed) multivariate t distribution, which defaults to
|
types |
a character vector (possibly of length one, in which case it
is recycled) indicating the type for each fully observed and partially
observed variable, which currently can be among |
estimate_CPCs |
A logical indicating whether the canonical partial correlations
between the partially observed variables and the latent missingnesses should
be estimated. The default is |
By default, the correlation matrix among the variables and missingness indicators is intended to be close to uniform, although it is often not possible to achieve exactly. If restrictions = "none", the data will be Not Missing At Random (NMAR). If restrictions = "MARish", the departure from Missing At Random (MAR) will be minimized via a call to optim, but generally will not fully achieve MAR. If restrictions = "triangular", the MAR assumption will hold but the missingness of each partially observed variable will only depend on the fully observed variables and the other latent missingness indicators. If restrictions = "stratified", the MAR assumption will hold but the missingness of each partially observed variable will only depend on the fully observed variables. If restrictions = "MCAR", the Missing Completely At Random (MCAR) assumption holds, which is much more restrictive than MAR.
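To make the difference between MCAR and a mechanism that depends only on fully observed variables concrete, the following base R sketch may help. It is a simplified illustration and not the procedure used inside this function: the variable names, the coefficient 0.8, and the thresholding of a latent normal variable are all assumptions chosen for the example.

```r
set.seed(1)
N <- 10000
pr_miss <- 0.25

x <- rnorm(N)              # stands in for a fully observed variable
y <- 0.5 * x + rnorm(N)    # stands in for a partially observed variable

# MCAR: the missingness indicator is independent of x and y
miss_mcar <- runif(N) < pr_miss

# Missingness driven only by the fully observed x (so MAR holds given x).
# The latent variable has variance 0.8^2 + 0.6^2 = 1, so thresholding it at
# the (1 - pr_miss) normal quantile gives roughly pr_miss missing values.
latent <- 0.8 * x + rnorm(N, sd = 0.6)
miss_mar <- latent > qnorm(1 - pr_miss)

y_mcar <- ifelse(miss_mcar, NA, y)
y_mar  <- ifelse(miss_mar, NA, y)

# Under MCAR the indicator is roughly uncorrelated with x;
# under the second mechanism it is clearly correlated with x.
cor_mcar <- cor(x, as.numeric(miss_mcar))
cor_mar  <- cor(x, as.numeric(miss_mar))
```

In both cases about a quarter of the values are missing, but only in the second case does the chance of being missing change with x.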
There are some rules to follow, particularly when specifying types. First, if experiment = TRUE, there must be exactly one treatment variable (taken to be binary) and it must come first to ensure that the elements of treatment_cor are handled properly. Second, if there are any partially observed nominal variables, they must come last; this is to ensure that they are conditionally uncorrelated with each other. Third, fully observed nominal variables are not supported, but they can be made into ordinal variables and then converted to nominal after the fact. Fourth, including both ordinal and nominal partially observed variables is not supported yet. Finally, if any variable is specified as a count, it will not be exactly consistent with the data-generating process. Essentially, a count variable is constructed from a continuous variable by evaluating pt on it and passing that to qpois with an intensity parameter of 5. The other non-continuous variables are constructed via some transformation or discretization of a continuous variable.
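The count construction described above can be sketched in base R as follows. This is only an illustration of the pt then qpois idea, not the function's internal code; the use of rt to stand in for the continuous variable and the choice df = 5 are assumptions made for the example.

```r
set.seed(123)
df <- 5                          # assumed degrees of freedom for this sketch

z <- rt(1000, df = df)           # stand-in for a continuous generated variable
u <- pt(z, df = df)              # map to (0, 1) via the t distribution function
counts <- qpois(u, lambda = 5)   # Poisson quantile function with intensity 5

table(counts)                    # non-negative integers centered near 5
```

Because u is approximately uniform on (0, 1), the resulting counts behave like Poisson draws with mean 5 while preserving the rank order of z.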
If some partially observed variables are either ordinal or nominal (but not both), then the n_cat argument governs how many categories there are. If n_cat is NULL, then the number of categories defaults to three. If n_cat has length one, then that number of categories will be used for all categorical variables but must be greater than two. Otherwise, the length of n_cat must match the number of partially observed categorical variables and the number of categories for the ith such variable will be the ith element of n_cat.
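The n_cat rules above amount to a small recycling scheme. The helper below, resolve_n_cat, is hypothetical and not part of the package; it is a sketch of the stated rules, with n_categorical assumed to be the number of partially observed categorical variables.

```r
# Hypothetical helper illustrating the n_cat rules; not part of the package.
resolve_n_cat <- function(n_cat, n_categorical) {
  if (is.null(n_cat)) {
    # NULL: every categorical variable gets three categories
    return(rep(3L, n_categorical))
  }
  if (length(n_cat) == 1L) {
    # length one: recycle, but the value must be greater than two
    stopifnot(n_cat > 2)
    return(rep(as.integer(n_cat), n_categorical))
  }
  # otherwise: one entry per partially observed categorical variable
  stopifnot(length(n_cat) == n_categorical)
  as.integer(n_cat)
}

resolve_n_cat(NULL, 2)      # three categories for each of two variables
resolve_n_cat(4, 3)         # four categories for each of three variables
resolve_n_cat(c(3, 5), 2)   # three for the first variable, five for the second
```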
A list with the following elements:
true: a data.frame containing no NA values
obs: a data.frame derived from the previous with some NA values that represents a dataset that could be observed
empirical_CPCs: a numeric vector of empirical Canonical Partial Correlations, which should differ only randomly from zero iff MAR = TRUE and the data-generating process is multivariate normal
L: a Cholesky factor of the correlation matrix used to generate the true data
In addition, if alpha is not NULL, then the following elements are also included:
alpha: the alpha vector utilized
sn_skewness: the skewness of the multivariate skewed normal distribution in the population; note that this value is only an approximation of the skewness when df < Inf
sn_kurtosis: the kurtosis of the multivariate skewed normal distribution in the population; note that this value is only an approximation of the kurtosis when df < Inf
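As a reminder of how a Cholesky factor such as L relates to its correlation matrix, here is a small base R sketch. The 3 x 3 matrix is made up for illustration, and the sketch assumes L is the lower-triangular factor, which is consistent with the Examples recovering the matrix via tcrossprod(rdf$L).

```r
# A made-up correlation matrix, used only for illustration
Sigma <- matrix(c(1.0, 0.3, 0.5,
                  0.3, 1.0, 0.2,
                  0.5, 0.2, 1.0), nrow = 3, byrow = TRUE)

# chol() returns the upper-triangular factor, so transpose to get a lower one
L <- t(chol(Sigma))

# tcrossprod(L) computes L %*% t(L), which recovers the original matrix
Sigma_back <- tcrossprod(L)
all.equal(Sigma, Sigma_back)

# Entries above the diagonal of a lower-triangular factor are exactly zero
L[upper.tri(L)]
```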
Ben Goodrich and Jonathan Kropko, for this version, based on earlier versions written by Yu-Sung Su, Masanao Yajima, Maria Grazia Pittau, Jennifer Hill, and Andrew Gelman.
rdf <- rdata.frame(n_partial = 2, df = 5, alpha = rnorm(5))
print(rdf$empirical_CPCs) # not zero
rdf <- rdata.frame(n_partial = 2, restrictions = "triangular", alpha = NA)
print(rdf$empirical_CPCs) # only randomly different from zero
print(rdf$L == 0) # some are exactly zero by construction
mdf <- missing_data.frame(rdf$obs)
show(mdf)
hist(mdf)
image(mdf)
# a randomized experiment
rdf <- rdata.frame(n_full = 2, n_partial = 2,
                   restrictions = "triangular", experiment = TRUE,
                   types = c("t", "ord", "con", "pos"),
                   treatment_cor = c(0, 0, NA, 0, NA))
Sigma <- tcrossprod(rdf$L)
rownames(Sigma) <- colnames(Sigma) <- c("treatment", "X_2", "y_1", "Y_2",
                                        "missing_y_1", "missing_Y_2")
print(round(Sigma, 3))