
missing_variable

Class "missing_variable" and Inherited Classes


Description

The missing_variable class is essentially the data comprising a variable plus all the metadata needed to understand how its missing values will be imputed. However, no variable is merely of missing_variable class; rather, every variable is of a class that inherits from the missing_variable class. Even if a variable has no missing values, it needs to be coerced to a class that inherits from the missing_variable class before it can be used to impute values of other missing_variables. Understanding the properties of different subclasses of the missing_variable class is essential for modeling and imputing them. The missing_data.frame-class is essentially a list of objects that inherit from the missing_variable class, plus the metadata needed to understand how these missing_variables relate to each other. Most users will never need to call missing_variable directly since it is called by missing_data.frame.

Usage

missing_variable(y, type, ...)
## Hidden arguments not included in the signature:
## favor_ordered = TRUE, favor_positive = FALSE, 
## variable_name = deparse(substitute(y))

Arguments

y

Can be any vector, some of whose values may be NA, which will comprise the raw_data slot of a missing_variable (see the Slots section). It is recommended that this vector not have any transformations applied to it, such as a log-transformation. Any continuous variable can be transformed using the function in its transformation slot. The transformations and other discretionary aspects of a missing_variable are typically changed by calling the change function on a missing_data.frame. See the Slots section for more details.

type

Missing or a character string among the classes that inherit from the missing_variable class. If missing, the constructor will guess (sometimes incorrectly) based on the characteristics of the variable. The best way to improve the guessing of categorical variables is to use the factor function, possibly with ordered = TRUE, to create (possibly ordered) factors that will correctly be coerced to objects of unordered-categorical-class and ordered-categorical-class respectively. If you fail to do so, the hidden arguments that are not included in the signature affect the guesses. If favor_ordered = TRUE, which is the default, it will tend to guess that variables with few unique values should be coerced to ordered-categorical-class; otherwise, it will tend to guess unordered-categorical-class. If favor_positive = FALSE, which is the default, it will tend to guess that variables with many unique values are merely continuous, whether or not all the observed values are positive. If favor_positive = TRUE, nonnegative or positive variables will get coerced to nonnegative-continuous-class or positive-continuous-class. See the Slots section and the specific help pages for more details on the subclasses.

...

Further hidden arguments that are not in the signature. The favor_ordered and favor_positive arguments are documented immediately above. The variable_name argument can be used to control what gets put in the variable_name slot; see the Slots section below.
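As a small illustration of the advice given for the type argument, the sketch below prepares an unordered and an ordered factor in base R before anything is handed to missing_variable. The vectors region and education are made-up data, not part of the package. The final call assumes the mi package is installed, so it is guarded and skipped otherwise.

```r
# Hypothetical data; the variable names are illustrative only.
region <- factor(c("north", "south", NA, "west", "south"))
education <- factor(c("low", "high", NA, "mid", "low"),
                    levels = c("low", "mid", "high"), ordered = TRUE)

is.ordered(region)     # FALSE: a candidate for unordered-categorical
is.ordered(education)  # TRUE: a candidate for ordered-categorical

# Passing type explicitly avoids relying on the guess altogether.
if (requireNamespace("mi", quietly = TRUE)) {
  edu_mv <- mi::missing_variable(education, type = "ordered-categorical")
  print(class(edu_mv))
}
```

Declaring the level order once, when the factor is created, keeps the intended ordering out of the hands of the guessing heuristics.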

Value

The missing_variable function returns an object that inherits from the missing_variable class.

Objects from the Classes

The missing_variable class is virtual, so no objects may be created from it. However, the missing_variable generic function can be used to instantiate an object that inherits from the missing_variable class by specifying its type argument. A user would call the missing_data.frame function on a data.frame, which in turn calls the missing_variable function on each column of the data.frame using various heuristics to guess the type argument.
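The virtual-class pattern described above can be sketched with the methods package alone. The class names sketch_variable and sketch_continuous are hypothetical stand-ins, not classes defined by mi; they only show why a virtual parent cannot be instantiated while a concrete subclass can.

```r
library(methods)

# Hypothetical class names; they only mimic the pattern described above.
setClass("sketch_variable",
         representation(raw_data = "ANY", "VIRTUAL"))
setClass("sketch_continuous", contains = "sketch_variable")

# Instantiating the virtual class fails ...
virtual_attempt <- tryCatch(new("sketch_variable", raw_data = c(1, NA)),
                            error = function(e) "error")

# ... while a concrete subclass works and still inherits from the parent.
x <- new("sketch_continuous", raw_data = c(1, NA, 3))
is(x, "sketch_variable")
```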

Slots

In the following table, indentation indicates inheritance from the class with less indentation, and italics indicates that the class is virtual so no variables can be created with that class. Inherited classes inherit the transformations, families, link functions, and fit_model-methods from their parent class, although these are often superseded by analogues that are tailored for the inherited class. Also note that the default transformation for the continuous class is a standardization using twice the standard deviation of the observed values.

The distinction between the transformation entailed by the family and the transformation entailed by the function in the transformation slot may be confusing at this point. The former pertains to how the linear predictor of a variable is mapped to the space of a variable when it is on the left-hand side of a generalized linear model. The latter pertains, for continuous variables only, to how the values in the raw_data slot are mapped into those in the data slot and thus affects how a continuous variable enters into the model whether it is on the left-hand or right-hand side. The classes are discussed in much more detail below.

Class name [transformation]        Default family and link        Default fit_model
missing_variable                   none                           throws error
categorical                        none                           throws error
unordered-categorical              binomial(link = 'logit')       multinom
ordered-categorical                binomial(link = 'logit')       bayespolr
binary                             binomial(link = 'logit')       bayesglm
interval                           gaussian(link = 'identity')    survreg
continuous[standardize]            gaussian(link = 'identity')    bayesglm
semi-continuous[identity]
nonnegative-continuous[logshift]
SC_proportion[squeeze]             binomial(link = 'logit')       betareg
positive-continuous[log]
proportion[identity]               binomial(link = 'logit')       betareg
bounded-continuous[identity]
count                              quasipoisson(link = 'log')     bayesglm
irrelevant                                                        throws error
fixed                                                             throws error
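The default standardization for the continuous class, noted above as using twice the standard deviation of the observed values, can be sketched in base R. The function standardize_sketch is hypothetical and is not the function stored in the transformation slot. Centering on the observed mean is an assumption made here for illustration.

```r
# Assumption: center on the observed mean, then divide by 2 * observed SD.
standardize_sketch <- function(x) {
  (x - mean(x, na.rm = TRUE)) / (2 * sd(x, na.rm = TRUE))
}

income_raw <- c(12, 30, NA, 45, 22, NA, 51)  # made-up values
income_std <- standardize_sketch(income_raw)

sd(income_std, na.rm = TRUE)  # 0.5 by construction
is.na(income_std)             # NAs stay NA until they are imputed
```

Dividing by two standard deviations rather than one gives the transformed variable a standard deviation of 0.5.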

The missing_variable class is virtual and has the following slots (this information is primarily directed at developeRs):

variable_name:

Object of class character of length one naming the variable

raw_data:

Object of class "ANY" representing the observations on a variable, some of which may be NA. No method should ever change this slot at all. Instead, methods should change the data slot.

data:

Object of class "ANY", which is initially a copy of the raw_data slot (transformed by the function in the transformation slot for continuous variables only) and whose NA values are replaced during the multiple imputation process. See mi.

n_total:

Object of class "integer" which is the length of the data slot

all_obs:

Object of class "logical" of length one indicating whether all values of the data slot are observed and thus not NA

n_obs:

Object of class "integer" of length one indicating the number of values of the data slot that are observed and thus not NA

which_obs:

Object of class "integer", which is a vector indicating the positions of the observed values in the data slot

all_miss:

Object of class "logical" of length one indicating whether all values of the data slot are NA

n_miss:

Object of class "integer" of length one indicating the number of values of the data slot that are NA

which_miss:

Object of class "integer", which is a vector indicating the positions of the missing values in the data slot

n_extra:

Object of class "integer" of length one indicating how many (missing) observations have been added to the end of the data slot that are not included in the raw_data slot. Although the extra values will be imputed, they are not considered to be “missing” for the purposes of defining the previous three slots

which_extra:

Object of class "integer", which is a vector indicating the positions of the extra values at the end of the data slot

n_unpossible:

Object of class "integer" of length one indicating the number of values that are logically or structurally unobservable

which_unpossible:

Object of class "integer" indicating the positions of the unpossible values in the data slot

n_drawn:

Object of class "integer" of length one which is the sum of the n_miss and n_extra slots

which_drawn:

Object of class "integer" which is a vector concatenating the which_miss and which_extra slots

imputation_method:

Object of class "character" of length one indicating how the NA values are to be imputed. Possibilities include “ppd” for imputation from the posterior predictive distribution, “pmm” for imputation via predictive mean matching, “mean” for mean-imputation, “median” for median-imputation, “expectation” for conditional mean-imputation. With enough programming effort, other kinds of imputation can be defined and specified here.

family:

Object of class "WeAreFamily" that will typically be passed to glm and similar functions during the multiple imputation process

known_families:

Object of class character indicating the families that are known to be supported for a class; see family

known_links:

Object of class character indicating what link functions are known to be supported by the elements of the known_families slot; see family

imputations:

Object of class "MatrixTypeThing" with rows equal to the number of iterations (initially zero) of the multiple imputation algorithm and columns equal to the n_drawn slot. The rows are appropriately extended and then filled by the mi function

done:

Object of class "logical" of length one indicating whether the NA values in the data slot have been replaced by imputed values

parameters:

Object of class "MatrixTypeThing" with rows equal to the number of iterations (initially zero) of the multiple imputation algorithm and columns equal to the number of estimated parameters when modeling the data slot. The rows are appropriately extended and then filled by the mi function

model:

Object of class "ANY" which can be filled by an object that is output by one of the fit_model-methods, which is done by default by mi when all the iterations have completed

fitted:

Object of class "ANY" although typically a vector or matrix that contains the fitted values of the model in the slot immediately above. Note that the fitted slot is filled by default by mi, although the model slot is left empty by default to save RAM.

estimator:

Object of class "character" of length one indicating which pre-existing fit_model to use for an unordered-categorical variable. Options are "mnl", in which multinom from the nnet package is used to fit the values of the unordered categorical variable; and "rnl", in which each category is separately modeled as the positive binary outcome against all other categories using a bayesglm fit_model and the probabilities of each category are normalized to sum to 1 after each model is run. In general, "rnl" is slightly less accurate than "mnl", but runs much more quickly especially when the unordered categorical variable has many unique categories.
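To make the counting and indexing slots above more concrete, the following base R sketch derives the analogous quantities from a plain vector. It does not construct a missing_variable. The vector y and the list bookkeeping are illustrative names, and the sketch assumes that no extra observations have been appended.

```r
y <- c(3.2, NA, 5.1, NA, 4.4)  # made-up raw data

which_obs  <- which(!is.na(y))
which_miss <- which(is.na(y))
bookkeeping <- list(
  n_total  = length(y),
  n_obs    = length(which_obs),
  n_miss   = length(which_miss),
  all_obs  = length(which_miss) == 0L,
  all_miss = length(which_obs) == 0L
)

# With no extra observations appended, n_drawn equals n_miss, and the
# imputations matrix starts with zero rows and one column per drawn value.
n_extra <- 0L
n_drawn <- bookkeeping$n_miss + n_extra
imputations <- matrix(numeric(0), nrow = 0, ncol = n_drawn)
dim(imputations)  # 0 2
```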

The WeAreFamily class is a class union of character and family, while the MatrixTypeThing class is a class union of matrix only at the moment.
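A class union of this kind can be sketched with the methods package. The name FamilyOrName is hypothetical, chosen to avoid clashing with the class that mi defines, and the code below is not taken from the package source.

```r
library(methods)

setOldClass("family")  # register the S3 class returned by binomial(), gaussian(), ...
# Hypothetical name, chosen to avoid clashing with the class defined in mi.
setClassUnion("FamilyOrName", c("character", "family"))

is(binomial(link = "logit"), "FamilyOrName")  # TRUE: a family object
is("gaussian", "FamilyOrName")                # TRUE: a character string
is(1L, "FamilyOrName")                        # FALSE: neither member
```

A slot typed with such a union accepts either a family object or the name of a family as a character string.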

Author(s)

Ben Goodrich and Jonathan Kropko, for this version, based on earlier versions written by Yu-Sung Su, Masanao Yajima, Maria Grazia Pittau, Jennifer Hill, and Andrew Gelman.

See Also

Examples

# STEP 0: GET DATA
data(nlsyV, package = "mi")

# STEP 0.5 CREATE A missing_variable (you never need to actually do this)
income <- missing_variable(nlsyV$income, type = "continuous")
show(income)

# STEP 1: CONVERT IT TO A missing_data.frame
mdf <- missing_data.frame(nlsyV) # this calls missing_variable() internally
show(mdf)
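When the guessed class of a column is not what is wanted, the change function mentioned in the Arguments section can be applied to the missing_data.frame. The following is a hedged sketch of that step. It assumes the mi package is installed, and the choice of nonnegative-continuous for income is only an illustration, not a recommendation from this help page.

```r
# Assumes the mi package is installed; nothing runs otherwise.
if (requireNamespace("mi", quietly = TRUE)) {
  library(mi)
  data(nlsyV, package = "mi")
  mdf <- missing_data.frame(nlsyV)

  # Override the guessed class of one variable, then inspect the result.
  mdf <- change(mdf, y = "income", what = "type", to = "nonnegative-continuous")
  show(mdf)
}
```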

mi

Missing Data Imputation and Model Checking

v1.0
GPL (>= 2)
Authors
Andrew Gelman [ctb], Jennifer Hill [ctb], Yu-Sung Su [aut], Masanao Yajima [ctb], Maria Pittau [ctb], Ben Goodrich [cre, aut], Yajuan Si [ctb], Jon Kropko [aut]
Initial release
2015-04-16
