syn.catall

Synthesis of a group of categorical variables from a saturated model


Description

A saturated model is fitted to a table produced by cross-tabulating all the variables.

Usage

syn.catall(x, k, proper = FALSE, priorn = 1, structzero = NULL, 
           maxtable = 1e8, ...)

Arguments

x

a data frame (n x p) of the set of original variables.

k

the number of rows in each synthetic data set; defaults to n.

proper

if proper = TRUE, x is replaced with a bootstrap sample before synthesis, thus effectively sampling from the posterior distribution of the model, given the data.

priorn

the sum of the parameters of the Dirichlet prior, which can be thought of as a pseudo-count giving the number of observations that inform prior knowledge about the parameters.

structzero

a named list of lists that defines which cells in the table are structural zeros and will remain as zeros in the synthetic data, by leaving their prior as zero. Each element of the structzero list is a list that describes a set of cells in the table defined by a combination of two or more variables. The name of each such element must consist of those variable names separated by an underscore, e.g. sex_edu. The length of each such element is determined by the number of variables, and each component gives the variable levels (numeric or labels) that define the structural zero cells (see the example below).

maxtable

the number of cells in the cross-tabulation of all the variables that will trigger a severe warning.

...

additional parameters.
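
The size of the cross-tabulation that maxtable refers to grows multiplicatively with the number of levels of each variable. The following base R sketch, using a hypothetical toy data frame that is not part of the package, shows one way to check the number of cells before synthesis. The final comparison with `>` is an assumption for illustration; the exact condition used inside syn.catall is not stated on this page.

```r
# Hypothetical toy data frame of factors (not from the package).
toy <- data.frame(
  sex    = factor(c("F", "M", "F")),
  region = factor(c("N", "S", "E"), levels = c("N", "S", "E", "W")),
  edu    = factor(c("low", "high", "low"), levels = c("low", "mid", "high"))
)

# Number of cells in the full cross-tabulation: product of the level counts.
n_cells <- prod(vapply(toy, nlevels, integer(1)))

# This equals the length of the table obtained by cross-tabulating all variables.
stopifnot(n_cells == length(table(toy)))

maxtable <- 1e8       # default value from the Usage section
n_cells > maxtable    # FALSE here: 2 * 4 * 3 = 24 cells (comparison is assumed)
```

Unused factor levels still contribute cells, so dropping them (for example with droplevels) may reduce the table size considerably.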

Details

When used in the syn function, the group of categorical variables with method = "catall" must all be together at the start of the visit.sequence. Subsequent variables in visit.sequence are then synthesised conditional on the synthesised values of the grouped variables.

A saturated model is fitted to a table produced by cross-tabulating all the variables. Prior probabilities for the proportions in each cell of the table are specified from the parameters of a Dirichlet distribution with the same parameter for every cell in the table that is not a structural zero (see above). The sum of these parameters is priorn, so that each one is priorn/N, where N is the number of cells in the table that are not structural zeros. The default priorn = 1 can be thought of as equivalent to the knowledge that 1 observation would be equally likely to be in any cell that is not a structural zero.

The posterior expectation, given the observed counts, for the probability of being in a cell with observed count n_i is thus (n_i + priorn/N) / (n + priorn), where n is the total number of observations (the sum of the n_i). The synthetic data are generated from a multinomial distribution with parameters given by these probabilities.
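
The calculation described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the toy data, the choice of structural zero cell, and all object names (toy, is_zero, alpha, post_prob, syn_counts) are hypothetical.

```r
set.seed(1)

# Toy data: two categorical variables (hypothetical, not from SD2011).
toy <- data.frame(
  sex = factor(c("F", "F", "M", "M", "F", "M")),
  edu = factor(c("low", "high", "low", "low", "high", "low"),
               levels = c("low", "mid", "high"))
)

counts <- table(toy)   # cross-tabulation of all the variables
priorn <- 1            # sum of the Dirichlet prior parameters

# Mark one cell as a structural zero (assumed for illustration only).
is_zero <- array(FALSE, dim = dim(counts), dimnames = dimnames(counts))
is_zero["M", "high"] <- TRUE

N_cells <- sum(!is_zero)   # cells that are not structural zeros
n_obs   <- sum(counts)     # total number of observations

# Prior parameter per cell: priorn / N_cells, or 0 for structural zeros.
alpha <- ifelse(is_zero, 0, priorn / N_cells)

# Posterior expectation of each cell probability.
post_prob <- (counts + alpha) / (n_obs + priorn)

# Draw k synthetic records as multinomial counts over the cells.
k <- 10
syn_counts <- rmultinom(1, size = k, prob = as.vector(post_prob))
syn_counts <- array(syn_counts, dim = dim(counts), dimnames = dimnames(counts))
syn_counts
```

In this sketch the unobserved cell (F, mid) receives a small positive probability from the prior, so it can appear in the synthetic counts, whereas the structural zero cell (M, high) has probability zero and never does.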

Unlike syn.satcat, which fits saturated conditional models, syn.catall can produce synthesised data containing any combination of variable levels, except the combinations defined as structural zeros in structzero.

NOTE that when the function is called by setting elements of method in syn() to "catall", the parameters priorn, structzero and maxtable must be supplied to syn() with the prefix catall., e.g. catall.priorn.

Value

A list with two components:

res

a data frame of dimension k x p containing the synthesised data.

fit

the cross-tabulation of all the original variables used.

Examples

library(synthpop)

# Four categorical variables first, followed by three further variables.
ods <- SD2011[, c(1, 4, 5, 6, 2, 10, 11)]
table(ods[, c("placesize", "region")])

# Each placesize_region sublist:
# for each relevant level of placesize defined in the first element,
# the second element defines the regions (variable region) that do not
# have places of that size.

struct.zero <- list(
  placesize_region = list("URBAN 500,000 AND OVER", c(2, 4, 5, 8:13, 16)),
  placesize_region = list("URBAN 200,000-500,000", c(3, 4, 10:11, 13)),
  placesize_region = list("URBAN 20,000-100,000", c(1, 3, 5, 6, 8, 9, 14:15)))

syncatall <- syn(ods, method = c(rep("catall", 4), "ctree", "normrank", "ctree"),
                 catall.priorn = 2, catall.structzero = struct.zero)

synthpop

Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control

v1.6-0
GPL-2 | GPL-3
Authors
Beata Nowok [aut, cre], Gillian M Raab [aut], Chris Dibben [ctb], Joshua Snoke [ctb], Caspar van Lissa [ctb]
Initial release
2020-09-03
