synthpop: utility.tab – R documentation


utility.tab

Tabular utility

Description

Produces tables from observed and synthesised data and calculates utility measures to compare them with their expectation if the synthesising model is correct.

Usage

utility.tab(object, data, vars = NULL, ngroups = 5, useNA = TRUE,
            print.tables = length(vars) < 4, print.stats = 'VW',
            print.zdiff = FALSE, digits = 2, ...) 

## S3 method for class 'utility.tab'
print(x, print.tables = x$print.tables, 
  print.zdiff = x$print.zdiff, print.stats = x$print.stats, 
  digits = x$digits, ...)

Arguments

`object`	an object of class `synds`, which stands for 'synthesised data set'. It is typically created by function `syn()` or `syn.strata()`. It includes `object$m`, the number of synthesised data sets, and `object$syn`, which is the synthesised data set if `m = 1`, or a list of `m` such data sets otherwise.
`data`	the original (observed) data set.
`vars`	a single string or a vector of strings with the names of variables to be used to form the table.
`ngroups`	if numerical (non-factor) variables are included, they will be classified into this number of groups to form tables. Classification is performed using the `classIntervals()` function with `n = ngroups`. By default, to avoid problems for variables with a small number of unique values, `style = "fisher"`. Other arguments of `classIntervals()` may, however, be specified in the call to `utility.tab()`.
`useNA`	determines if NA values are to be included in tables.
`print.tables`	a logical value that determines if tables of observed and synthesised data are to be printed.
`print.stats`	determines which chi-squared statistics to print to compare the observed and synthetic tables: 'VW' for Voas Williamson, 'FT' for Freeman Tukey or c('VW','FT') for both.
`print.zdiff`	a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.
`digits`	an integer indicating the number of decimal places for printing statistics, `tab.zdiff` and mean results for `m > 1`.
`...`	additional parameters; can be passed to the `classIntervals()` function.
`x`	an object of class `utility.tab`.
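
The grouping of numeric variables described under `ngroups` can be illustrated outside `utility.tab()`. The following is a minimal sketch, assuming the `classInt` package (which provides `classIntervals()`) is installed. The variable `x` is made-up data, and the use of `cut()` to apply the breaks is an illustration only, not necessarily how `utility.tab()` does it internally.

```r
library(classInt)  # assumed to be installed; provides classIntervals()

set.seed(1)
x <- round(rnorm(200, mean = 45, sd = 15))  # made-up numeric variable

# Find break points for 5 groups using the "fisher" style
ci <- classIntervals(x, n = 5, style = "fisher")

# Turn the numeric values into a factor using those break points
groups <- cut(x, breaks = ci$brks, include.lowest = TRUE)
table(groups, useNA = "ifany")
```

Passing a different `style` (for example `style = "pretty"`) in the call to `utility.tab()` changes how the break points are chosen.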

Details

Forms tables of observed and synthesised values for the variables specified in `vars`. Two utility measures are calculated from the cells of the tables: a measure of fit proposed by Voas and Williamson, sum((observed - synthesised)^2 / ((observed + synthesised)/2)), and one proposed by Freeman and Tukey, 4 * sum((observed^(0.5) - synthesised^(0.5))^2). In both cases, cells where observed and synthesised are both zero do not contribute to the sum. If the synthesising model is correct, both of these measures should have chi-square distributions for large samples.
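
The two measures can be reproduced by hand for a pair of tables. The following is a minimal base R sketch, not the package's own code: `tab_utility()` is a hypothetical helper, the counts are made up, and the observed and synthetic data are assumed to have the same number of rows.

```r
# Hypothetical helper (not part of synthpop): computes both statistics
# from two tables (or vectors) of counts with matching cells.
tab_utility <- function(obs, syn) {
  obs <- as.numeric(obs)
  syn <- as.numeric(syn)

  # Cells that are zero in both tables do not contribute
  keep <- (obs + syn) > 0
  obs <- obs[keep]
  syn <- syn[keep]

  vw <- sum((obs - syn)^2 / ((obs + syn) / 2))  # Voas Williamson
  ft <- 4 * sum((sqrt(obs) - sqrt(syn))^2)      # Freeman Tukey

  c(VW = vw, FT = ft, cells = sum(keep))
}

obs <- c(10, 20, 0, 30)  # made-up observed counts
syn <- c(12, 18, 0, 30)  # made-up synthesised counts
res <- tab_utility(obs, syn)
print(res)
```

The third cell is zero in both tables, so only three cells contribute to each sum.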

Value

An object of class `utility.tab`, which is a list with the following components:

`m`	number of synthetic data sets in object, i.e. `object$m`.
`tab.obs`	a table from the observed data.
`UtabFT`	a vector with `object$m` values for the Freeman Tukey utility measure.
`UtabVW`	a vector with `object$m` values for the Voas Williamson utility measure.
`df`	a vector of degrees of freedom for the chi-square tests, equal to the number of cells in the table with any observed or synthesised counts minus one.
`ratioFT`	a vector with ratios of `UtabFT` to `df`.
`ratioVW`	a vector with ratios of `UtabVW` to `df`.
`pvalFT`	a vector with `object$m` p-values for the chi-square tests for the Freeman Tukey utility measure.
`pvalVW`	a vector with `object$m` p-values for the chi-square tests for the Voas Williamson utility measure.
`nempty`	a vector of length `object$m` with number of cells not contributing to the statistics.
`tab.syn`	a table or a list of `m` tables from the synthetic data.
`tab.zdiff`	a table or a list of `m` tables of Z statistics for differences between observed and synthesised cells of the tables. Large absolute values indicate a large contribution to lack-of-fit.
`n`	number of observations in the original dataset.
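
The relationship between a statistic, `df`, the ratio and the p-value can be sketched in base R. The numbers below are made up, the degrees of freedom are taken as the number of contributing cells minus one (the usual convention for a chi-square test on a table), and the p-value is assumed to be the upper-tail probability. This illustrates the definitions only and is not the package's internal code.

```r
stat  <- 7.3  # made-up value of a utility statistic (e.g. one element of UtabVW)
cells <- 6    # made-up number of cells with any observed or synthesised counts

df    <- cells - 1                                 # degrees of freedom
ratio <- stat / df                                 # analogous to ratioVW / ratioFT
pval  <- pchisq(stat, df = df, lower.tail = FALSE) # upper-tail chi-square p-value

print(c(df = df, ratio = ratio, pval = pval))
```

If the synthesising model is correct the statistic has expectation close to `df`, so ratios near 1 indicate good agreement between the observed and synthetic tables.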

References

Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi: 10.18637/jss.v074.i11.

Read, T.R.C. and Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.

Voas, D. and Williamson, P. (2001) Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.

Examples

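library(synthpop)

# Observed data: first 1000 rows of SD2011, four variables
ods <- SD2011[1:1000, c("sex", "age", "edu", "marital")]

# Ten synthetic data sets; compare two-way table of marital by sex
s1 <- syn(ods, m = 10)
utility.tab(s1, ods, vars = c("marital", "sex"))

# One synthetic data set; the numeric variable age is grouped into 3 classes
s2 <- syn(ods, m = 1)
utility.tab(s2, ods, vars = c("marital", "age"), ngroups = 3, print.tables = TRUE)

# style is passed on to classIntervals(); print tables and Z scores afterwards
u2 <- utility.tab(s2, ods, vars = c("marital", "age"), style = "pretty")
print(u2, print.tables = TRUE, print.zdiff = TRUE)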

synthpop

Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control

v1.6-0

GPL-2 | GPL-3

Authors

Beata Nowok [aut, cre], Gillian M Raab [aut], Chris Dibben [ctb], Joshua Snoke [ctb], Caspar van Lissa [ctb]

Initial release

2020-09-03