Distributional comparison of synthesised and observed data
Description:

Distributional comparison of a synthesised data set with the original (observed) data set using propensity scores.
Usage:

     utility.gen(object, data, method = "logit", maxorder = 1,
       tree.method = "rpart", resamp.method = NULL, nperms = 50,
       cp = 1e-3, minbucket = 5, mincriterion = 0, vars = NULL,
       aggregate = FALSE, maxit = 200, ngroups = NULL, print.every = 10,
       digits = 2, print.zscores = FALSE, zthresh = 1.6,
       print.ind.results = TRUE, print.variable.importance = FALSE, ...)

     ## S3 method for class 'utility.gen'
     print(x, digits = x$digits, print.zscores = x$print.zscores,
       zthresh = x$zthresh, print.ind.results = x$print.ind.results,
       print.variable.importance = x$print.variable.importance, ...)
Arguments:

object: an object of class synds, as produced by a call to syn(), containing the synthesised data set(s).

data: the original (observed) data set.

method: a single string specifying the method for modeling the propensity scores. Method can be selected from "logit" and "cart".

maxorder: maximum order of interactions to be considered in the "logit" model.

tree.method: implementation of the "cart" method to be used when method = "cart"; can be "rpart" or "ctree".

resamp.method: method used for the resampling estimate of the null distribution of the utility measure.

nperms: number of permutations for the permutation test to obtain the null distribution of the utility measure when method = "cart".

cp: complexity parameter for classification when tree.method = "rpart". Small values grow bigger trees.

minbucket: minimum number of observations allowed in a leaf for classification when method = "cart".

mincriterion: criterion between 0 and 1 used to control the growth of the tree when tree.method = "ctree".

vars: variables to be included in the utility comparison. If none are specified, all the variables in the synthesised data will be included.

aggregate: logical flag as to whether the data should be aggregated by collapsing identical rows before computation. This can lead to much faster computation when all the variables are categorical. Only works for method = "logit".

maxit: maximum number of iterations to use when method = "logit".

ngroups: target number of groups for categorisation of each numeric variable: the final number may differ if there are many repeated values. If NULL (the default), numeric variables are not categorised.

print.every: controls the printing of the progress of resampling when resamp.method is not NULL.

...: additional parameters passed to the function used to fit the propensity score model.

x: an object of class utility.gen to be printed.

digits: number of digits to print in the default output, excluding the pMSE value(s).

print.zscores: logical value as to whether z-scores for the coefficients of the logit model should be printed.

zthresh: threshold value used to suppress the printing of z-scores: scores smaller in absolute value than zthresh are not printed.

print.ind.results: logical value as to whether utility score results from individual syntheses should be printed.

print.variable.importance: logical value as to whether the variable importance measure should be printed when tree.method = "rpart".
Details:

This function follows the method for evaluating the utility of masked data given in Snoke et al. (2018) and originally proposed by Woo et al. (2009). The original and synthetic data are combined into one data set and propensity scores, as detailed in Rosenbaum and Rubin (1983), are calculated to estimate the probability of membership in the synthetic data set. The utility measure is based on the mean squared difference between these probabilities and the probability expected if the data did not distinguish the synthetic data from the original. The expected probability is just the proportion of synthetic data in the combined data set: 0.5 when the original and synthetic data have the same number of records.

Propensity scores can be modeled by logistic regression (method = "logit") or by two different implementations of classification and regression trees (method = "cart"). For logistic regression the predictors are all variables in the data and their interactions up to order maxorder; the default of 1 gives all main effects and first-order interactions. For logistic regression the null distribution of the propensity score is derived and is used to calculate ratios and standardised values. For method = "cart" the expectation and variance of the null distribution are calculated from a permutation test.

If missing values exist, indicator variables are added and included in the model as recommended by Rosenbaum and Rubin (1984). For categorical variables, NA is treated as a new category.
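The pMSE arithmetic just described is simple enough to spell out. The fragment below is an illustrative sketch in Python, not synthpop code: it assumes the propensity scores have already been estimated by some model, and merely computes the mean squared difference from c, the proportion of synthetic records in the combined data.

```python
def pmse(scores, n_syn, n_total):
    """Mean squared difference of propensity scores from c = n_syn / n_total."""
    c = n_syn / n_total
    return sum((p - c) ** 2 for p in scores) / len(scores)

# A model that cannot tell the two data sets apart predicts p = c for every
# record, giving pMSE = 0; scores pushed towards 0 or 1 increase pMSE.
flat = [0.5] * 10                                 # n_orig = n_syn, so c = 0.5
print(pmse(flat, n_syn=5, n_total=10))            # 0.0
separated = [1.0] * 5 + [0.0] * 5                 # perfectly distinguishable
print(pmse(separated, n_syn=5, n_total=10))       # 0.25
```

The second call shows the worst case for equal-sized data sets: every squared difference is (1 - 0.5)^2 = 0.25, so the mean is 0.25.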
Value:

An object of class utility.gen, which is a list including the utility measures and their expected null values for each synthetic set, with the following components:

call: the call that produced the result.

m: number of synthetic data sets in object.

method: method used to fit the propensity score.

tree.method: cart function used to fit the propensity score when method = "cart".

pMSE: propensity score mean squared error from the utility model, or a vector of these values if m > 1.

utilVal: utility value(s), calculated from the pMSE.

utilExp: expected value(s) of the utility score if the synthesis method is correct.

utilR: ratio(s) of the utility value to its expected value.

utilStd: utility value(s) standardised by expressing them as z-scores: difference(s) from the expected value divided by the expected standard deviation.

pval: p-value(s) for the chi-square test(s) for the utility value(s).

fit: the fitted model for the propensity score, or a list of fitted models of length m if m > 1.
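As an illustration of how the utilR and utilStd components relate to the quantities above, the following Python sketch (hypothetical numbers, not synthpop code) forms a ratio and a z-score from an observed utility value together with the null expectation and standard deviation obtained by the methods described in the Details:

```python
def util_ratio(observed, expected):
    """Ratio of observed utility to its expected null value (cf. utilR)."""
    return observed / expected

def util_std(observed, expected, null_sd):
    """Observed utility as a z-score against the null (cf. utilStd)."""
    return (observed - expected) / null_sd

# Hypothetical numbers: a utility value twice its null expectation, sitting
# two null standard deviations above it.
print(util_ratio(0.02, 0.01))        # 2.0
print(util_std(0.02, 0.01, 0.005))   # 2.0
```

A ratio near 1 and a z-score near 0 indicate that the synthesis model is compatible with the original data under this measure.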
References:

Rosenbaum, P.R. and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

Rosenbaum, P.R. and Rubin, D.B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387), 516-524.

Snoke, J., Raab, G.M., Nowok, B., Dibben, C. and Slavkovic, A. (2018). General and specific utility measures for synthetic data. Journal of the Royal Statistical Society: Series A, 181, Part 3, 663-688.

Woo, M-J., Reiter, J.P., Oganian, A. and Karr, A.F. (2009). Global measures of data utility for microdata masked for disclosure limitation. Journal of Privacy and Confidentiality, 1(1), 111-124.
Examples:

     ## Not run:
     ods <- SD2011[1:1000, c("age", "bmi", "depress", "alcabuse", "englang")]
     s1 <- syn(ods, m = 5)
     utility.gen(s1, ods)
     u1 <- utility.gen(s1, ods)
     print(u1, print.zscores = TRUE, usethresh = TRUE)
     u2 <- utility.gen(s1, ods, groups = TRUE)
     print(u2, print.zscores = TRUE)
     u3 <- utility.gen(s1, ods, method = "cart", nperms = 20)
     print(u3, print.variable.importance = TRUE)
     ## End(Not run)