compare.fit.synds

Compare model estimates based on synthesised and observed data


Description

The same model that was used for the synthesised data set is fitted to the observed data set. The coefficients with confidence intervals for the observed data are plotted together with their estimates from the synthetic data. When more than one synthetic data set has been generated (object$m > 1), combining rules are applied. Analysis-specific utility measures are used to evaluate differences between synthetic and observed data.

Usage

## S3 method for class 'fit.synds'
compare(object, data, plot = "Z", 
  print.coef = FALSE, return.plot = TRUE, plot.intercept = FALSE, 
  lwd = 1, lty = 1, lcol = c("#1A3C5A","#4187BF"), 
  dodge.height = .5, point.size = 2.5,
  population.inference = FALSE, ci.level = 0.95, ...)

## S3 method for class 'compare.fit.synds'
print(x, print.coef = x$print.coef, ...)

Arguments

object

an object of type fit.synds created by fitting a model to a synthesised data set using function glm.synds or lm.synds.

data

an original observed data set.

plot

values to be plotted: "Z" (Z scores) or "coef" (coefficients).

print.coef

a logical value determining whether tables of estimates for the original and synthetic data should be printed.

return.plot

a logical value indicating whether a confidence interval plot should be returned.

plot.intercept

a logical value indicating whether estimates for intercept should be plotted.

lwd

the line width.

lty

the line type.

lcol

line colours.

dodge.height

size of vertical shifts for confidence intervals to prevent overlapping.

point.size

size of plotting symbols used to plot point estimates of coefficients.

population.inference

a logical value indicating whether intervals for inference to population quantities, as described by Karr et al. (2006), should be calculated and plotted. This option suppresses the lack-of-fit test and the standardised differences since these are based on differences standardised by the original interval widths.

ci.level

Confidence interval coverage as a proportion.

...

additional parameters passed to ggplot.

x

an object of class compare.fit.synds.

Details

This function can be used to evaluate whether the method used for synthesis is appropriate for the fitted model. If this is the case, the estimates from the synthetic data of what would be expected from the original data (xpct(Beta) and xpct(Z)) should not differ from the estimates from the observed data (Beta and Z) by more than would be expected from the standard errors (se(Beta) and se(Z)). For more details see the vignette on inference.
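
The comparison described above can be illustrated with a minimal base R sketch. It does not call synthpop; the numbers and the names beta_obs, se_obs and beta_syn are hypothetical, and the two-sided Normal p-value is an assumption about how "differences as large as those found" is interpreted.

```r
# Hypothetical estimates for three coefficients (illustrative numbers only,
# not output from synthpop).
beta_obs <- c(-0.40, 0.010, 0.25)   # coefficients from the observed data
se_obs   <- c(0.10, 0.004, 0.12)    # their standard errors
beta_syn <- c(-0.31, 0.013, 0.10)   # combined estimates from synthetic data

# Difference standardised by the standard error from the observed-data fit.
std_diff <- (beta_syn - beta_obs) / se_obs

# Two-sided p-value from a standard Normal distribution (assumed two-sided).
p_value <- 2 * pnorm(-abs(std_diff))

# Mean absolute standardised difference over all coefficients.
mean_abs_std_diff <- mean(abs(std_diff))

print(data.frame(std_diff, p_value))
print(mean_abs_std_diff)
```

Standardised differences well inside roughly plus or minus 2 are what would be expected if the synthesis method is compatible with the fitted model.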

Value

An object of class compare.fit.synds which is a list with the following components:

call

the original call to fit the model to the synthesised data set.

coef.obs

a data frame including estimates based on the observed data: coefficients (Beta), their standard errors (se(Beta)) and Z scores (Z).

coef.syn

a data frame including (combined) estimates based on the synthesised data: point estimates of observed data coefficients (B.syn), standard errors of those estimates (se(B.syn)), estimates of the observed standard errors (se(Beta).syn), Z score estimates (Z.syn) and their standard errors (se(Z.syn)). Note that se(B.syn) and se(Z.syn) give the standard errors of the mean of the m syntheses and can be made very small by increasing m (see the vignette on inference for more details).

coef.diff

a data frame containing standardized differences between the coefficients estimated from the original data and those calculated from the combined synthetic data. The difference is standardized by dividing by the estimated standard error of the fit from the original. The corresponding p-values are calculated from a standard Normal distribution and represent the probability of achieving differences as large as those found if the model used for synthesis is compatible with the model that generated the original data.

mean.abs.std.diff

Mean absolute standardized difference (over all coefficients).

ci.overlap

a data frame containing the percentage of overlap between the estimated synthetic confidence intervals and the original sample confidence intervals for each parameter. When population.inference = TRUE overlaps are calculated as suggested by Karr et al. (2006). Otherwise a simpler overlap measure with intervals of equal length is calculated.

mean.ci.overlap

Mean confidence interval overlap (over all coefficients).

lack.of.fit

lack-of-fit measure from all m synthetic data sets combined. When object$incomplete = FALSE it is calculated as follows. The vector of mean differences (diff) between the coefficients calculated from the synthetic and original data provides a standardised lack-of-fit = t(diff) %*% V^(-1) %*% diff, where %*% represents the matrix product and V^(-1) is the inverse of the variance-covariance matrix for the mean coefficients from the original data. If the model used to synthesise the data is correct this quantity, which is a Mahalanobis distance measure, will follow a chi-squared distribution with degrees of freedom, and thus expectation, equal to the number of parameters (p) in the fitted model. When object$incomplete = TRUE the variance-covariance matrix of the coefficients is estimated from the differences between the m estimates; the lack-of-fit statistic then follows a Hotelling's T^2 distribution and is referred to an F(p, m - p) distribution.

lof.pvalue

p-value for the combined lack-of-fit test of the null hypothesis that the method used for synthesis retains all relationships between variables that influence the parameters of the fit.

ci.plot

ggplot of the coefficients with confidence intervals for models based on observed and synthetic data. If return.plot was set to FALSE then ci.plot is NULL.

print.coef

a logical value determining whether tables of estimates for the original and synthetic data should be printed.

m

the number of synthetic versions of the original (observed) data.

ncoef

the number of coefficients in the fitted model (including an intercept).

incomplete

whether methods for incomplete synthesis due to Reiter (2003) have been used in calculations.

population.inference

whether intervals as described by Karr et al. (2006) have been calculated.
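
The lack.of.fit and ci.overlap components listed above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: the inputs coef_diff, se_obs and R are made-up numbers, ci_overlap is a hypothetical helper, and the overlap formula is the average-of-two-ratios measure usually attributed to Karr et al. (2006). Only the complete-synthesis case (chi-squared reference distribution) is shown.

```r
# Hypothetical mean differences between synthetic and observed coefficients.
coef_diff <- c(0.09, 0.003, -0.15)

# Hypothetical variance-covariance matrix of the observed-data coefficients,
# built from standard errors and an assumed correlation matrix.
se_obs <- c(0.10, 0.004, 0.12)
R <- matrix(c(1.0, 0.2, 0.1,
              0.2, 1.0, 0.3,
              0.1, 0.3, 1.0), nrow = 3)
V <- diag(se_obs) %*% R %*% diag(se_obs)

# Quadratic form t(diff) %*% V^(-1) %*% diff (a squared Mahalanobis distance).
lack_of_fit <- as.numeric(t(coef_diff) %*% solve(V) %*% coef_diff)

# Upper-tail chi-squared p-value with p = number of coefficients.
p <- length(coef_diff)
lof_pvalue <- pchisq(lack_of_fit, df = p, lower.tail = FALSE)

# Hypothetical helper: interval overlap averaged over both interval lengths.
# The value is 1 for identical intervals and falls below 0 for disjoint ones.
ci_overlap <- function(lo_obs, up_obs, lo_syn, up_syn) {
  lo <- pmax(lo_obs, lo_syn)   # lower bound of the intersection
  up <- pmin(up_obs, up_syn)   # upper bound of the intersection
  0.5 * ((up - lo) / (up_obs - lo_obs) + (up - lo) / (up_syn - lo_syn))
}

print(c(lack_of_fit = lack_of_fit, lof_pvalue = lof_pvalue))
print(ci_overlap(0, 1, 0.5, 1.5))
```

A lack-of-fit value close to p, and overlaps close to 1, are what would be expected when the synthesis preserves the relationships used by the fitted model.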

References

Karr, A., Kohnen, C.N., Oganian, A., Reiter, J.P. and Sanil, A.P. (2006). A framework for evaluating the utility of data altered to protect confidentiality. The American Statistician, 60(3), 224-232.

Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi: 10.18637/jss.v074.i11.

Reiter, J.P. (2003). Inference for partially synthetic, public use microdata sets. Survey Methodology, 29, 181-188.

Examples

ods <- SD2011[,c("sex","age","edu","smoke")]
s1 <- syn(ods, m = 3)
f1 <- glm.synds(smoke ~ sex + age + edu, data = s1, family = "binomial")
compare(f1, ods) 
compare(f1, ods, print.coef = TRUE, plot = "coef")
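
As a further sketch, the components documented under Value might be inspected as follows. It assumes synthpop is installed and is skipped otherwise; the object name cmp and the seed value are illustrative choices, and the results depend on the random synthesis.

```r
cmp <- NULL  # stays NULL when synthpop is not available

if (requireNamespace("synthpop", quietly = TRUE)) {
  library(synthpop)

  ods <- SD2011[, c("sex", "age", "edu", "smoke")]
  # A fixed seed makes the synthesis reproducible; print.flag = FALSE
  # silences the progress output.
  s1 <- syn(ods, m = 3, seed = 2020, print.flag = FALSE)
  f1 <- glm.synds(smoke ~ sex + age + edu, data = s1, family = "binomial")

  # Keep the comparison object without building the plot.
  cmp <- compare(f1, ods, return.plot = FALSE)

  print(cmp$coef.diff)        # standardised differences and p-values
  print(cmp$mean.ci.overlap)  # mean confidence interval overlap
  print(cmp$lack.of.fit)      # combined lack-of-fit statistic
  print(cmp$lof.pvalue)       # its p-value
}
```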

synthpop

Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control

v1.6-0
GPL-2 | GPL-3
Authors
Beata Nowok [aut, cre], Gillian M Raab [aut], Chris Dibben [ctb], Joshua Snoke [ctb], Caspar van Lissa [ctb]
Initial release
2020-09-03