Generating synthetic data sets
Generates synthetic version(s) of a data set. Function syn.strata() performs stratified synthesis.
syn(data, method = "cart", visit.sequence = (1:ncol(data)),
    predictor.matrix = NULL, m = 1, k = nrow(data), proper = FALSE,
    minnumlevels = -1, maxfaclevels = 60, rules = NULL, rvalues = NULL,
    cont.na = NULL, semicont = NULL, smoothing = NULL, event = NULL,
    denom = NULL, drop.not.used = FALSE, drop.pred.only = FALSE,
    default.method = c("normrank", "logreg", "polyreg", "polr"),
    numtocat = NULL, catgroups = rep(5, length(numtocat)),
    models = FALSE, print.flag = TRUE, seed = "sample", ...)

syn.strata(data, strata = NULL,
    minstratumsize = 10 + 10 * length(visit.sequence),
    tab.strataobs = TRUE, tab.stratasyn = FALSE,
    method = "cart", visit.sequence = (1:ncol(data)),
    predictor.matrix = NULL, m = 1, k = nrow(data), proper = FALSE,
    minnumlevels = -1, maxfaclevels = 60, rules = NULL, rvalues = NULL,
    cont.na = NULL, semicont = NULL, smoothing = NULL, event = NULL,
    denom = NULL, drop.not.used = FALSE, drop.pred.only = FALSE,
    default.method = c("normrank", "logreg", "polyreg", "polr"),
    numtocat = NULL, catgroups = rep(5, length(numtocat)),
    models = FALSE, print.flag = TRUE, seed = "sample", ...)

## S3 method for class 'synds'
print(x, ...)
data |
a data frame or a matrix ( |
method |
a single string or a vector of strings of length
|
visit.sequence |
a character vector of names of variables or an integer
vector of their column indices specifying the order of synthesis.
The default sequence |
predictor.matrix |
a square matrix of size |
m |
number of synthetic copies of the original (observed) data to be
generated. The default is |
k |
a size of the synthetic data set ( |
proper |
a logical value with default set to |
minnumlevels |
a minimum number of values a numeric variable should
have to be treated as numeric. Numeric variables with fewer levels
than |
maxfaclevels |
a maximum number of factor levels that can be handled. It can be increased but it may cause computational problems, especially for parametric methods. |
rules |
a named list of rules for restricted values. Restricted values are those that are determined explicitly by values of other variables. The names of the list elements must correspond to the variables names for which the rules need to be specified. |
rvalues |
a named list of the values corresponding to the rules
specified by |
cont.na |
a named list of codes for missing values for continuous
variables if different from the |
semicont |
a named list of values at which semi-continuous variables have spikes. The names of the list elements must correspond to the names of the semi-continuous variables. |
smoothing |
a named list specifying smoothing method ( |
event |
a named list specifying for survival data the names of corresponding event indicators. The names of the list elements must correspond to the names of the survival variables. |
denom |
a named list specifying for variables to be modelled using binomial regression the names of corresponding denominator variables. The names of the list elements must correspond to the names of the variables to be modelled using binomial regression. |
drop.not.used |
a logical value. If |
drop.pred.only |
a logical value. If |
default.method |
a vector of four strings containing the default
parametric synthesising methods for numerical variables, factors
with two levels, unordered factors with more than two levels
and ordered factors with more than two levels respectively.
They are used when |
numtocat |
a vector of numbers or names to indicate columns of |
catgroups |
An integer or a vector of integers of the same length as
|
models |
if |
print.flag |
if |
seed |
an integer to be used as an argument for the |
... |
additional arguments to be passed to synthesising functions. See section 'Details' below for more information. |
strata |
a numeric vector with strata identifiers or a string vector with names of stratifying variable(s). |
minstratumsize |
minimum size of each stratum. |
tab.strataobs |
a logical value indicating whether a frequency table of the number of observations in strata in the original data set should be printed. |
tab.stratasyn |
a logical value indicating whether a frequency table of the number of observations in strata in the synthetic data set(s) should be printed. |
x |
an object of class |
Only variables that are in visit.sequence with a corresponding non-empty method are synthesised. The only exceptions are event indicators: they are synthesised along with the corresponding time-to-event variables and should not be included in visit.sequence. All other variables (not in visit.sequence, or in visit.sequence with a corresponding blank method) can be used as predictors. Including them in visit.sequence generates a default predictor.matrix reflecting the order of variables in visit.sequence; otherwise predictor.matrix has to be adjusted accordingly. All predictors of the variables that are not in visit.sequence, or are in visit.sequence but with a blank method, are removed from predictor.matrix.
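The role of blank methods can be illustrated with a minimal sketch. It assumes that the synthpop package and its SD2011 example data are available; the choice of variables, the subset size and the seed are arbitrary and for illustration only.

```r
library(synthpop)

# Small subset of the SD2011 example data (variable choice is illustrative)
ods <- SD2011[1:500, c("sex", "age", "marital", "income")]

# Blank methods for sex and marital: both stay in visit.sequence, so they
# remain available as predictors, but they are not synthesised themselves
s <- syn(ods, method = c("", "cart", "", "cart"),
         seed = 123, print.flag = FALSE)

s$method            # methods actually used; blank for sex and marital
s$predictor.matrix  # predictors used for each variable
```

Inspecting predictor.matrix after a run like this is a quick way to confirm which variables acted as predictors before adjusting the matrix by hand.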
Variables to be synthesised that are not synthesised yet cannot be used as predictors. Also all variables used in passive synthesis or in restricted values rules (rules) have to be synthesised before the variables they apply to.
A mismatch between data type and synthesising method stops execution and prints an error message. However, numeric variables with a number of levels less than minnumlevels are changed into factors and their methods are changed automatically, if necessary, to methods for categorical variables. Methods for variables not in the visit sequence are changed to blank.
The built-in elementary synthesising methods defined by conditional distributions include:
classification and regression trees (CART), see syn.cart
methods using ensembles of CART trees, see syn.bag, syn.rf, and syn.ranger
classification and regression trees (CART) for duration time data (parametric methods for survival data are not implemented yet), see syn.survctree
normal linear regression, see syn.norm
normal linear regression preserving the marginal distribution, see syn.normrank
normal linear regression after natural logarithmic, square root and cube root transformation of a dependent variable respectively, see syn.lognorm
logistic regression, see syn.logreg
unordered polytomous regression, see syn.polyreg
ordered polytomous regression, see syn.polr
predictive mean matching, see syn.pmm
random sample from the observed data, see syn.sample
function of other synthesised data, see syn.passive
bootstrap sample within each category of the original grouping variable, see syn.nested
bootstrap sample within each category of the crosstabulation of all the predictor variables, see syn.satcat
The following methods synthesise a group of variables together. These variables must always be placed together at the start of the visit sequence:
fit a saturated log-linear model, see syn.catall
fit a log-linear model, defined by its margins, by iterative proportional fitting, see syn.ipf
The functions corresponding to these methods are called syn.method, where method is a string with the name of a synthesising method. For instance, the function corresponding to the ctree method is called syn.ctree. A new synthesising method can be introduced by writing a function named syn.newmethod and then specifying the method parameter of the syn() function as "newmethod".
In order to use "nested" sampling, the method parameter of the syn() function has to be specified as "nested.varname", where "varname" is the name of the grouped (less detailed) variable, the only one used in nested synthesis. A variable synthesised using the "nested" method is excluded from synthesising other variables, except when it is used for the "nested" method.
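As a hedged sketch of the "nested.varname" convention, the example below derives a grouped variable agegr from age and then synthesises age nested within it. The grouping variable, the break points and the seed are assumptions made for illustration; they do not come from the package documentation.

```r
library(synthpop)

ods  <- SD2011[1:500, c("sex", "age")]
brks <- c(0, 30, 50, 70, Inf)               # illustrative age bands
ods$agegr <- cut(ods$age, breaks = brks)    # grouped (less detailed) age

# Columns are sex, age, agegr. The grouped variable agegr has to be
# synthesised before age, hence the visit sequence 1, 3, 2.
s <- syn(ods, visit.sequence = c(1, 3, 2),
         method = c("sample", "nested.agegr", "cart"),
         seed = 123, print.flag = FALSE)

# Cross-tabulate the synthetic groups against groups recomputed from
# the synthetic ages to check that the nesting is respected
table(s$syn$agegr, cut(s$syn$age, breaks = brks))
```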
Additional parameters can be passed to synthesising methods as part of the dots argument. They have to be named using period-separated method and parameter name (method.parameter). For instance, in order to set a minbucket (minimum number of observations in any terminal node of a CART model) for a ctree synthesising method, ctree.minbucket has to be specified. The parameters are method-specific and will be used for all variables to be synthesised using that method. See help for syn.method for further details about the allowed parameters for a specific method.
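The method.parameter naming can be sketched as follows, assuming the synthpop package and its SD2011 example data; the value 20, the variables and the seed are arbitrary illustrative choices.

```r
library(synthpop)

ods <- SD2011[1:500, c("sex", "age", "marital")]

# ctree.minbucket is forwarded through ... and applies to every variable
# synthesised with the ctree method
s <- syn(ods, method = "ctree", ctree.minbucket = 20,
         seed = 123, print.flag = FALSE)

summary(s)
```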
The summary function (summary.synds) can be used to obtain a summary of the synthesised variables.
An object of class synds, which stands for 'synthesised data set'. It is a list with the following components:
call |
an original call to |
m |
number of synthetic versions of the original (observed) data. |
syn |
a data frame (for |
method |
a vector of synthesising methods applied to each variable in the saved synthesised data. |
visit.sequence |
a vector of column indices of the visiting sequence. The indices refer to the columns in the saved synthesised data. |
predictor.matrix |
a matrix specifying the set of predictors used for each variable in the saved synthesised data. |
smoothing |
a vector specifying smoothing methods applied to each variable in the saved synthesised data. |
event |
a vector of integers specifying for survival data the column indices for corresponding event indicators. The indices refer to the columns in the saved synthesised data. |
denom |
a vector of integers specifying for variables modelled using binomial regression the column indices for corresponding denominator variables. The indices refer to the columns in the saved synthesised data. |
proper |
a logical value indicating whether proper synthesis was conducted. |
n |
a number of cases in the original data. |
k |
a number of cases in the synthesised data. |
rules |
a list of rules for restricted values applied to the synthetic data. |
rvalues |
a list of the values corresponding to the rules
specified by |
cont.na |
a list of codes for missing values for continuous variables. |
semicont |
a list of values for semi-continuous variables at which they have spikes. |
drop.not.used |
a logical value indicating whether variables not used in synthesis are saved in the synthesised data and corresponding synthesis parameters. |
drop.pred.only |
a logical value indicating whether variables not synthesised and used as predictors only are saved in the synthesised data. |
models |
if |
seed |
an integer used as a |
var.lab |
a vector of variable labels for data imported from SPSS using
|
val.lab |
a list of value labels for factors for data imported from SPSS
using |
obs.vars |
a vector of all variable names in the observed data set. |
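A short sketch of inspecting these components, assuming the synthpop package and its SD2011 example data (the variables, m = 2 and the seed are illustrative). With m greater than 1 the syn component is expected to hold one data frame per synthetic copy.

```r
library(synthpop)

ods <- SD2011[1:500, c("sex", "age", "marital")]
s   <- syn(ods, m = 2, seed = 123, print.flag = FALSE)

class(s)          # class of the returned object
s$m               # number of synthetic copies
s$n               # number of cases in the original data
s$k               # number of cases in each synthetic data set
length(s$syn)     # one element per synthetic copy when m > 1
head(s$syn[[1]])  # first synthetic data set
s$method          # method applied to each variable
```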
When syn.strata() is used there are two additional components:
strata.syn |
a factor variable or a list of factor variables containing
stratum values for all observation units in |
strata.lab |
a character vector of strata labels. |
Note also that when syn.strata() is used, most of these components are matrices with one row per stratum, or lists with one element per stratum.
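The stratum-specific components can be examined with a sketch like the one below. It assumes the synthpop package and its SD2011 example data; stratifying on sex, the subset size and the seed are illustrative choices.

```r
library(synthpop)

ods <- SD2011[1:500, c("sex", "age", "marital")]

# Separate synthesis within each category of sex
s <- syn.strata(ods, strata = "sex", seed = 123, print.flag = FALSE)

s$strata.lab   # labels of the strata
s$method       # per-stratum methods
```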
See package vignette for additional information.
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi: 10.18637/jss.v074.i11.
### selection of variables
vars <- c("sex","age","marital","income","ls","smoke")
ods <- SD2011[1:1000, vars]

### default synthesis
s1 <- syn(ods)
s1

### synthesis with default parametric methods
s2 <- syn(ods, method = "parametric", seed = 1)
s2$method

### multiple synthesis of selected variables with customised methods
s3 <- syn(ods, visit.sequence = c(2, 1, 4, 5), m = 2,
          method = c("logreg","sample","","normrank","ctree",""),
          ctree.minbucket = 10)
summary(s3)
summary(s3, msel = 1:2)

### adjustment to the default predictor matrix
s4.ini <- syn(data = ods, visit.sequence = c(1, 2, 5, 3),
              m = 0, drop.not.used = FALSE)
pM.cor <- s4.ini$predictor.matrix
pM.cor["marital","ls"] <- 0
s4 <- syn(data = ods, visit.sequence = c(1, 2, 5, 3),
          predictor.matrix = pM.cor)

### handling missing values in continuous variables
s5 <- syn(ods, cont.na = list(income = c(NA, -8)))

### rules for restricted values - marital status of males under 18 should be 'single'
s6 <- syn(ods, rules = list(marital = "age < 18 & sex == 'MALE'"),
          rvalues = list(marital = 'SINGLE'),
          method = "parametric", seed = 1)
with(s6$syn, table(marital[age < 18 & sex == 'MALE']))

### results for default parametric synthesis without the rule
with(s2$syn, table(marital[age < 18 & sex == 'MALE']))

### synthesis with ipf for all variables
s7 <- syn(ods[, 1:3], method = "ipf", numtocat = "age")

### stratified synthesis
s8 <- syn.strata(ods, strata = "sex")