synthpop: syn.cart – R documentation

syn.cart

Synthesis with classification and regression trees (CART)

Description

Generates univariate synthetic data using classification and regression trees (with or without bootstrap).

Usage

syn.ctree(y, x, xp, smoothing = "", proper = FALSE, 
          minbucket = 5, mincriterion = 0.9, ...)
syn.cart(y, x, xp, smoothing = "", proper = FALSE, 
         minbucket = 5, cp = 1e-08, ...)

Arguments

`y`	an original data vector of length `n`.
`x`	a matrix (`n` x `p`) of original covariates.
`xp`	a matrix (`k` x `p`) of synthesised covariates.
`smoothing`	smoothing method for continuous variables.
`proper`	for proper synthesis (`proper = TRUE`) a CART model is fitted to a bootstrapped sample of the original data.
`minbucket`	the minimum number of observations in any terminal node. See `rpart.control` and `ctree_control` for details.
`cp`	complexity parameter. Any split that does not decrease the overall lack of fit by a factor of `cp` is not attempted. Small values of `cp` will grow large trees. See `rpart.control` for details.
`mincriterion`	`1 - p-value` of the test that must be exceeded for a split to be retained. Small values of `mincriterion` will grow large trees. See `ctree_control` for details.
`...`	additional parameters passed to `ctree_control` for `syn.ctree` and `rpart.control` for `syn.cart`.
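
The effect of `cp` and `minbucket` on tree size can be illustrated with `rpart` directly. The following is a minimal sketch on simulated data, not taken from the package: the data frame `d` and the helper `n_leaves` are hypothetical names introduced for illustration only.

```r
library(rpart)  # recommended package shipped with standard R distributions

set.seed(3)
n <- 300
d <- data.frame(x1 = runif(n), x2 = runif(n))           # simulated covariates
d$y <- sin(6 * d$x1) + d$x2 + rnorm(n, sd = 0.3)        # simulated response

# Hypothetical helper: count the terminal nodes of a fitted rpart tree.
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Same defaults as in the Usage section: a very small cp grows a large tree.
fit_deep   <- rpart(y ~ ., data = d, method = "anova", minbucket = 5,  cp = 1e-08)
# A larger cp stops splitting earlier.
fit_cp     <- rpart(y ~ ., data = d, method = "anova", minbucket = 5,  cp = 0.01)
# A larger minbucket forces more observations into every terminal node.
fit_bucket <- rpart(y ~ ., data = d, method = "anova", minbucket = 30, cp = 1e-08)

# Compare the number of terminal nodes across the three settings.
print(c(deep = n_leaves(fit_deep), cp = n_leaves(fit_cp), bucket = n_leaves(fit_bucket)))
```

Larger terminal nodes mean larger donor pools, which is why increasing `minbucket` is relevant for data protection.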

Details

The procedure for synthesis by a CART model is as follows:

1. Fit a classification or regression tree to the original data (`y` on `x`) by binary recursive partitioning.
2. For each row of `xp`, find the terminal node into which it falls.
3. Randomly draw a donor from the original observations in that node and take the donor's observed value of `y` as the synthetic value.
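
The steps above can be sketched with `rpart` for a numeric `y` without missing values. This is an illustrative sketch under those assumptions, not the package's implementation: `cart_synth_sketch` is a hypothetical helper, and the data are simulated. Terminal nodes for new rows are located by temporarily replacing the node predictions with node indices, which is one possible way to do it.

```r
library(rpart)  # recommended package shipped with standard R distributions

# Hypothetical helper, not part of synthpop: numeric y only, no missing values.
cart_synth_sketch <- function(y, x, xp, proper = FALSE, minbucket = 5, cp = 1e-08) {
  if (proper) {
    # Proper synthesis: fit the tree to a bootstrap sample of the original data.
    s <- sample.int(length(y), replace = TRUE)
    y <- y[s]
    x <- x[s, , drop = FALSE]
  }

  # Step 1: fit a regression tree by binary recursive partitioning.
  fit <- rpart(y ~ ., data = data.frame(y = y, x), method = "anova",
               minbucket = minbucket, cp = cp)

  # Step 2: find the terminal node of each row of xp. On a copy of the fit,
  # each node's prediction is replaced by its row index in fit$frame, so that
  # predict() returns node indices comparable with fit$where.
  locator <- fit
  locator$frame$yval <- seq_len(nrow(fit$frame))
  leaf_new <- predict(locator, newdata = xp)

  # Step 3: draw one donor per row of xp from the observed y values in its node.
  donors <- split(y, fit$where)
  res <- vapply(leaf_new, function(leaf) {
    pool <- donors[[as.character(leaf)]]
    pool[sample.int(length(pool), 1)]  # safe even when the pool has one value
  }, numeric(1), USE.NAMES = FALSE)

  list(res = res, fit = fit)
}

# Simulated example data (assumed, for illustration only).
set.seed(1)
n <- 200
x <- data.frame(age = runif(n, 18, 80),
                sex = factor(sample(c("f", "m"), n, replace = TRUE)))
y <- 1000 + 20 * x$age + 300 * (x$sex == "m") + rnorm(n, sd = 100)
xp <- x[sample.int(n, 50, replace = TRUE), , drop = FALSE]  # stand-in for synthesised covariates

out <- cart_synth_sketch(y, x, xp)
out_proper <- cart_synth_sketch(y, x, xp, proper = TRUE)
```

Without smoothing, every synthetic value is an observed value of `y`, because each one is copied from a donor.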

`syn.ctree` uses the `ctree` function from the party package, and `syn.cart` uses the `rpart` function from the rpart package. They differ in, among other things, how the splitting variable is selected and in the stopping rule for the splitting process.

Gaussian kernel smoothing can be applied to continuous variables by setting the `smoothing` parameter to `"density"`. It is recommended as a tool to decrease the disclosure risk. Increasing `minbucket` is another means of data protection.
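
The idea behind Gaussian kernel smoothing of donor values can be sketched in base R. This is a sketch under stated assumptions, not the package's exact procedure: the bandwidth rule (`bw.nrd0`) and the example values in `y_syn` are chosen here for illustration only. Adding Gaussian noise with standard deviation equal to the kernel bandwidth is equivalent to drawing from a Gaussian kernel density estimate of the synthetic values.

```r
set.seed(2)

# Assumed example: synthetic values copied from donors, so they repeat observed values.
y_syn <- sample(c(1200, 1500, 1500, 1800, 2100, 2600), 100, replace = TRUE)

# Bandwidth of a Gaussian kernel density estimate (rule chosen for illustration).
bw <- stats::bw.nrd0(y_syn)

# Perturb each synthetic value with Gaussian noise of standard deviation bw,
# so that released values no longer coincide with observed ones.
y_smooth <- rnorm(length(y_syn), mean = y_syn, sd = bw)
```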

CART models were suggested for generation of synthetic data by Reiter (2005) and then evaluated by Drechsler and Reiter (2011).

Value

A list with two components:

`res`	a vector of length `k` with synthetic values of `y`.
`fit`	the fitted model, which is an object of class `rpart.object` or `ctree.object` that can be printed or plotted.

References

Reiter, J.P. (2005). Using CART to generate partially synthetic, public use microdata. Journal of Official Statistics, 21(3), 441–462.

Drechsler, J. and Reiter, J.P. (2011). An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Computational Statistics and Data Analysis, 55(12), 3232–3243.