syn_da.Rd
This function generates synthetic data utilizing data augmentation
(Jiang et al., 2022; Grund et al., 2022). Continuous
and ordinal variables can be handled. The order of the synthesized variables
can be defined using the argument syn_vars.
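To make the idea of data augmentation concrete, the following base R sketch walks through a single variable. It is an illustration under stated assumptions, not the code used inside syn_da: the names z, x, x_da and x_syn are hypothetical, da_noise is read as the share of the noisy variable's variance that is due to noise, and a plain linear model stands in for the conditional synthesis model.

```r
set.seed(1)
n <- 2000
z <- rnorm(n)                 # a variable held fixed (role of fix_vars)
x <- 0.5 * z + rnorm(n)       # a variable to be synthesized (role of syn_vars)

# Assumption: da_noise is the unreliability of the noisy copy,
# i.e. var(noise) / var(x + noise)
da_noise <- 0.5
noise_var <- da_noise / (1 - da_noise) * var(x)

# noisy ("augmented") copy of x
x_da <- x + rnorm(n, sd = sqrt(noise_var))

# conditional model: x given the fixed variable and the noisy copy
fit <- lm(x ~ z + x_da)

# synthetic values: model prediction plus newly drawn residuals
x_syn <- fitted(fit) + rnorm(n, sd = sigma(fit))

# share of noise variance in the augmented variable (close to da_noise)
noise_share <- noise_var / var(x_da)
```

In this sketch, smaller values of da_noise make x_da a closer copy of x, so x_syn stays closer to the original values; larger values add more protection at the cost of fidelity.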
syn_da(dat, syn_vars=NULL, fix_vars=NULL, ord_vars=NULL, da_noise=0.5,
    formula_syn=NULL, use_pls=TRUE, ncomp=20, exact_regression=TRUE,
    exact_marginal=TRUE, imp_maxit=5)
dat: Original dataset
syn_vars: Vector with the names of the variables that should be synthesized
fix_vars: Vector with the names of the variables that are held fixed in the synthesis
ord_vars: Vector with the names of ordinal variables that are treated as factors when they are used as predictors in the regression model
da_noise: Proportion of variance (i.e., unreliability) that is added as noise in data augmentation. Can be a single number that applies to all variables, or a vector of the same length as syn_vars if different unreliabilities should be used. Variables that should not receive a noise variable should be specified with a 1 entry (see Example 2). If da_noise=1, no noisy versions of the original variables are specified.
formula_syn: Optional list of regression formulas for the conditional models. Formulas can be specified for a subset of the synthesized variables. For variables without a formula, a linear model is specified automatically.
use_pls: Logical indicating whether partial least squares (PLS) should be used for dimension reduction
ncomp: Number of PLS factors
exact_regression: Logical indicating whether residuals are forced to be uncorrelated with predictors in the synthesis model
exact_marginal: Logical indicating whether marginal distributions of the variables should be preserved
imp_maxit: Number of iterations in the imputation if the original dataset contains missing values
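The two "exact" options can be pictured with a small base R sketch. This is only one plausible reading of the descriptions above and is not taken from the syn_da source: all object names are hypothetical, the residual adjustment is done by projecting random draws onto the orthogonal complement of the predictors, and the marginal adjustment is done by rank matching against the original values.

```r
set.seed(2)
n <- 300
z <- rnorm(n)                       # predictor
x <- 1 + 0.6 * z + rnorm(n)         # original variable
fit <- lm(x ~ z)                    # synthesis model (illustrative)

## exact_regression idea: residuals exactly uncorrelated with predictors
e_raw  <- rnorm(n)
e_orth <- resid(lm(e_raw ~ z))      # orthogonal to intercept and z
# rescale so that the residual sum of squares matches the fitted model
e_orth <- e_orth * sqrt(sum(resid(fit)^2) / sum(e_orth^2))
x_syn  <- fitted(fit) + e_orth

# refitting the model on synthetic data reproduces the coefficients
refit <- lm(x_syn ~ z)
same_coef <- isTRUE(all.equal(coef(refit), coef(fit)))

## exact_marginal idea: keep the ranks of x_syn, but use the original values
x_marg <- sort(x)[rank(x_syn, ties.method = "first")]
same_marginal <- identical(sort(x_marg), sort(x))
```

Note that in this sketch the rank matching changes the synthetic values, so the coefficients of a model refitted to x_marg are no longer guaranteed to be identical to those of the original fit.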
A list with entries
generated synthetic data
Data frame containing original and synthetic data
more entries
Grund, S., Luedtke, O., & Robitzsch, A. (2022). Using synthetic data to improve the reproducibility of statistical results in psychological research. Psychological Methods. Epub ahead of print. doi:10.1037/met0000526
Jiang, B., Raftery, A. E., Steele, R. J., & Wang, N. (2022). Balancing inferential integrity and disclosure risk via model targeted masking and multiple imputation. Journal of the American Statistical Association, 117(537), 52-66. doi:10.1080/01621459.2021.1909597
if (FALSE) {
#############################################################################
# EXAMPLE 1: Generate synthetic data with item responses and covariates
#############################################################################
data(data.ma09, package="miceadds")
dat <- data.ma09
# fixed variables in synthesis
fix_vars <- c("PV1MATH", "SEX","AGE")
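# ordinal variables in synthesis
# ('items' is not defined in this example; it must be a character vector
#  with the names of the item response variables in dat)
ord_vars <- c("FISCED", "MISCED", items)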
# variables that should be synthesized
syn_vars <- c("HISEI", "FISCED", "MISCED", items)
#-- synthesize data
mod <- miceadds::syn_da( dat=dat, syn_vars=syn_vars, fix_vars=fix_vars,
            ord_vars=ord_vars, da_noise=0.5, imp_maxit=2, use_pls=TRUE, ncomp=20,
            exact_regression=TRUE, exact_marginal=TRUE)
#- extract synthetic dataset
mod$dat_syn
#############################################################################
# EXAMPLE 2: Not all variables are augmented, formula specifications
#############################################################################
data(data.ma09, package="miceadds")
dat <- data.ma09
# fixed variables in synthesis
fix_vars <- c("PV1MATH", "SEX")
# ordinal variables in synthesis
ord_vars <- c("FISCED", "MISCED")
# variables that should be synthesized
syn_vars <- c("AGE","HISEI", "FISCED", "MISCED")
# no noise variable for FISCED and MISCED should be specified
da_noise <- c(AGE=0.1, HISEI=0.1, FISCED=0, MISCED=0)
# define conditional models for some variables
formula_syn <- list(
    AGE=AGE ~ 1 + PV1MATH + SEX + I(PV1MATH^2) + AGE_DA + HISEI_DA,
    HISEI=HISEI ~ 1 + PV1MATH + SEX + AGE + I(PV1MATH^2) + I(AGE^2) +
                I(AGE*PV1MATH) + AGE_DA + HISEI_DA
)
#-- synthesize data
mod <- miceadds::syn_da( dat=dat, syn_vars=syn_vars, fix_vars=fix_vars,
            ord_vars=ord_vars, da_noise=da_noise,
            formula_syn=formula_syn, imp_maxit=2, use_pls=TRUE, ncomp=20,
            exact_regression=TRUE, exact_marginal=TRUE)
str(mod)
}