Title: | Synthetic Data Integration |
---|---|
Description: | Regression inference for multiple populations by integrating summary-level data using stacked imputations. Gu, T., Taylor, J.M.G. and Mukherjee, B. (2021) A synthetic data integration framework to leverage external summary-level information from heterogeneous populations <arXiv:2106.06835>. |
Authors: | Tian Gu [aut], Jeremy M.G. Taylor [aut], Bhramar Mukherjee [aut], Michael Kleinsasser [cre] |
Maintainer: | Michael Kleinsasser <[email protected]> |
License: | GPL-2 |
Version: | 0.1.0 |
Built: | 2025-01-17 05:24:16 UTC |
Source: | https://github.com/umich-biostatistics/syndi |
Example data set for Create.Synthetic()
A list with:
nrep |
number of times the observed X is replicated when generating the synthetic data |
datan |
simulated internal data set |
betaHatExt_list |
list of external model estimates |
Creates a synthetic data set from internal data and external models.
Create.Synthetic( datan, nrep, Y, XB, Ytype = "binary", parametric, betaHatExt_list, sigmaHatExt_list = NULL )
datan |
internal data only |
nrep |
number of replications of the observed X when creating the synthetic data |
Y |
outcome name, e.g. Y='Y' |
XB |
all covariate names for both X and B in the target model, e.g. XB=c('X1','X2','X3','X4','B1','B2') |
Ytype |
the type of outcome Y, either 'binary' or 'continuous'. |
parametric |
a vector with one entry, "Yes" or "No", per external model, specifying whether that external model is parametric, e.g. parametric=c('Yes','No') |
betaHatExt_list |
a list of parameter estimates, one element per external model. Within each element the estimates must follow the order of the variables in XB, and each estimate must be named. See the example for details. |
sigmaHatExt_list |
a list of sigma^2 for continuous outcome fitted from linear regression. If not available or the outcome type is binary, set sigmaHatExt_list=NULL |
A data.frame: the combined dataset of the internal data (of size n) and the synthetic data for the given external model (of size n * nrep). This combined dataset contains a total of n*(1+nrep) rows, one intercept column (Int), one outcome column (Y), one indicator column (S), and all the predictors in the internal data. S is the indicator variable: S=0 marks the internal data and S=1 marks the synthetic data. The internal data part is a complete dataset without any missingness. The synthetic data part may contain missingness for predictors that were not used in the external model.
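The layout of the returned data set can be illustrated with a toy example in base R. This is only a sketch of the stacked structure described above, not the package's implementation: the external coefficients `beta_ext`, the toy sizes, and the way the synthetic outcome is drawn are all assumptions made for illustration, and `Create.Synthetic()` may generate its synthetic rows differently.

```r
set.seed(1)

n    <- 5   # internal sample size (toy value)
nrep <- 2   # replication factor (toy value)

# Toy internal data: complete, flagged with S = 0
internal <- data.frame(
  Int = 1,
  Y   = rbinom(n, 1, 0.5),
  X1  = rnorm(n),
  B1  = rnorm(n),
  S   = 0
)

# Hypothetical external logistic model that used X1 only (assumed values)
beta_ext <- c(int = -0.5, X1 = 1)

# Replicate the observed X nrep times and draw a synthetic outcome
synthetic    <- internal[rep(seq_len(n), times = nrep), ]
eta          <- beta_ext["int"] + beta_ext["X1"] * synthetic$X1
synthetic$Y  <- rbinom(nrow(synthetic), 1, plogis(eta))
synthetic$B1 <- NA_real_   # B1 was not used in the external model
synthetic$S  <- 1

# Stack internal (S = 0) on top of synthetic (S = 1)
combined <- rbind(internal, synthetic)
rownames(combined) <- NULL

nrow(combined) == n * (1 + nrep)   # row count matches n*(1+nrep)
table(combined$S)                  # counts of internal vs synthetic rows
```

In this toy layout the S=0 block has no missing values, while the S=1 block has `NA` in `B1`, mirroring the missingness pattern described for predictors absent from the external model.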
Reference: Gu, T., Taylor, J.M.G. and Mukherjee, B. (2021) Regression inference for multiple populations by integrating summary-level data using stacked imputations https://arxiv.org/abs/2106.06835.
data(create_synthetic_example)
nrep = create_synthetic_example$nrep
datan = create_synthetic_example$datan
betaHatExt_list = create_synthetic_example$betaHatExt_list
data.combined = Create.Synthetic(nrep = nrep, datan = datan, Y = 'Y',
                                 XB = c('X1', 'X2', 'X3', 'X4', 'B1', 'B2'),
                                 Ytype = 'binary', parametric = c('Yes', 'No'),
                                 betaHatExt_list = betaHatExt_list,
                                 sigmaHatExt_list = NULL)
Expit function
expit(x)
x |
numeric vector to which the expit function is applied |
numeric vector with the value of the expit function y = expit(x) = exp(x)/(1+exp(x)).
Expit helper function.
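The formula above can be written out directly in base R. The function name `expit_sketch` below is hypothetical and is used only so the example is self-contained; it is a sketch of the stated formula, not the package source.

```r
# Hypothetical helper for illustration; implements exp(x) / (1 + exp(x))
expit_sketch <- function(x) {
  exp(x) / (1 + exp(x))
}

x <- c(-2, 0, 2)
p <- expit_sketch(x)

# Base R's plogis() computes the same logistic function
all.equal(p, plogis(x))
```

For very large positive inputs the direct formula overflows (`exp(x)` becomes `Inf`), so `plogis()` is the safer choice when extreme linear predictors are possible.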
Example data set for Initial.estimates()
A list with:
datan |
simulated internal data set |
gamma.I |
internal gamma coefficients |
beta |
beta estimates from external model 1 |
Calculate the initial estimates for external populations.
Initial.estimates(datan, gamma.I, X, B, beta, Btype)
datan |
internal data only |
gamma.I |
regression estimates using internal data only (datan) |
X |
a vector of predictors that were used in the external study, e.g. X = c('X1','X2','X3') |
B |
a vector of covariates that were not used in the external study, e.g. B=c('X4','B1','B2') |
beta |
a named vector of external model estimates, intercept first and then in the same order as listed in X, e.g. names(beta) = c("int", "X1", "X2", "X3") |
Btype |
a vector giving the type of each covariate in B, either "continuous" or "binary". If "continuous", linear regression will be used; if "binary", logistic regression will be used. More types can be implemented manually. |
A numeric vector of estimated coefficients of the target model for the given external population. If the internal data contains p predictors, the vector has length p+1, including the intercept estimate.
Neuhaus, J. and Jewell, N. (1993). A geometric approach to assess bias due to omitted covariates in generalized linear models. Biometrika 80, 807–815.
Gu, T., Taylor, J.M.G. and Mukherjee, B. (2021) Regression inference for multiple populations by integrating summary-level data using stacked imputations https://arxiv.org/abs/2106.06835.
data(initial_estimates_example)
datan = initial_estimates_example$datan
gamma.I = initial_estimates_example$gamma.I
beta = initial_estimates_example$beta
# calculate the initial gamma for population S=1
gamma.S1.origin = Initial.estimates(datan = datan, gamma.I = gamma.I,
                                    X = c('X1', 'X2', 'X3'),
                                    B = c('X4', 'B1', 'B2'),
                                    beta = beta,
                                    Btype = c('continuous', 'continuous', 'binary'))
Resampling function to get the bootstrap variance for binary Y. Resample.gamma.binaryY() is included in the package only as an example; users need to modify it to match their own Steps 1-5.
Resample.gamma.binaryY(data, indices)
data |
synthetic data |
indices |
row indices to replicate |
numeric vector of regression coefficients
Reference: Gu, T., Taylor, J.M.G. and Mukherjee, B. (2021) Regression inference for multiple populations by integrating summary-level data using stacked imputations https://arxiv.org/abs/2106.06835.
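A function with a (data, indices) signature has the shape commonly used for bootstrap statistics: it recomputes an estimate on the rows selected by `indices`. The sketch below shows that pattern in base R on toy data. The function `resample_coef`, the toy model, and the number of resamples are assumptions for illustration; the package's resampling functions rerun the user's own estimation steps rather than a single `glm()` fit.

```r
set.seed(2)

# Toy data standing in for a data set to be resampled
n   <- 200
toy <- data.frame(X1 = rnorm(n))
toy$Y <- rbinom(n, 1, plogis(-0.5 + toy$X1))

# Hypothetical statistic with the (data, indices) signature:
# refit the model on the resampled rows and return its coefficients
resample_coef <- function(data, indices) {
  d <- data[indices, , drop = FALSE]
  coef(glm(Y ~ X1, family = binomial(), data = d))
}

# Manual nonparametric bootstrap: resample row indices with replacement
R <- 200
boot_est <- t(replicate(R, resample_coef(toy, sample(n, replace = TRUE))))

# Bootstrap variance of each coefficient (one column per coefficient)
boot_var <- apply(boot_est, 2, var)
boot_var
```

A statistic written this way can also be passed to `boot::boot(data, statistic, R)`, which supplies the resampled indices itself; whether the package authors intended that usage is not stated in the source.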
Resampling function to get the bootstrap variance for continuous Y. Resample.gamma.continuousY() is included in the package only as an example; users need to modify it to match their own Steps 1-5.
Resample.gamma.continuousY(data, indices)
data |
synthetic data |
indices |
row indices to replicate |
numeric vector of regression coefficients
Reference: Gu, T., Taylor, J.M.G. and Mukherjee, B. (2021) Regression inference for multiple populations by integrating summary-level data using stacked imputations https://arxiv.org/abs/2106.06835.