Title: | Post-Linkage Data Analysis Based on Mixture Modelling |
---|---|
Description: | Perform inference in the secondary analysis setting with linked data potentially containing mismatch errors. Only the linked data file may be accessible and information about the record linkage process may be limited or unavailable. Implements the 'General Framework for Regression with Mismatched Data' developed by Slawski et al. (2023) <doi:10.48550/arXiv.2306.00909>. The framework uses a mixture model for pairs of linked records whose two components reflect distributions conditional on match status, i.e., correct match or mismatch. Inference is based on composite likelihood and the Expectation-Maximization (EM) algorithm. The package currently supports Cox Proportional Hazards Regression (right-censored data only) and Generalized Linear Regression Models (Gaussian, Gamma, Poisson, and Logistic (binary models only)). Information about the underlying record linkage process can be incorporated into the method if available (e.g., assumed overall mismatch rate, safe matches, predictors of match status, or predicted probabilities of correct matches). |
Authors: | Priyanjali Bukke [aut, cre], Zhenbang Wang [aut], Martin Slawski [aut] ([email protected]), Brady T. West [aut], Emanuel Ben-David [aut], Guoqing Diao [aut] |
Maintainer: | Priyanjali Bukke <[email protected]> |
License: | GPL-2 |
Version: | 0.1.1 |
Built: | 2025-03-03 04:32:45 UTC |
Source: | https://github.com/bpriy/pldamixture |
pldamixture
implements the "General Framework for Regression with
Mismatched Data" developed by Slawski et al., 2023. The framework uses a
mixture model for pairs of linked records whose two components reflect
distributions conditional on match status, i.e., correct match or mismatch.
Inference is based on composite likelihood and the EM algorithm.
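The mixture idea can be illustrated with a small base-R sketch. It is not the package's source code: it assumes a Gaussian outcome model, a constant mismatch rate, and a Gaussian approximation to the marginal density of the response, and all object names (`fy`, `alpha`, `w`, `objective`) are chosen for illustration only.

```r
# Toy EM for linear regression with mismatched pairs (illustrative only;
# this is NOT the pldamixture implementation).
set.seed(1)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.25)
mism <- sample(n, size = 0.2 * n)    # 20% of the records receive a wrong response
y[mism] <- y[sample(mism)]           # shuffle responses among those records

X <- cbind(1, x)
fy <- dnorm(y, mean(y), sd(y))       # assumption: Gaussian estimate of the marginal density of y
naive <- coef(lm(y ~ x))             # fit that ignores mismatches, used as starting value
beta <- naive
sigma <- sd(y - drop(X %*% beta))
alpha <- 0.5                         # starting value for the mismatch rate
objective <- numeric(0)              # negative log-likelihood per iteration

for (iter in 1:1000) {
  # E-step: posterior probability that each pair is a correct match
  f_match <- dnorm(y, mean = drop(X %*% beta), sd = sigma)
  mix <- (1 - alpha) * f_match + alpha * fy
  w <- (1 - alpha) * f_match / mix
  objective <- c(objective, -sum(log(mix)))

  # M-step: weighted least squares, weighted variance, updated mismatch rate
  beta <- lm.wfit(X, y, w)$coefficients
  sigma <- sqrt(sum(w * (y - drop(X %*% beta))^2) / sum(w))
  alpha <- 1 - mean(w)

  # stop once the objective changes by less than the tolerance
  if (iter > 1 && abs(objective[iter] - objective[iter - 1]) < 1e-4) break
}
```

Comparing `beta` with `naive` shows how the weights `w` change the fit, and `alpha` plays the role of an estimated overall mismatch rate in this simplified setting.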
The package contains four functions:
fit_mixture()
print.fitmixture()
summary.fitmixture()
predict.fitmixture()
The references below discuss the implemented framework in more detail.
*Corresponding Author ([email protected])
Slawski, M.*, West, B. T., Bukke, P., Diao, G., Wang, Z., & Ben-David, E. (2023).
A General Framework for Regression with Mismatched Data Based on Mixture Modeling.
Under Review. <doi:10.48550/arXiv.2306.00909>
Bukke, P., Ben-David, E., Diao, G., Slawski, M.*, & West, B. T. (2023).
Cox Proportional Hazards Regression Using Linked Data: An Approach Based on Mixture Modelling.
Under Review.
Slawski, M.*, Diao, G., & Ben-David, E. (2021). A pseudo-likelihood approach to linear regression with partially shuffled data. Journal of Computational and Graphical Statistics, 30(4), 991-1003. <doi:10.1080/10618600.2020.1870482>
# optional inputs for linear regression of age at death on year of birth,
# using a cubic polynomial specification.
## use commonness of names as predictors of match status
## first and last names were used for linkage
mformula <- ~commf + comml
## hand-linked records are considered "safe" matches
safematches <- ifelse(lifem$hndlnk == "Hand-Linked At Some Level", TRUE, FALSE)
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05
fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula, safematches, mrate)
print(fit)
summary(fit)
predict(fit)
Perform regression adjusted for mismatched data. The function currently supports Cox Proportional Hazards Regression (right-censored data only) and Generalized Linear Regression Models (Gaussian, Gamma, Poisson, and Logistic (binary models only)). Information about the underlying record linkage process can be incorporated into the method if available (e.g., assumed overall mismatch rate, safe matches, predictors of match status, or predicted probabilities of correct matches).
fit_mixture(
  formula,
  data,
  family = "gaussian",
  mformula,
  safematches,
  mrate,
  control = list(initbeta = "default", initgamma = "default", fy = "default",
                 maxiter = 1000, tol = 1e-04, cmaxiter = 1000),
  ...
)
formula |
a formula object for the outcome model, with the covariate(s) on
the right of "~" and the response on the left. In the Cox proportional hazards
setting, the response should be provided using the |
data |
a data.frame with the linked data used in "formula" and, optionally, "mformula" |
family |
the type of regression model ("gaussian" - default, "poisson", "binomial", "gamma", "cox"). For Generalized Linear Models, standard link functions are used ("identity" for Gaussian, "log" for Poisson and Gamma, and "logit" for binomial). |
mformula |
a one-sided formula object for the mismatch indicator model, with the covariates on the right of "~". The default is an intercept-only model, corresponding to a constant mismatch rate. |
safematches |
an indicator variable for safe matches (TRUE : record can be treated as a correct match and FALSE : record may be mismatched). The default is FALSE for all matches. |
mrate |
the assumed overall mismatch rate (a proportion between 0 and 1). If not provided, no overall mismatch rate is assumed. |
control |
an optional list variable to customize the initial parameter estimates ("initbeta" for the outcome model and "initgamma" for the mismatch indicator model), estimated marginal density of the response ("fy"), maximum iterations for the EM algorithm ("maxiter"), maximum iterations for the subroutine in the constrained logistic regression function ("cmaxiter"), and convergence tolerance for the termination of the EM algorithm ("tol"). |
... |
the option to directly pass "control" arguments |
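The arguments above can also be supplied by name, which some readers may find clearer than positional matching. The sketch below assumes the package is installed and spells out the full control list from the usage section, changing only "maxiter" and "tol"; the object name `ctrl` is hypothetical.

```r
# Control list mirroring the documented defaults, with two entries changed
ctrl <- list(initbeta = "default", initgamma = "default", fy = "default",
             maxiter = 200, tol = 1e-06, cmaxiter = 1000)

# Run the fit only if the package is available
if (requireNamespace("pldamixture", quietly = TRUE)) {
  data(lifem, package = "pldamixture")
  fit <- pldamixture::fit_mixture(
    age_at_death ~ poly(unit_yob, 3, raw = TRUE),
    data = lifem,
    family = "gaussian",
    mformula = ~ commf + comml,                                 # predictors of match status
    safematches = lifem$hndlnk == "Hand-Linked At Some Level",  # logical vector of safe matches
    mrate = 0.05,                                               # assumed overall mismatch rate
    control = ctrl
  )
  print(fit)
}
```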
a list of results from the function called depending on the "family" specified.
coefficients |
the outcome model coefficient estimates |
match.prob |
the posterior correct match probabilities for observations given parameter estimates |
objective |
a variable that tracks the negative log pseudo-likelihood for all iterations of the EM algorithm. |
family |
the type of (outcome) regression model |
standard.errors |
the estimated standard errors |
m.coefficients |
the correct match model coefficient estimates |
call |
the matched call |
wfit |
an internal-use object for the predict function |
dispersion |
the dispersion parameter estimate when the family is a Generalized Linear Model |
Lambdahat_0 |
the baseline cumulative hazard (using weighted Breslow estimator) when the family is "cox" |
g_Lambdahat_0 |
the baseline cumulative hazard for the marginal density of the response variable (using Nelson-Aalen estimator) when the family is "cox" |
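For readers unfamiliar with the Nelson-Aalen estimator named above, the following base-R sketch computes the textbook version: the cumulative sum of events divided by the number at risk at each event time. It is not the package's internal code, and the function name `nelson_aalen` and the small data vectors are hypothetical.

```r
# Textbook Nelson-Aalen estimator of the cumulative hazard (illustrative only)
nelson_aalen <- function(time, status) {
  event_times <- sort(unique(time[status == 1]))
  # number of events at each distinct event time
  d <- vapply(event_times, function(s) sum(time == s & status == 1), numeric(1))
  # number of subjects still at risk just before each event time
  r <- vapply(event_times, function(s) sum(time >= s), numeric(1))
  data.frame(time = event_times, cumhaz = cumsum(d / r))
}

time   <- c(2, 3, 3, 5, 7, 8)
status <- c(1, 1, 0, 1, 0, 1)   # 1 = event, 0 = right-censored
na_est <- nelson_aalen(time, status)
na_est
```

A weighted variant, such as the weighted Breslow estimator mentioned for "Lambdahat_0", would replace the plain counts with weighted sums; the details are in the cited references.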
The references below discuss the implemented framework in more detail. The standard
errors are estimated using Louis' method for the "cox" family (Bukke et al., 2023) and
using the sandwich formula otherwise (Slawski et al., 2023).
*Corresponding Author ([email protected])
Slawski, M.*, West, B. T., Bukke, P., Diao, G., Wang, Z., & Ben-David, E. (2023).
A General Framework for Regression with Mismatched Data Based on Mixture Modeling.
Under Review. <doi:10.48550/arXiv.2306.00909>
Bukke, P., Ben-David, E., Diao, G., Slawski, M.*, & West, B. T. (2023).
Cox Proportional Hazards Regression Using Linked Data: An Approach Based on Mixture Modelling.
Under Review.
Slawski, M.*, Diao, G., & Ben-David, E. (2021). A pseudo-likelihood approach to linear regression with partially shuffled data. Journal of Computational and Graphical Statistics, 30(4), 991-1003. <doi:10.1080/10618600.2020.1870482>
## commonness score of first and last names used for linkage
mformula <- ~commf + comml
## hand-linked records are considered "safe" matches
safematches <- ifelse(lifem$hndlnk == "Hand-Linked At Some Level", TRUE, FALSE)
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05
fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula, safematches, mrate)
The lifem data set contains a subset of data from the Life-M project (https://life-m.org/) on 3,238 individuals born between 1883
and 1906. These records were obtained by linking birth certificates and death certificates in one of two
ways. A fraction of the records (2,159 records) were randomly sampled to be “hand-linked at some level” (HL).
These records are of high quality and were manually linked at some point by trained research assistants.
The remaining records were “purely machine-linked” (ML) based on probabilistic record linkage without clerical
review. The Life-M team expects the mismatch rate among these records to be around 5% (Bailey et al.
2022). Of interest is the relationship between age at death and year of birth. The lifem demo data set
consists of 2,159 hand-linked records
and 1,079 records that were randomly sampled from the purely machine-linked records (~2:1 HL-ML ratio).
data(lifem)
a data frame with 3,238 rows and 6 variables
yob: year of birth (between 1883 and 1906)
unit_yob: yob re-scaled to the unit interval for analysis (between 0 and 1). If X is the yob, the re-scaled value is (X - min(X)) / (max(X) - min(X)) = a * X + b, where a = 1 / (max(X) - min(X)) and b = -min(X) * a
age_at_death: age at death (in years)
hndlnk: whether record was purely machine-linked or hand-linked at some level.
commf: commonness score of first name (between 0 and 1). It is based on the 1940 census. It is a ratio of the log count of the individual’s first name over the log count of the most commonly occurring first name in the census.
comml: commonness score of last name (between 0 and 1). It is based on the 1940 census. It is a ratio of the log count of the individual’s last name over the log count of the most commonly occurring last name in the census.
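The two derived variables described above can be reproduced in a few lines of base R. The sketch uses made-up inputs (the `yob` values and `name_counts` are hypothetical and are not taken from the Life-M or census data) and only illustrates the arithmetic.

```r
# Min-max re-scaling of year of birth, written in the a * X + b form
yob <- c(1883, 1890, 1899, 1906)      # hypothetical values within the documented range
a <- 1 / (max(yob) - min(yob))
b <- -min(yob) * a
unit_yob <- a * yob + b

# The same values from the direct min-max formula
unit_yob_direct <- (yob - min(yob)) / (max(yob) - min(yob))

# Commonness score: log count of a name over the log count of the most common name
name_counts <- c(John = 50000, Mary = 40000, Elmer = 900)   # made-up counts
commonness <- log(name_counts) / log(max(name_counts))
```

Because the score is a ratio of logarithms, the most common name scores exactly 1 and rarer names fall below it.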
Bailey, Martha J., Lin, Peter Z., Mohammed, A.R. Shariq, Mohnen, Paul, Murray, Jared, Zhang, Mengying, and Prettyman, Alexa. LIFE-M: The Longitudinal, Intergenerational Family Electronic Micro-Database. Ann Arbor, MI: Inter-university Consortium for Political and Social Research (distributor), 2022-12-21. < doi:10.3886/E155186V5 >
Obtain predictions from a fit_mixture() object using predict.coxph(), predict.glm(), or predict.lm().
## S3 method for class 'fitmixture'
predict(
  object,
  newdata,
  type,
  terms = NULL,
  na.action = na.pass,
  reference = "strata",
  ...
)
object |
the result of a call to |
newdata |
optional new data to obtain predictions for. The original data is used by default. |
type |
the type of prediction. For the "cox" family, the choices are the linear predictor ("lp"), the risk score exp(lp) ("risk"), the expected number of events given the covariates and follow-up time ("expected"), and the terms of the linear predictor ("terms"). The survival probability for a subject is equal to exp(-expected). For the "gaussian" family, the choices are response ("response") or model term ("terms"). For the other glm families ("poisson", "binomial", "gamma"), the choices are predictions on the scale of the linear predictors ("link"), response ("response"), or model term ("terms"). |
terms |
the terms when type = "terms". By default, all terms are included. |
na.action |
a function for what to do with missing values in |
reference |
when family = "cox", the reference for centering predictions. Available options are "strata" (default), "sample", and "zero". |
... |
for future predict arguments |
a vector or matrix of predictions based on arguments specified.
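The prediction scales listed under "type" follow the conventions of the underlying predict methods. The base-R sketch below uses an ordinary glm() fit on simulated data, not a fit_mixture() object, to show how the "link" and "response" scales relate for a binomial model, and applies the exp(-expected) relation to hypothetical values of "expected".

```r
# Simulated binary data and a plain logistic regression (illustrative only)
set.seed(42)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, size = 1, prob = plogis(-0.5 + d$x))
g <- glm(y ~ x, data = d, family = binomial)

eta <- predict(g, type = "link")        # linear predictor scale
p   <- predict(g, type = "response")    # probability scale
tm  <- predict(g, type = "terms")       # matrix with one column per model term

# For the logit link, the response is the inverse logit of the linear predictor
same_scale <- isTRUE(all.equal(plogis(eta), p))

# Cox setting: survival probability from expected numbers of events
expected <- c(0.1, 0.5, 2)              # hypothetical values of type = "expected"
surv_prob <- exp(-expected)
```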
## commonness score of first and last names used for linkage
mformula <- ~commf + comml
## hand-linked records are considered "safe" matches
safematches <- ifelse(lifem$hndlnk == "Hand-Linked At Some Level", TRUE, FALSE)
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05
fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula, safematches, mrate)
predict(fit)
Print the call and outcome model coefficients from a fit_mixture() object.
## S3 method for class 'fitmixture'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
x |
the result of a call to |
digits |
the number of significant digits to print |
... |
for additional print arguments |
invisibly returns the fit_mixture() object that is provided as an argument
## commonness score of first and last names used for linkage
mformula <- ~commf + comml
## hand-linked records are considered "safe" matches
safematches <- ifelse(lifem$hndlnk == "Hand-Linked At Some Level", TRUE, FALSE)
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05
fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula, safematches, mrate)
print(fit)
Summarize results from a fit_mixture() object.
## S3 method for class 'fitmixture'
summary(object, ...)
object |
the result of a call to |
... |
for additional summary arguments |
a list of results from the function called depending on the "family" specified.
call |
the matched call |
family |
the assumed type of (outcome) regression model |
coefficients |
a matrix with the outcome model's coefficient estimates, standard errors, t or z values, and p-values |
m.coefficients |
a matrix with the correct match model's coefficient estimates and standard errors |
avgcmr |
the average correct match rate among all records |
match.prob |
the posterior correct match probabilities for observations given parameter estimates |
dispersion |
the dispersion parameter estimate when the family is a Generalized Linear Model |
## commonness score of first and last names used for linkage
mformula <- ~commf + comml
## hand-linked records are considered "safe" matches
safematches <- ifelse(lifem$hndlnk == "Hand-Linked At Some Level", TRUE, FALSE)
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05
fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula, safematches, mrate)
summary(fit)