Package 'pldamixture' reference manual

Title:	Post-Linkage Data Analysis Based on Mixture Modelling
Description:	Perform inference in the secondary analysis setting with linked data potentially containing mismatch errors. Only the linked data file may be accessible and information about the record linkage process may be limited or unavailable. Implements the 'General Framework for Regression with Mismatched Data' developed by Slawski et al. (2023) <doi:10.48550/arXiv.2306.00909>. The framework uses a mixture model for pairs of linked records whose two components reflect distributions conditional on match status, i.e., correct match or mismatch. Inference is based on composite likelihood and the Expectation-Maximization (EM) algorithm. The package currently supports Cox Proportional Hazards Regression (right-censored data only) and Generalized Linear Regression Models (Gaussian, Gamma, Poisson, and Logistic (binary models only)). Information about the underlying record linkage process can be incorporated into the method if available (e.g., assumed overall mismatch rate, safe matches, predictors of match status, or predicted probabilities of correct matches).
Authors:	Priyanjali Bukke [aut, cre], Zhenbang Wang [aut], Martin Slawski [aut] ([email protected]), Brady T. West [aut], Emanuel Ben-David [aut], Guoqing Diao [aut]
Maintainer:	Priyanjali Bukke <[email protected]>
License:	GPL-2
Version:	0.1.1
Built:	2025-03-03 04:32:45 UTC
Source:	https://github.com/bpriy/pldamixture

Post-Linkage Data Analysis Based on Mixture Modelling

Description

pldamixture implements the "General Framework for Regression with Mismatched Data" developed by Slawski et al. (2023). The framework uses a mixture model for pairs of linked records whose two components reflect distributions conditional on match status, i.e., correct match or mismatch. Inference is based on composite likelihood and the EM algorithm.
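
To make the mixture idea concrete, the following is a minimal sketch in base R. It is not the package's implementation: it assumes a Gaussian outcome model, a constant mismatch rate (called `alpha` here), and a Gaussian approximation to the marginal density of the response. The simulated data and all variable names are hypothetical.

```r
# Toy two-component mixture for linked data, fitted by EM (base R only).
# Assumptions (not taken from the package): constant mismatch rate alpha,
# Gaussian outcome model, Gaussian approximation to the marginal density of y.
set.seed(1)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.3)

# Mimic mismatches by permuting the responses of 100 randomly chosen records.
mis <- sample(n, size = 100)
y[mis] <- y[sample(mis)]

# Mismatch component: y is unrelated to x, so use the marginal density of y.
fy <- dnorm(y, mean = mean(y), sd = sd(y))

# Starting values from the naive (unadjusted) fit.
beta <- coef(lm(y ~ x))
sigma <- sd(y)
alpha <- 0.5

for (iter in seq_len(200)) {
  # E-step: posterior probability that each record is a correct match.
  f_match <- dnorm(y, mean = beta[1] + beta[2] * x, sd = sigma)
  p <- (1 - alpha) * f_match / ((1 - alpha) * f_match + alpha * fy)

  # M-step: weighted least squares, weighted residual scale, mismatch rate.
  beta <- coef(lm(y ~ x, weights = p))
  res <- y - (beta[1] + beta[2] * x)
  sigma <- sqrt(sum(p * res^2) / sum(p))
  alpha <- 1 - mean(p)
}

beta   # adjusted coefficient estimates
alpha  # estimated overall mismatch rate
```

In this sketch, the vector `p` plays the role of the posterior correct match probabilities, and records with small `p` contribute little to the weighted regression.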

The package provides four user-facing functions:
fit_mixture
print.fitmixture
summary.fitmixture
predict.fitmixture

Note

The references below discuss the implemented framework in more detail.

*Corresponding Author ([email protected])

References

Slawski, M.*, West, B. T., Bukke, P., Diao, G., Wang, Z., & Ben-David, E. (2023). A General Framework for Regression with Mismatched Data Based on Mixture Modeling. Under Review. <doi:10.48550/arXiv.2306.00909>

Bukke, P., Ben-David, E., Diao, G., Slawski, M.*, & West, B. T. (2023). Cox Proportional Hazards Regression Using Linked Data: An Approach Based on Mixture Modelling. Under Review.

Slawski, M.*, Diao, G., & Ben-David, E. (2021). A pseudo-likelihood approach to linear regression with partially shuffled data. Journal of Computational and Graphical Statistics, 30(4), 991-1003. <doi:10.1080/10618600.2020.1870482>

Examples

# Optional inputs for a linear regression of age at death on year of birth,
# using a cubic polynomial specification.
library(pldamixture)
## use commonness of names as predictors of match status
## (first and last names were used for linkage)
mformula <- ~ commf + comml
## hand-linked records are considered "safe" matches
safematches <- lifem$hndlnk == "Hand-Linked At Some Level"
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05

fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula = mformula,
                   safematches = safematches, mrate = mrate)
print(fit)
summary(fit)
predict(fit)
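
The cubic polynomial specification used in the example can be illustrated on its own. The sketch below uses only base R and made-up data (the names `u` and `y_demo` are hypothetical); it shows that `poly(u, 3, raw = TRUE)` expands to the plain powers of its argument, so the fitted coefficients are those of an ordinary cubic.

```r
# Raw polynomial basis: columns are u, u^2, u^3 (no orthogonalization).
u <- seq(0, 1, by = 0.1)
P <- poly(u, 3, raw = TRUE)

# Made-up response for illustration only.
set.seed(3)
y_demo <- 60 + 5 * u - 3 * u^2 + 2 * u^3 + rnorm(length(u), sd = 0.1)

# Both specifications describe the same cubic regression.
coef_poly <- coef(lm(y_demo ~ poly(u, 3, raw = TRUE)))
coef_pow  <- coef(lm(y_demo ~ u + I(u^2) + I(u^3)))
```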

Adjustment Method

Description

Perform regression adjusted for mismatched data. The function currently supports Cox Proportional Hazards Regression (right-censored data only) and Generalized Linear Regression Models (Gaussian, Gamma, Poisson, and Logistic (binary models only)). Information about the underlying record linkage process can be incorporated into the method if available (e.g., assumed overall mismatch rate, safe matches, predictors of match status, or predicted probabilities of correct matches).

Usage

fit_mixture(
  formula,
  data,
  family = "gaussian",
  mformula,
  safematches,
  mrate,
  control = list(initbeta = "default", initgamma = "default", fy = "default",
                 maxiter = 1000, tol = 1e-04, cmaxiter = 1000),
  ...
)

Arguments

`formula`	a formula object for the outcome model, with the covariate(s) on the right of "~" and the response on the left. In the Cox proportional hazards setting, the response should be provided using the `Surv` function and the covariates should be separated by + signs.
`data`	a data.frame with the linked data used in "formula" and "mformula" (optional)
`family`	the type of regression model ("gaussian" - default, "poisson", "binomial", "gamma", "cox"). For Generalized Linear Models, standard link functions are used ("identity" for Gaussian, "log" for Poisson and Gamma, and "logit" for binomial).
`mformula`	a one-sided formula object for the mismatch indicator model, with the covariates on the right of "~". The default is an intercept-only model (corresponding to a constant mismatch rate).
`safematches`	an indicator variable for safe matches (TRUE : record can be treated as a correct match and FALSE : record may be mismatched). The default is FALSE for all matches.
`mrate`	the assumed overall mismatch rate (a proportion between 0 and 1). If not provided, no overall mismatch rate is assumed.
`control`	an optional list variable to customize the initial parameter estimates ("initbeta" for the outcome model and "initgamma" for the mismatch indicator model), estimated marginal density of the response ("fy"), maximum iterations for the EM algorithm ("maxiter"), maximum iterations for the subroutine in the constrained logistic regression function ("cmaxiter"), and convergence tolerance for the termination of the EM algorithm ("tol").
`...`	the option to directly pass "control" arguments
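
The following base R sketch illustrates how the optional linkage inputs relate to one another. It is an illustration under stated assumptions, not the package's internal code: the data frame `linked` is a hypothetical stand-in whose column names mimic lifem, the label "Purely Machine-Linked" and the coefficient vector `gamma` are made up, and a logistic model for match status is assumed.

```r
# Hypothetical stand-in for a linked data set; column names mimic lifem.
set.seed(2)
n <- 8
linked <- data.frame(
  hndlnk = rep(c("Hand-Linked At Some Level", "Purely Machine-Linked"), each = 4),
  commf = runif(n),
  comml = runif(n)
)

# Logical indicator: TRUE marks records treated as correct matches.
safematches <- linked$hndlnk == "Hand-Linked At Some Level"

# Design matrix implied by a one-sided formula such as ~ commf + comml.
Z <- model.matrix(~ commf + comml, data = linked)

# Made-up coefficients of an assumed logistic model for correct-match status.
gamma <- c(3, -1, -1)
p_correct <- as.vector(plogis(Z %*% gamma))

# Safe matches are treated as correct with probability one.
p_correct[safematches] <- 1

# Implied average mismatch rate, to compare with an assumed overall rate.
mrate <- 0.05
implied_rate <- mean(1 - p_correct)
```

Comparing `implied_rate` with `mrate` gives a rough sense of whether a candidate match-status model is compatible with the assumed overall mismatch rate.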

Value

a list of results from the function called depending on the "family" specified.

`coefficients`	the outcome model coefficient estimates
`match.prob`	the posterior correct match probabilities for observations given parameter estimates
`objective`	a variable that tracks the negative log pseudo-likelihood for all iterations of the EM algorithm.
`family`	the type of (outcome) regression model
`standard.errors`	the estimated standard errors
`m.coefficients`	the correct match model coefficient estimates
`call`	the matched call
`wfit`	an internal-use object for the predict function
`dispersion`	the dispersion parameter estimate when the family is a Generalized Linear Model
`Lambdahat_0`	the baseline cumulative hazard (using weighted Breslow estimator) when the family is "cox"
`g_Lambdahat_0`	the baseline cumulative hazard for the marginal density of the response variable (using Nelson-Aalen estimator) when the family is "cox"
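
For readers unfamiliar with the Nelson-Aalen estimator mentioned above, here is a minimal, unweighted sketch in base R. The function `nelson_aalen()` and the toy data are hypothetical and are not part of the package; the weighted Breslow estimator additionally involves observation weights and estimated risk scores and is not reproduced here.

```r
# Unweighted Nelson-Aalen estimate of the cumulative hazard (base R only).
# time: follow-up times; status: 1 = event observed, 0 = right-censored.
nelson_aalen <- function(time, status) {
  event_times <- sort(unique(time[status == 1]))
  # number of events and number still at risk at each distinct event time
  d <- sapply(event_times, function(t) sum(time == t & status == 1))
  n_risk <- sapply(event_times, function(t) sum(time >= t))
  data.frame(time = event_times, cumhaz = cumsum(d / n_risk))
}

# Toy right-censored sample.
time <- c(2, 3, 3, 5, 8)
status <- c(1, 1, 0, 1, 0)
na_est <- nelson_aalen(time, status)
na_est
# cumulative hazard increments: 1/5 at t = 2, 1/4 at t = 3, 1/2 at t = 5
```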

Note

The references below discuss the implemented framework in more detail. The standard errors are estimated using Louis' method for the "cox" family (Bukke et al., 2023) and using the sandwich formula otherwise (Slawski et al., 2023).

*Corresponding Author ([email protected])

References

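Slawski, M.*, West, B. T., Bukke, P., Diao, G., Wang, Z., & Ben-David, E. (2023). A General Framework for Regression with Mismatched Data Based on Mixture Modeling. Under Review. <doi:10.48550/arXiv.2306.00909>

Bukke, P., Ben-David, E., Diao, G., Slawski, M.*, & West, B. T. (2023). Cox Proportional Hazards Regression Using Linked Data: An Approach Based on Mixture Modelling. Under Review.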

Examples

## commonness score of first and last names used for linkage
mformula <- ~ commf + comml
## hand-linked records are considered "safe" matches
safematches <- lifem$hndlnk == "Hand-Linked At Some Level"
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05

fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula = mformula,
                   safematches = safematches, mrate = mrate)

LIFE-M Data

Description

The lifem data set contains a subset of data from the Life-M project (https://life-m.org/) on 3,238 individuals born between 1883 and 1906. These records were obtained by linking birth certificates and death certificates in one of two ways. A fraction of the records (2,159 records) were randomly sampled to be “hand-linked at some level” (HL). These records are of high quality and were manually linked at some point by trained research assistants. The remaining records were “purely machine-linked” (ML) based on probabilistic record linkage without clerical review. The Life-M team expects the mismatch rate among these records to be around 5% (Bailey et al. 2022). Of interest is the relationship between age at death and year of birth. The lifem demo data set consists of 2,159 hand-linked records and 1,079 records that were randomly sampled from the purely machine-linked records (~2:1 HL-ML ratio).

Usage

data(lifem)

Format

a data frame with 3,238 rows and 6 variables

Details

yob: year of birth (between 1883 and 1906)
unit_yob: yob re-scaled to the unit interval for analysis (between 0 and 1). If X is the yob, we use the following: (X - min(X)) / (max(X) - min(X)) = a * X + b, where a = 1 / (max(X) - min(X)) and b = -min(X) * a
age_at_death: age at death (in years)
hndlnk: whether record was purely machine-linked or hand-linked at some level.
commf: commonness score of first name (between 0 and 1). It is based on the 1940 census. It is a ratio of the log count of the individual’s first name over the log count of the most commonly occurring first name in the census.
comml: commonness score of last name (between 0 and 1). It is based on the 1940 census. It is a ratio of the log count of the individual’s last name over the log count of the most commonly occurring last name in the census.
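
The two derived quantities described above can be sketched in base R as follows. The year values and name counts are made up for illustration and are not taken from the lifem data or the 1940 census.

```r
# Re-scaling year of birth to the unit interval via a * X + b.
yob <- c(1883, 1890, 1899, 1906)
a <- 1 / (max(yob) - min(yob))
b <- -min(yob) * a
unit_yob <- a * yob + b

# Commonness score: log count of a name over the log count of the most
# common name. The counts below are hypothetical.
name_counts <- c(john = 50000, mary = 30000, zebulon = 12)
comm <- log(name_counts) / log(max(name_counts))
```

The most common name receives a score of 1, and rarer names receive scores closer to 0.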

References

Bailey, Martha J., Lin, Peter Z., Mohammed, A.R. Shariq, Mohnen, Paul, Murray, Jared, Zhang, Mengying, and Prettyman, Alexa. LIFE-M: The Longitudinal, Intergenerational Family Electronic Micro-Database. Ann Arbor, MI: Inter-university Consortium for Political and Social Research (distributor), 2022-12-21. <doi:10.3886/E155186V5>

Predictions From a "fitmixture" Object

Description

Obtain predictions from a fit_mixture() object using predict.coxph(), predict.glm(), or predict.lm().

Usage

## S3 method for class 'fitmixture'
predict(
  object,
  newdata,
  type,
  terms = NULL,
  na.action = na.pass,
  reference = "strata",
  ...
)

Arguments

`object`	the result of a call to `fit_mixture()`
`newdata`	optional new data to obtain predictions for. The original data is used by default.
`type`	the type of prediction. For the "cox" family, the choices are the linear predictor ("lp"), the risk score exp(lp) ("risk"), the expected number of events given the covariates and follow-up time ("expected"), and the terms of the linear predictor ("terms"). The survival probability for a subject is equal to exp(-expected). For the "gaussian" family, the choices are response ("response") or model term ("terms"). For the other glm families ("poisson", "binomial", "gamma"), the choices are predictions on the scale of the linear predictors ("link"), response ("response"), or model term ("terms").
`terms`	the terms when type = "terms". By default, all terms are included.
`na.action`	a function for what to do with missing values in `newdata`. The default is to predict "NA".
`reference`	when family = "cox", the reference for centering predictions. The available options are "strata" (default), "sample", and "zero".
`...`	for future predict arguments
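
The relationships between the prediction types can be checked with a short base R sketch. It uses `glm()` and `predict()` from the stats package on simulated data rather than a fit_mixture() object, and the Cox-type quantities (`lp`, `H0_t`) are made-up numbers that ignore centering; treat it as an illustration of the scales involved, not of the package's output.

```r
# GLM with a log link: "response" predictions are exp() of "link" predictions.
set.seed(42)
x <- runif(50)
y <- rpois(50, lambda = exp(0.5 + x))
fit_glm <- glm(y ~ x, family = poisson())
eta <- predict(fit_glm, type = "link")      # scale of the linear predictor
mu  <- predict(fit_glm, type = "response")  # scale of the response
all.equal(exp(eta), mu)

# Cox-type quantities for three hypothetical subjects.
lp <- c(-0.5, 0, 0.7)    # linear predictors ("lp")
risk <- exp(lp)          # risk scores ("risk")
H0_t <- 0.2              # assumed baseline cumulative hazard at follow-up time
expected <- H0_t * risk  # expected number of events ("expected")
surv <- exp(-expected)   # survival probability, exp(-expected)
```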

Value

a vector or matrix of predictions based on arguments specified.

Examples

## commonness score of first and last names used for linkage
mformula <- ~ commf + comml
## hand-linked records are considered "safe" matches
safematches <- lifem$hndlnk == "Hand-Linked At Some Level"
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05
fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula = mformula,
                   safematches = safematches, mrate = mrate)

predict(fit)

Print a "fitmixture" Object

Description

Print call and outcome model coefficients from a fit_mixture() object

Usage

## S3 method for class 'fitmixture'
print(x, digits = max(3L, getOption("digits") - 3L), ...)

Arguments

`x`	the result of a call to `fit_mixture()`
`digits`	the number of significant digits to print
`...`	for additional print arguments

Value

invisibly returns the fit_mixture() object that is provided as an argument

Examples

## commonness score of first and last names used for linkage
mformula <- ~ commf + comml
## hand-linked records are considered "safe" matches
safematches <- lifem$hndlnk == "Hand-Linked At Some Level"
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05
fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula = mformula,
                   safematches = safematches, mrate = mrate)

print(fit)

Summarize a "fitmixture" Object

Description

Summarize results from a fit_mixture() object

Usage

## S3 method for class 'fitmixture'
summary(object, ...)

Arguments

`object`	the result of a call to `fit_mixture()`
`...`	for additional summary arguments

Value

a list of results from the function called depending on the "family" specified.

`call`	the matched call
`family`	the assumed type of (outcome) regression model
`coefficients`	a matrix with the outcome model's coefficient estimates, standard errors, t or z values, and p-values
`m.coefficients`	a matrix with the correct match model's coefficient estimates and standard errors
`avgcmr`	the average correct match rate among all records
`match.prob`	the posterior correct match probabilities for observations given parameter estimates
`dispersion`	the dispersion parameter estimate when the family is a Generalized Linear Model

Examples

## commonness score of first and last names used for linkage
mformula <- ~ commf + comml
## hand-linked records are considered "safe" matches
safematches <- lifem$hndlnk == "Hand-Linked At Some Level"
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05
fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula = mformula,
                   safematches = safematches, mrate = mrate)

summary(fit)
