Outcome model with exact grid likelihood evaluation

Fits a logistic regression of the binary outcome on SNP genotype(s) plus covariates, then evaluates the profile log-likelihood across a pre-specified grid of beta_ZY values. This is the core methodological function for federated MR: each site runs this locally and shares only the resulting log-likelihood profile vector.

Two analysis modes are supported: alleleScore fits a single model using a weighted allele score as the genetic exposure variable, while perSNP fits separate models for each SNP (needed for multi-SNP sensitivity analyses).

Usage

fitOutcomeModel(
  cohortData,
  covariateData = NULL,
  instrumentTable,
  betaGrid = seq(-3, 3, by = 0.01),
  regularizationVariance = 0.1,
  instrumentRegularization = FALSE,
  outcomeType = "binary",
  analysisType = "alleleScore",
  siteId = "site_1",
  modelBackend = "glm"
)

Arguments

cohortData: Data frame. Output of buildMRCohort.
covariateData: Covariate data. Output of buildMRCovariates or a data frame with person_id and covariate columns.
instrumentTable: Data frame. Output of getMRInstruments.
betaGrid: Numeric vector. Grid of beta_ZY values at which to evaluate the profile log-likelihood. Default is seq(-3, 3, by = 0.01) (601 grid points).
regularizationVariance: Numeric. When modelBackend = "cyclops", this is the variance of the normal prior applied to nuisance coefficients. Smaller values imply stronger shrinkage; use Inf for an unpenalized Cyclops fit. Ignored by the "glm" backend. Default is 0.1.
instrumentRegularization: Logical. When modelBackend = "cyclops", whether the exposure coefficient is included in the Cyclops prior. Default is FALSE so the profiled exposure coefficient remains unpenalized. Ignored by the "glm" backend.
outcomeType: Character. Type of outcome: "binary" for logistic regression. Default is "binary".
analysisType: Character. "alleleScore" for single weighted score model or "perSNP" for separate per-SNP models. Default is "alleleScore".
siteId: Character. Identifier for this site. Included in the returned profile object. Default is "site_1".
modelBackend: Character. Outcome-model fitting backend: "glm" uses base R logistic regression, while "cyclops" uses Cyclops for scalable logistic regression with optional Gaussian shrinkage on nuisance covariates. Default is "glm".

Value

A list with class "medusaSiteProfile" containing:

siteId: Character site identifier.
betaGrid: Numeric vector of grid points (same as input).
logLikProfile: Numeric vector of profile log-likelihood values, same length as betaGrid. In perSNP mode this remains the valid allele-score profile used for pooling.
nCases: Number of outcome cases.
nControls: Number of controls.
snpIds: Character vector of SNP IDs used.
diagnosticFlags: List of diagnostic flags from the model fit.
betaHat: Point estimate of beta_ZY (MLE from profile).
seHat: Approximate standard error from profile curvature.
perSnpEstimates: When analysisType = "perSNP", a data frame of per-SNP beta_ZY / se_ZY estimates for summarized-data sensitivity analyses.

This object contains no individual-level data and is safe to share.

Details

Fit Outcome Model and Evaluate Profile Log-Likelihood

The profile log-likelihood evaluation works as follows:

Fit the unconstrained model to obtain the MLE and its Wald SE.
At each grid point, fix beta_ZY to that value and re-fit the nuisance parameters using the fixed term as an offset.
Record the maximized constrained log-likelihood at each grid point. This is the exact profile likelihood for the coefficient in a logistic generalized linear model.

When modelBackend = "cyclops" and regularizationVariance is finite, the nuisance parameters are estimated under a Gaussian prior at both the unconstrained optimum and every grid point. In that configuration the returned objective is a penalized profile, not an unpenalized MLE profile.

When using the allele score, weights are \(w_j = \gamma_j / \mathrm{SE}(\gamma_j)^2\), normalized by \(\sum_j |w_j|\). The same weights are reused downstream so that the MR denominator matches the exact score fitted at each site.

Missing SNP dosages in the allele-score model are imputed to the expected dosage \(2 \times \mathrm{EAF}_j\) from the instrument table, rather than being treated as homozygous reference. This avoids a systematic downward bias in the score when genotype missingness is present.

References

Suchard, M. A., et al. (2013). Massive parallelization of serial inference algorithms for a complex generalized linear model. ACM Transactions on Modeling and Computer Simulation, 23(1), 1-17.

Examples

simData <- simulateMRData(n = 2000, nSnps = 5, trueEffect = 0.3)
profile <- fitOutcomeModel(
  cohortData = simData$data,
  covariateData = NULL,
  instrumentTable = simData$instrumentTable,
  betaGrid = seq(-2, 2, by = 0.05)
)
#> Fitting outcome model at site 'site_1' (1171 cases, 829 controls)...
#> Site 'site_1': beta_ZY_hat = 0.3273 (SE = 0.1919).
plot(profile$betaGrid, profile$logLikProfile, type = "l",
     xlab = "beta_ZY", ylab = "Profile log-likelihood")