
Course 5: From Theory to the Medusa Workflow
Course05-FromTheoryToMedusaWorkflow.RmdWhat you’ll learn
By the end of this chapter, you should be able to:
- map textbook two-sample MR quantities onto Medusa objects,
- explain the roles of the coordinator and local sites,
- explain why Medusa shares profile likelihoods instead of person-level data,
- walk through the main Medusa analysis functions on simulated data.
Why this matters
The earlier chapters explained two-sample MR as a summarized-data method. Medusa implements that same causal logic in a privacy-preserving federated workflow.
The exposure side still comes from external summary statistics. The main difference is how the outcome side is estimated and shared.
The Medusa division of labor
Coordinator tasks
The coordinator typically:
- defines the scientific question,
- assembles or imports the instrument table,
- distributes the instrument table to participating sites,
- receives site-level summary artifacts,
- pools them and computes the final MR estimate.
Site tasks
Each site typically:
- extracts the outcome cohort,
- harmonizes genotype coding,
- fits the outcome model locally,
- exports a profile likelihood summary,
- optionally exports per-SNP summary estimates for sensitivity analyses.
Key takeaway: Medusa separates the exposure side, the local outcome-model fitting, and the final pooled MR estimation.
How Medusa represents the MR ingredients
In textbook notation:
-
beta_ZX: SNP-exposure associations from external GWAS, -
beta_ZY: SNP-outcome association estimated from the outcome sample.
In Medusa:
- the instrument table stores the SNP-exposure side (
beta_ZX,se_ZX, alleles, effect allele frequency), -
fitOutcomeModel()estimates the outcome-side signal locally, -
poolLikelihoodProfiles()combines site summaries, -
computeMREstimate()converts the pooled profile into the final MR estimate.
Why profile likelihood is shared
Medusa’s primary federated estimate does not ask each site to export all person-level outcomes or raw genotypes. Instead, each site evaluates a profile log-likelihood over a grid of candidate SNP-outcome effect values.
The coordinator receives:
- the grid,
- the log-likelihood values,
- metadata such as case counts,
- the allele-score definition.
Because log-likelihoods from independent sites add, the coordinator can pool the profiles exactly by summation.
This is what makes the approach both federated and statistically principled.
A runnable Medusa example
library(Medusa)
beta_grid <- seq(-2, 2, by = 0.05)
sim_data <- simulateMRData(
n = 4000,
nSnps = 8,
trueEffect = 0.40,
seed = 1005
)
head(sim_data$instrumentTable[, c("snp_id", "beta_ZX", "se_ZX", "eaf")])
#> snp_id beta_ZX se_ZX eaf
#> 1 rs1 0.4936651 0.06309374 0.1736507
#> 2 rs2 0.1734556 0.02556424 0.2347729
#> 3 rs3 0.1970980 0.04546400 0.1961966
#> 4 rs4 0.3080739 0.05459666 0.1726639
#> 5 rs5 0.4419107 0.05693616 0.3474797
#> 6 rs6 0.4473812 0.07394154 0.2378871Step 1: fit the local outcome model
profile_site_a <- fitOutcomeModel(
cohortData = sim_data$data,
covariateData = NULL,
instrumentTable = sim_data$instrumentTable,
betaGrid = beta_grid,
siteId = "site_A"
)
#> Fitting outcome model at site 'site_A' (2621 cases, 1379 controls)...
#> Site 'site_A': beta_ZY_hat = 0.4640 (SE = 0.1438).
names(profile_site_a)
#> [1] "siteId" "betaGrid" "logLikProfile" "nCases"
#> [5] "nControls" "snpIds" "diagnosticFlags" "betaHat"
#> [9] "seHat" "scoreDefinition"This local profile is Medusa’s site-level representation of the SNP-outcome evidence.
Step 2: create partner site summaries
To simulate federation, we create two compatible partner profiles that reuse the same allele-score definition.
make_partner_profile <- function(template, site_id, shift, info_scale, n_cases, n_controls) {
partner <- template
partner$siteId <- site_id
partner$betaHat <- template$betaHat + shift
partner$seHat <- template$seHat / sqrt(info_scale)
partner$logLikProfile <- -0.5 * info_scale *
((template$betaGrid - partner$betaHat) / template$seHat)^2
partner$logLikProfile <- partner$logLikProfile - max(partner$logLikProfile)
partner$nCases <- n_cases
partner$nControls <- n_controls
partner
}
site_profiles <- list(
site_A = profile_site_a,
site_B = make_partner_profile(profile_site_a, "site_B", shift = 0.04,
info_scale = 1.10, n_cases = 230, n_controls = 1770),
site_C = make_partner_profile(profile_site_a, "site_C", shift = -0.03,
info_scale = 0.95, n_cases = 210, n_controls = 1790)
)Step 3: pool the profiles
combined_profile <- poolLikelihoodProfiles(site_profiles)
#> Pooling profile likelihoods from 3 site(s)...
#> Pooling complete: 3 sites, 3061 total cases, 4939 total controls.
combined_profile$nSites
#> [1] 3
combined_profile$totalCases
#> [1] 3061
combined_profile$totalControls
#> [1] 4939Step 4: recover the MR estimate
mr_estimate <- computeMREstimate(combined_profile, sim_data$instrumentTable)
#> MR estimate: beta = 1.4779 (95% CI: 1.1495, 1.9705), p = 1.68e-07
#> Odds ratio: 4.384 (95% CI: 3.157, 7.175)
mr_estimate$betaMR
#> [1] 1.477905
mr_estimate$ciLower
#> [1] 1.149481
mr_estimate$ciUpper
#> [1] 1.97054
mr_estimate$oddsRatio
#> [1] 4.383751Mapping objects back to theory
-
sim_data$instrumentTablecorresponds to the SNP-exposure side. -
profile_site_a$betaHatis the local site’s fitted score-to-outcome effect. -
combined_profile$logLikProfileis the pooled evidence about the outcome-side parameter. -
mr_estimate$betaMRis the causal estimate after dividing the pooled SNP-outcome signal by the exposure effect of the same fitted score.
This last point is important: Medusa’s main federated estimator is built around the exact allele score used locally, not a generic average SNP effect.
Key takeaway: Medusa preserves the logic of two-sample MR while changing the way outcome information is communicated across sites.
Optional per-SNP mode for sensitivity analyses
profile_per_snp <- fitOutcomeModel(
cohortData = sim_data$data,
covariateData = NULL,
instrumentTable = sim_data$instrumentTable,
betaGrid = beta_grid,
siteId = "site_A",
analysisType = "perSNP"
)
#> Fitting outcome model at site 'site_A' (2621 cases, 1379 controls)...
#> Site 'site_A': beta_ZY_hat = 0.4640 (SE = 0.1438).
head(profile_per_snp$perSnpEstimates)
#> snp_id effect_allele other_allele eaf beta_ZY se_ZY beta_ZX
#> 1 rs1 C A 0.1736507 0.15532721 0.06590682 0.4936651
#> 2 rs2 C T 0.2347729 -0.03576511 0.05656689 0.1734556
#> 3 rs3 T C 0.1961966 0.05255082 0.06120696 0.1970980
#> 4 rs4 T G 0.1726639 0.10805974 0.06489736 0.3080739
#> 5 rs5 C A 0.3474797 0.22455152 0.05161634 0.4419107
#> 6 rs6 C T 0.2378871 0.19933685 0.05721299 0.4473812
#> se_ZX pval_ZX pval_ZY
#> 1 0.06309374 5.104392e-15 0.0184346747
#> 2 0.02556424 1.160163e-11 0.5272155342
#> 3 0.04546400 1.455910e-05 0.3905745821
#> 4 0.05459666 1.673798e-08 0.0958954331
#> 5 0.05693616 8.392247e-15 0.0000135892
#> 6 0.07394154 1.444222e-09 0.0004937608Those perSnpEstimates are the package’s bridge back to standard summarized-data MR sensitivity analyses.
Exercise set 1
Exercise 1
In Medusa, where does the SNP-exposure information live?
Show solution
It lives in the instrument table, which carries quantities such as beta_ZX, se_ZX, allele labels, and effect allele frequencies.
Exercise 2
Why does Medusa pool log-likelihood profiles instead of person-level records?
Show solution
Because the site-level log-likelihood profile is a privacy-preserving summary that can be added across independent sites to recover the pooled evidence without sharing person-level genotypes or outcomes.
Exercise set 2: inspect Medusa objects
Use the objects created above and answer:
- Which object stores the number of pooled sites?
- Which object stores the final odds ratio?
Show solution
combined_profile$nSites stores the number of pooled sites, and mr_estimate$oddsRatio stores the final odds ratio.
Chapter summary
- Medusa keeps the exposure side in the instrument table and estimates the outcome side locally.
- Sites share profile-likelihood summaries rather than person-level records.
- The coordinator pools those profiles and computes the MR estimate from the same fitted allele score used at the sites.
- Per-SNP output is available when you want standard summarized-data sensitivity analyses.
References
- Suchard, M.A., Simpson, S.E., Zorych, I., Ryan, P., & Madigan, D. (2013). Massive parallelization of serial inference algorithms for a complex generalized linear model. ACM Transactions on Modeling and Computer Simulation, 23(1), 1–17.
- Duan, R., Boland, M.R., Liu, Z., et al. (2020). Learning from electronic health records across multiple sites: a communication-efficient and privacy-preserving distributed algorithm. Journal of the American Medical Informatics Association, 27(3), 376–385.
- Hemani, G., Zheng, J., Elsworth, B., et al. (2018). The MR-Base platform supports systematic causal inference across the human phenome. eLife, 7, e34408.
Next chapter
Next: Course 6: Doing and critiquing a two-sample MR study
For more package detail after this chapter, see Getting Started with Medusa and Federated Analysis Guide.