Getting Started with Medusa • Medusa

What is Medusa?

Medusa (Mendelian Estimation in Distributed Standardized Analytics) is an R package that implements two-sample Mendelian Randomization (MR) within the OHDSI ecosystem using the OMOP Common Data Model.

Mendelian Randomization uses genetic variants as instrumental variables to estimate causal effects of exposures on outcomes. Because genetic variants are randomly allocated at conception (like a natural randomized trial), MR can provide evidence for causality from observational data.

Two-sample MR separates the analysis into two parts:

SNP-exposure associations (beta_ZX): Obtained from published GWAS (typically from large biobank studies)
SNP-outcome associations (beta_ZY): Estimated from your own cohort data

The causal estimate is the Wald ratio: beta_MR = beta_ZY / beta_ZX.

This vignette is a runnable sandbox: every example below uses synthetic data, so the goal is to show the workflow and the shape of the outputs rather than to make a scientific claim.

Installation

# Install from GitHub
remotes::install_github("OHDSI/Medusa")

# This will also install OHDSI dependencies:
# DatabaseConnector, SqlRender, Cyclops, FeatureExtraction

The Federated Approach

Medusa’s key innovation is one-shot federated pooling. Here’s how it works:

The coordinator assembles genetic instruments from public GWAS data
Each site (hospital/biobank with OMOP CDM + genomic data):
- Extracts the outcome cohort and genotype data locally
- Fits a regression model locally
- Computes a profile log-likelihood curve (a vector of numbers)
- Shares ONLY this vector — no individual-level data leaves the site
The coordinator sums the log-likelihood vectors across sites
The causal estimate is extracted from the combined curve

This is privacy-preserving by design: the shared data is a smooth numeric vector that cannot be reverse-engineered to identify individual patients.

Quick Start with Simulated Data

No database connection needed for this example.

library(Medusa)

betaGrid <- seq(-2, 2, by = 0.05)

# Simulate MR data with a known causal effect of 0.5 (log-OR)
simData <- simulateMRData(
  n = 5000,
  nSnps = 10,
  trueEffect = 0.5,
  seed = 42
)

# Look at the instrument table
head(simData$instrumentTable[, c("snp_id", "beta_ZX", "se_ZX", "eaf")])
#>   snp_id   beta_ZX      se_ZX       eaf
#> 1    rs1 0.1784231 0.04386823 0.2846324
#> 2    rs2 0.3862010 0.07320910 0.3445011
#> 3    rs3 0.2940426 0.05558554 0.2029028
#> 4    rs4 0.1500947 0.02021748 0.2164829
#> 5    rs5 0.4452723 0.05744303 0.1043854
#> 6    rs6 0.4793527 0.06068079 0.1473405

Fit the Outcome Model

# Fit the outcome model and get the profile likelihood
profileSiteA <- fitOutcomeModel(
  cohortData = simData$data,
  covariateData = NULL,
  instrumentTable = simData$instrumentTable,
  betaGrid = betaGrid,
  siteId = "site_A"
)
#> Fitting outcome model at site 'site_A' (3452 cases, 1548 controls)...
#> Site 'site_A': beta_ZY_hat = 0.6137 (SE = 0.1410).

# The profile contains no individual-level data
names(profileSiteA)
#>  [1] "siteId"          "betaGrid"        "logLikProfile"   "nCases"         
#>  [5] "nControls"       "snpIds"          "diagnosticFlags" "betaHat"        
#>  [9] "seHat"           "scoreDefinition"
cat(sprintf("beta_ZY estimate: %.3f (SE: %.3f)\n",
            profileSiteA$betaHat, profileSiteA$seHat))
#> beta_ZY estimate: 0.614 (SE: 0.141)

Pool Profiles (Simulating Federation)

# Create synthetic partner sites that reuse the same allele-score definition.
# This keeps the pooled likelihood compatible with the denominator used later
# in computeMREstimate().
makePartnerProfile <- function(template, siteId, shift, infoScale, nCases, nControls) {
  partner <- template
  partner$siteId <- siteId
  partner$betaHat <- template$betaHat + shift
  partner$seHat <- template$seHat / sqrt(infoScale)
  partner$logLikProfile <- -0.5 * infoScale *
    ((template$betaGrid - partner$betaHat) / template$seHat)^2
  partner$logLikProfile <- partner$logLikProfile - max(partner$logLikProfile)
  partner$nCases <- nCases
  partner$nControls <- nControls
  partner
}

profiles <- list(
  site_A = profileSiteA,
  site_B = makePartnerProfile(profileSiteA, "site_B", shift = 0.06,
                              infoScale = 1.1, nCases = 295, nControls = 2705),
  site_C = makePartnerProfile(profileSiteA, "site_C", shift = -0.04,
                              infoScale = 0.9, nCases = 260, nControls = 2740)
)

# Pool the profiles
combined <- poolLikelihoodProfiles(profiles)
#> Pooling profile likelihoods from 3 site(s)...
#> Pooling complete: 3 sites, 4007 total cases, 6993 total controls.
cat(sprintf("Pooled %d sites: %d total cases, %d total controls\n",
            combined$nSites, combined$totalCases, combined$totalControls))
#> Pooled 3 sites: 4007 total cases, 6993 total controls

Compute the MR Estimate

estimate <- computeMREstimate(combined, simData$instrumentTable)
#> MR estimate: beta = 2.1158 (95% CI: 1.7631, 2.6447), p = 3.99e-12
#> Odds ratio: 8.296 (95% CI: 5.831, 14.079)

cat(sprintf("Causal estimate (beta_MR): %.3f\n", estimate$betaMR))
#> Causal estimate (beta_MR): 2.116
cat(sprintf("95%% CI: [%.3f, %.3f]\n", estimate$ciLower, estimate$ciUpper))
#> 95% CI: [1.763, 2.645]
cat(sprintf("P-value: %.2e\n", estimate$pValue))
#> P-value: 3.99e-12
cat(sprintf("Odds ratio: %.3f\n", estimate$oddsRatio))
#> Odds ratio: 8.296

The key detail is that the pooled profile carries the exact allele-score definition used at each site. Passing the same instrument table back into computeMREstimate() keeps the Wald ratio denominator aligned with the fitted score rather than with an unrelated simulated instrument set.

Run Sensitivity Analyses

# Refit the SAME synthetic cohort in per-SNP mode so the sensitivity analyses
# use summaries that are aligned with the example above.
profilePerSnp <- fitOutcomeModel(
  cohortData = simData$data,
  covariateData = NULL,
  instrumentTable = simData$instrumentTable,
  betaGrid = betaGrid,
  siteId = "site_A",
  analysisType = "perSNP"
)
#> Fitting outcome model at site 'site_A' (3452 cases, 1548 controls)...
#> Site 'site_A': beta_ZY_hat = 0.6137 (SE = 0.1410).

results <- runSensitivityAnalyses(
  profilePerSnp$perSnpEstimates,
  methods = c("IVW", "MREgger", "WeightedMedian", "LeaveOneOut"),
  engine = "internal"
)
#> Running sensitivity analyses with 10 SNPs...
#>   Engine: internal
#>   IVW...
#>   MR-Egger...
#>   Weighted Median...
#>   Leave-One-Out...
#> Sensitivity analyses complete.
print(results$summary)
#>            method   beta_MR      se_MR  ci_lower  ci_upper         pval
#> 1             IVW 0.4766943 0.05242835 0.3739347 0.5794538 9.696701e-20
#> 2        MR-Egger 0.6596779 0.11148672 0.4411639 0.8781918 3.548477e-04
#> 3 Weighted Median 0.5123123 0.06656621 0.3818425 0.6427821 1.400833e-14

Understanding the Report

The generateMRReport() function creates a self-contained HTML report with:

Executive Summary — Plain-language interpretation with strength rating
Instrument Summary — Table of all SNPs with F-statistics
Likelihood Profile — Visual of site and combined likelihood curves
Main Result — Odds ratio with confidence interval
Sensitivity Analyses — Forest plot comparing methods
Diagnostics — instrument PheWAS, allele-frequency checks, missingness
Methods Section — Auto-generated text for manuscripts

FAQ

What genomic data is required? Your OMOP CDM site needs the VARIANT_OCCURRENCE table from the OMOP Genomic CDM. The minimal required columns are person_id, rs_id, and genotype. Genotypes can be stored as VCF-style strings (“0/0”, “0/1”, “1/1”) or plain integers (“0”, “1”, “2”). The reference_allele and alternate_allele columns are also used for allele harmonization when available.

What if my site has no genomic data? Medusa requires at least one site with the OMOP Genomic Extension tables populated. Sites without genomic data cannot participate in the federated analysis.

How many sites are needed? A single site is sufficient. More sites increase statistical power and allow for cross-site validation. The federated approach adds value with 2+ sites.

What OMOP CDM version is required? Medusa supports OMOP CDM v5.3 and v5.4.