Federated Analysis Guide for Network Coordinators • Medusa

Overview

This guide walks OHDSI network coordinators through setting up and running a federated Medusa analysis across multiple sites. The key principle: only site-level summary artifacts leave each site — no individual-level data is shared. The primary artifact is the profile log-likelihood vector, accompanied by site metadata and the allele-score definition needed to verify that every site fit the same score.

The code chunks in this vignette are templates, not executable examples. They are shown with eval = FALSE because every network needs to substitute its own connection details, cohort IDs, file paths, and governance controls.

What Data Leaves Each Site

Shared with coordinator	NOT shared
Log-likelihood profile (numeric vector, ~600 numbers)	Individual genotypes
Number of cases and controls	Person-level outcomes
Optional diagnostic flag summary (logical values)	Covariate values
Optional per-SNP summary estimates for sensitivity analyses	Raw SNP-by-person data
Site identifier	Demographics

The profile vector is a smooth curve that represents the aggregate statistical evidence at a site. It cannot be reverse-engineered to identify individuals. The required export already includes model-level flags such as low case count or an MLE at the grid boundary, plus the fixed allele-score definition so the coordinator can verify consistency across sites. If a network wants IVW, MR-Egger, or weighted-median sensitivity analyses, each site can also share one row per SNP containing beta_ZY and se_ZY. Those are still aggregate summaries, but they are optional and separate from the main pooled likelihood workflow.

Site Setup Requirements

OMOP CDM Requirements

OMOP CDM version 5.3 or 5.4
Standard tables: PERSON, OBSERVATION_PERIOD, CONDITION_OCCURRENCE
A cohort table with the outcome cohort pre-defined (e.g., via ATLAS)

OMOP Genomic Extension (VARIANT_OCCURRENCE)

Each site needs the VARIANT_OCCURRENCE table from the OMOP Genomic CDM. This table stores per-person variant calls. The minimal required columns are:

Column	Type	Description
`person_id`	BIGINT	Links to the PERSON table
`rs_id`	VARCHAR(50)	dbSNP rs identifier (e.g., “rs2228145”)
`genotype`	VARCHAR(50)	Genotype call: VCF-style (“0/0”, “0/1”, “1/1”) or integer (“0”, “1”, “2”)

Additional columns used when available:

Column	Type	Description
`reference_allele`	VARCHAR(255)	Reference allele (used for allele harmonization)
`alternate_allele`	VARCHAR(255)	Alternate allele (used for allele harmonization)

If the genomic extension tables are in a different schema from the main CDM, specify it via the genomicDatabaseSchema parameter (defaults to cdmDatabaseSchema).

R Package Dependencies

# Required at each site
remotes::install_github("OHDSI/Medusa")

# This installs: DatabaseConnector, SqlRender, Cyclops, FeatureExtraction

Ancestry Principal Components (Optional but Recommended)

A table with ancestry PCs controls for population stratification:

CREATE TABLE genomics.ancestry_pcs (
  person_id BIGINT NOT NULL,
  pc1 FLOAT, pc2 FLOAT, pc3 FLOAT, pc4 FLOAT, pc5 FLOAT,
  pc6 FLOAT, pc7 FLOAT, pc8 FLOAT, pc9 FLOAT, pc10 FLOAT
);

Template Site Analysis Script

Send this script to each participating site, customized with their local connection details:

# ============================================================
# Medusa Site Analysis Script
# Study: [YOUR STUDY NAME]
# Site: [SITE NAME]
# Date: [DATE]
# ============================================================

library(Medusa)

# ----- Site-specific configuration -----
connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "postgresql",              # Change to your DBMS
  server = "localhost/ohdsi",       # Change to your server
  user = "ohdsi_user",             # Change to your user
  password = keyring::key_get("ohdsi_db")  # Use secure credential storage
)

cdmDatabaseSchema <- "cdm"           # Change to your CDM schema
cohortDatabaseSchema <- "results"     # Change to your results schema
cohortTable <- "cohort"               # Change if different
outcomeCohortId <- 1234               # Provided by coordinator
genomicDatabaseSchema <- "genomics"   # Schema with VARIANT_OCCURRENCE (default: cdmDatabaseSchema)
siteId <- "site_A"                    # Unique identifier for this site

# ----- Load instrument table (provided by coordinator) -----
instrumentTable <- read.csv("instruments.csv", stringsAsFactors = FALSE)

# ----- Step 1: Build cohort -----
cohortData <- buildMRCohort(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTable = cohortTable,
  outcomeCohortId = outcomeCohortId,
  instrumentTable = instrumentTable,
  genomicDatabaseSchema = genomicDatabaseSchema,
  washoutPeriod = 365,
  excludePriorOutcome = TRUE
)

# ----- Step 2: Build covariates -----
covariateData <- buildMRCovariates(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTable = cohortTable,
  outcomeCohortId = outcomeCohortId
)

# ----- Step 3: Run diagnostics -----
diagnostics <- runInstrumentDiagnostics(
  cohortData = cohortData,
  covariateData = covariateData,
  instrumentTable = instrumentTable
)

# ----- Step 4: Fit outcome model -----
# Use the SAME betaGrid as all other sites
betaGrid <- seq(-3, 3, by = 0.01)

profile <- fitOutcomeModel(
  cohortData = cohortData,
  covariateData = covariateData,
  instrumentTable = instrumentTable,
  betaGrid = betaGrid,
  siteId = siteId,
  analysisType = "alleleScore"
)

# ----- Step 5: Export and share -----
# Required for the primary federated MR estimate:
exportSiteProfile(profile, outputDir = ".", prefix = "medusa")

# Optional: share a one-line-per-check summary of the richer diagnostics object.
utils::write.csv(
  data.frame(
    check = names(diagnostics$diagnosticFlags),
    flag = unname(diagnostics$diagnosticFlags),
    stringsAsFactors = FALSE
  ),
  sprintf("medusa_diagnostic_flags_%s.csv", siteId),
  row.names = FALSE
)

# Optional: only run this block if the coordinator requested per-SNP summaries
# for IVW / MR-Egger / weighted-median sensitivity analyses.
profilePerSnp <- fitOutcomeModel(
  cohortData = cohortData,
  covariateData = covariateData,
  instrumentTable = instrumentTable,
  betaGrid = betaGrid,
  siteId = siteId,
  analysisType = "perSNP"
)
utils::write.csv(
  profilePerSnp$perSnpEstimates,
  sprintf("medusa_per_snp_%s.csv", siteId),
  row.names = FALSE
)

message("Analysis complete. Share the required profile CSVs and, if requested, the per-SNP summary CSV.")

Coordinator Pooling Script

After collecting profile CSV files from all sites:

library(Medusa)

# Load instrument table
instrumentTable <- read.csv("instruments.csv", stringsAsFactors = FALSE)

# Import site profiles from CSV files
siteProfiles <- list(
  site_A = importSiteProfile("medusa_profile_site_A.csv"),
  site_B = importSiteProfile("medusa_profile_site_B.csv"),
  site_C = importSiteProfile("medusa_profile_site_C.csv")
)

# Pool
combined <- poolLikelihoodProfiles(siteProfiles)

# Estimate
estimate <- computeMREstimate(combined, instrumentTable)

# Optional: collect simple diagnostic flag summaries if sites shared them.
diagnosticFlagPaths <- c(
  "medusa_diagnostic_flags_site_A.csv",
  "medusa_diagnostic_flags_site_B.csv",
  "medusa_diagnostic_flags_site_C.csv"
)
if (all(file.exists(diagnosticFlagPaths))) {
  siteDiagnosticFlags <- lapply(diagnosticFlagPaths, read.csv, stringsAsFactors = FALSE)
}

# Optional: set this to NULL unless per-SNP CSVs were collected.
sensitivity <- NULL

# Optional: fixed-effect pool the per-SNP summaries before sensitivity analyses.
# Each site's per-SNP CSV should contain one row per SNP with beta_ZY and se_ZY.
perSnpPaths <- c(
  site_A = "medusa_per_snp_site_A.csv",
  site_B = "medusa_per_snp_site_B.csv",
  site_C = "medusa_per_snp_site_C.csv"
)
if (all(file.exists(perSnpPaths))) {
  perSnpBySite <- lapply(perSnpPaths, read.csv, stringsAsFactors = FALSE)

  stackedPerSnp <- do.call(rbind, perSnpBySite)
  splitPerSnp <- split(stackedPerSnp, stackedPerSnp$snp_id)

  perSnpPooled <- do.call(
    rbind,
    lapply(splitPerSnp, function(df) {
      weights <- 1 / (df$se_ZY^2)
      data.frame(
        snp_id = df$snp_id[1],
        effect_allele = df$effect_allele[1],
        other_allele = df$other_allele[1],
        eaf = df$eaf[1],
        beta_ZY = sum(weights * df$beta_ZY) / sum(weights),
        se_ZY = sqrt(1 / sum(weights)),
        beta_ZX = df$beta_ZX[1],
        se_ZX = df$se_ZX[1],
        pval_ZX = df$pval_ZX[1],
        stringsAsFactors = FALSE
      )
    })
  )

  sensitivity <- runSensitivityAnalyses(
    perSnpPooled,
    methods = c("IVW", "MREgger", "WeightedMedian", "LeaveOneOut"),
    engine = "internal"
  )
}

# Report
generateMRReport(
  mrEstimate = estimate,
  sensitivityResults = sensitivity,
  combinedProfile = combined,
  siteProfileList = siteProfiles,
  instrumentTable = instrumentTable,
  exposureLabel = "IL-6 signaling",
  outcomeLabel = "Colorectal cancer"
)

Secure File Transfer

Profile CSV files should be transferred securely between sites and coordinator. Options include:

SFTP with encrypted credentials
Institutional secure file sharing (e.g., Box, SharePoint with encryption)
OHDSI network file transfer protocols (if available)

The required profile CSV files are human-readable and typically very small (< 100 KB) because they contain only numeric grid values and summary statistics. Optional per-SNP summary CSVs are also compact. Using CSV ensures that every value leaving a site can be inspected and audited before transfer.

Coordinator Pre-flight Checklist

Before pooling, verify:

Every site used the same instruments.csv file.
Every site used the same betaGrid values.
Every imported profile has the expected site identifier and non-zero case count.
If per-SNP summaries were requested, each file uses the same SNP IDs and allele coding.

Handling Different OMOP Versions

Sites running OMOP CDM v5.3 vs v5.4 can participate together. The SQL templates in Medusa use only core CDM tables that are consistent across versions. If a site uses non-standard table names, these can be configured via function parameters.

Troubleshooting

“No persons have genotype data”

Verify the VARIANT_OCCURRENCE table exists and has data in the specified schema
Check that person_id values in VARIANT_OCCURRENCE overlap with the cohort
Ensure rs_id values match the instrument table’s snp_id values

“Profile likelihood is flat”

Instruments may be too weak at this site
Check F-statistics in diagnostics
Verify that genotype data is coded correctly (0/1/2)

“MLE is at grid boundary”

Expand the betaGrid range (e.g., seq(-5, 5, by = 0.01))
Ensure the same betaGrid is used at all sites

Weak instrument warning (F < 10)

The instrument explains very little outcome variance at this site
Consider excluding weak instruments or using more instruments
Results may be biased toward the null