
Federated Analysis Guide for Network Coordinators
Overview
This guide walks OHDSI network coordinators through setting up and running a federated Medusa analysis across multiple sites. The key principle: only site-level summary artifacts leave each site — no individual-level data is shared. The primary artifact is the profile log-likelihood vector, accompanied by site metadata and the allele-score definition needed to verify that every site fit the same score.
The code chunks in this vignette are templates, not executable examples. They are shown with eval = FALSE because every network needs to substitute its own connection details, cohort IDs, file paths, and governance controls.
What Data Leaves Each Site
| Shared with coordinator | NOT shared |
|---|---|
| Log-likelihood profile (numeric vector, ~600 numbers) | Individual genotypes |
| Number of cases and controls | Person-level outcomes |
| Optional diagnostic flag summary (logical values) | Covariate values |
| Optional per-SNP summary estimates for sensitivity analyses | Raw SNP-by-person data |
| Site identifier | Demographics |
The profile vector is a smooth curve that represents the aggregate statistical evidence at a site. It cannot be reverse-engineered to identify individuals. The required export already includes model-level flags such as low case count or an MLE at the grid boundary, plus the fixed allele-score definition so the coordinator can verify consistency across sites. If a network wants IVW, MR-Egger, or weighted-median sensitivity analyses, each site can also share one row per SNP containing beta_ZY and se_ZY. Those are still aggregate summaries, but they are optional and separate from the main pooled likelihood workflow.
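The statistical reason this works: because sites hold independent data, their log-likelihood curves add pointwise over the shared grid, and the pooled curve yields the network-wide estimate. A toy sketch in plain R (quadratic curves stand in for real site profiles; none of this is Medusa internals):

```r
# Toy illustration, NOT Medusa internals: independent sites' log-likelihoods
# add pointwise, so sharing each site's profile vector over a common grid is
# enough to recover the pooled maximum-likelihood estimate.
betaGrid <- seq(-3, 3, by = 0.01)

# Quadratic log-likelihood centred on a site's local estimate, with
# curvature ("info") standing in for that site's effective sample size.
siteProfile <- function(betaHat, info) -0.5 * info * (betaGrid - betaHat)^2

profiles <- list(
  site_A = siteProfile(0.45, 120),
  site_B = siteProfile(0.55, 300),
  site_C = siteProfile(0.50, 200)
)

pooled <- Reduce(`+`, profiles)      # pointwise sum over the shared grid
mle <- betaGrid[which.max(pooled)]   # 0.51: the information-weighted compromise
# 95% CI: grid points within qchisq(0.95, 1) / 2 (about 1.92) units of the max
ci <- range(betaGrid[pooled >= max(pooled) - qchisq(0.95, 1) / 2])
```

Note that each site's curve alone reveals only that site's aggregate evidence; summing them is all the coordinator ever needs to do.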
Site Setup Requirements
OMOP CDM Requirements
- OMOP CDM version 5.3 or 5.4
- Standard tables: PERSON, OBSERVATION_PERIOD, CONDITION_OCCURRENCE
- A cohort table with the outcome cohort pre-defined (e.g., via ATLAS)
OMOP Genomic Extension (VARIANT_OCCURRENCE)
Each site needs the VARIANT_OCCURRENCE table from the OMOP Genomic CDM. This table stores per-person variant calls. The minimal required columns are:
| Column | Type | Description |
|---|---|---|
| person_id | BIGINT | Links to the PERSON table |
| rs_id | VARCHAR(50) | dbSNP rs identifier (e.g., “rs2228145”) |
| genotype | VARCHAR(50) | Genotype call: VCF-style (“0/0”, “0/1”, “1/1”) or integer (“0”, “1”, “2”) |
Additional columns used when available:
| Column | Type | Description |
|---|---|---|
| reference_allele | VARCHAR(255) | Reference allele (used for allele harmonization) |
| alternate_allele | VARCHAR(255) | Alternate allele (used for allele harmonization) |
If the genomic extension tables are in a different schema from the main CDM, specify it via the genomicDatabaseSchema parameter (defaults to cdmDatabaseSchema).
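Because the genotype column allows two encodings, sites often need to normalise calls into an effect-allele dosage of 0, 1, or 2 before scoring. A minimal sketch of that conversion (an illustrative helper, not a Medusa function):

```r
# Sketch (assumed coding, not Medusa internals): normalise the two genotype
# encodings allowed above into an effect-allele dosage of 0, 1, or 2.
genotypeToDosage <- function(genotype) {
  vapply(genotype, function(g) {
    if (grepl("^[012]$", g)) {
      as.numeric(g)                             # already an integer dosage
    } else if (grepl("^[01][/|][01]$", g)) {
      alleles <- strsplit(g, "[/|]")[[1]]
      as.numeric(sum(alleles == "1"))           # count alternate alleles in a VCF call
    } else {
      NA_real_                                  # unrecognised call, e.g. missing "./."
    }
  }, numeric(1), USE.NAMES = FALSE)
}

genotypeToDosage(c("0/0", "0/1", "1|1", "2", "./."))
# → 0 1 2 2 NA
```

Handling both phased (`|`) and unphased (`/`) separators, and treating anything else as missing, avoids silently miscounting effect alleles at sites with mixed pipelines.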
R Package Dependencies
# Required at each site
remotes::install_github("OHDSI/Medusa")
# This installs: DatabaseConnector, SqlRender, Cyclops, FeatureExtraction
Site Analysis Script Template
Send this script to each participating site, customized with their local connection details:
# ============================================================
# Medusa Site Analysis Script
# Study: [YOUR STUDY NAME]
# Site: [SITE NAME]
# Date: [DATE]
# ============================================================
library(Medusa)
# ----- Site-specific configuration -----
connectionDetails <- DatabaseConnector::createConnectionDetails(
dbms = "postgresql", # Change to your DBMS
server = "localhost/ohdsi", # Change to your server
user = "ohdsi_user", # Change to your user
password = keyring::key_get("ohdsi_db") # Use secure credential storage
)
cdmDatabaseSchema <- "cdm" # Change to your CDM schema
cohortDatabaseSchema <- "results" # Change to your results schema
cohortTable <- "cohort" # Change if different
outcomeCohortId <- 1234 # Provided by coordinator
genomicDatabaseSchema <- "genomics" # Schema with VARIANT_OCCURRENCE (default: cdmDatabaseSchema)
siteId <- "site_A" # Unique identifier for this site
# ----- Load instrument table (provided by coordinator) -----
instrumentTable <- read.csv("instruments.csv", stringsAsFactors = FALSE)
# ----- Step 1: Build cohort -----
cohortData <- buildMRCohort(
connectionDetails = connectionDetails,
cdmDatabaseSchema = cdmDatabaseSchema,
cohortDatabaseSchema = cohortDatabaseSchema,
cohortTable = cohortTable,
outcomeCohortId = outcomeCohortId,
instrumentTable = instrumentTable,
genomicDatabaseSchema = genomicDatabaseSchema,
washoutPeriod = 365,
excludePriorOutcome = TRUE
)
# ----- Step 2: Build covariates -----
covariateData <- buildMRCovariates(
connectionDetails = connectionDetails,
cdmDatabaseSchema = cdmDatabaseSchema,
cohortDatabaseSchema = cohortDatabaseSchema,
cohortTable = cohortTable,
outcomeCohortId = outcomeCohortId
)
# ----- Step 3: Run diagnostics -----
diagnostics <- runInstrumentDiagnostics(
cohortData = cohortData,
covariateData = covariateData,
instrumentTable = instrumentTable
)
# ----- Step 4: Fit outcome model -----
# Use the SAME betaGrid as all other sites
betaGrid <- seq(-3, 3, by = 0.01)
profile <- fitOutcomeModel(
cohortData = cohortData,
covariateData = covariateData,
instrumentTable = instrumentTable,
betaGrid = betaGrid,
siteId = siteId,
analysisType = "alleleScore"
)
# ----- Step 5: Export and share -----
# Required for the primary federated MR estimate:
exportSiteProfile(profile, outputDir = ".", prefix = "medusa")
# Optional: share a one-line-per-check summary of the richer diagnostics object.
utils::write.csv(
data.frame(
check = names(diagnostics$diagnosticFlags),
flag = unname(diagnostics$diagnosticFlags),
stringsAsFactors = FALSE
),
sprintf("medusa_diagnostic_flags_%s.csv", siteId),
row.names = FALSE
)
# Optional: only run this block if the coordinator requested per-SNP summaries
# for IVW / MR-Egger / weighted-median sensitivity analyses.
profilePerSnp <- fitOutcomeModel(
cohortData = cohortData,
covariateData = covariateData,
instrumentTable = instrumentTable,
betaGrid = betaGrid,
siteId = siteId,
analysisType = "perSNP"
)
utils::write.csv(
profilePerSnp$perSnpEstimates,
sprintf("medusa_per_snp_%s.csv", siteId),
row.names = FALSE
)
message("Analysis complete. Share the required profile CSVs and, if requested, the per-SNP summary CSV.")
Coordinator Pooling Script
After collecting profile CSV files from all sites:
library(Medusa)
# Load instrument table
instrumentTable <- read.csv("instruments.csv", stringsAsFactors = FALSE)
# Import site profiles from CSV files
siteProfiles <- list(
site_A = importSiteProfile("medusa_profile_site_A.csv"),
site_B = importSiteProfile("medusa_profile_site_B.csv"),
site_C = importSiteProfile("medusa_profile_site_C.csv")
)
# Pool
combined <- poolLikelihoodProfiles(siteProfiles)
# Estimate
estimate <- computeMREstimate(combined, instrumentTable)
# Optional: collect simple diagnostic flag summaries if sites shared them.
diagnosticFlagPaths <- c(
"medusa_diagnostic_flags_site_A.csv",
"medusa_diagnostic_flags_site_B.csv",
"medusa_diagnostic_flags_site_C.csv"
)
if (all(file.exists(diagnosticFlagPaths))) {
siteDiagnosticFlags <- lapply(diagnosticFlagPaths, read.csv, stringsAsFactors = FALSE)
}
# Optional: set this to NULL unless per-SNP CSVs were collected.
sensitivity <- NULL
# Optional: fixed-effect pool the per-SNP summaries before sensitivity analyses.
# Each site's per-SNP CSV should contain one row per SNP with beta_ZY and se_ZY.
perSnpPaths <- c(
site_A = "medusa_per_snp_site_A.csv",
site_B = "medusa_per_snp_site_B.csv",
site_C = "medusa_per_snp_site_C.csv"
)
if (all(file.exists(perSnpPaths))) {
perSnpBySite <- lapply(perSnpPaths, read.csv, stringsAsFactors = FALSE)
stackedPerSnp <- do.call(rbind, perSnpBySite)
splitPerSnp <- split(stackedPerSnp, stackedPerSnp$snp_id)
perSnpPooled <- do.call(
rbind,
lapply(splitPerSnp, function(df) {
weights <- 1 / (df$se_ZY^2)
data.frame(
snp_id = df$snp_id[1],
effect_allele = df$effect_allele[1],
other_allele = df$other_allele[1],
eaf = df$eaf[1],
beta_ZY = sum(weights * df$beta_ZY) / sum(weights),
se_ZY = sqrt(1 / sum(weights)),
beta_ZX = df$beta_ZX[1],
se_ZX = df$se_ZX[1],
pval_ZX = df$pval_ZX[1],
stringsAsFactors = FALSE
)
})
)
sensitivity <- runSensitivityAnalyses(
perSnpPooled,
methods = c("IVW", "MREgger", "WeightedMedian", "LeaveOneOut"),
engine = "internal"
)
}
# Report
generateMRReport(
mrEstimate = estimate,
sensitivityResults = sensitivity,
combinedProfile = combined,
siteProfileList = siteProfiles,
instrumentTable = instrumentTable,
exposureLabel = "IL-6 signaling",
outcomeLabel = "Colorectal cancer"
)
Secure File Transfer
Profile CSV files should be transferred securely between sites and coordinator. Options include:
- SFTP with encrypted credentials
- Institutional secure file sharing (e.g., Box, SharePoint with encryption)
- OHDSI network file transfer protocols (if available)
The required profile CSV files are human-readable and typically very small (< 100 KB) because they contain only numeric grid values and summary statistics. Optional per-SNP summary CSVs are also compact. Using CSV ensures that every value leaving a site can be inspected and audited before transfer.
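One way to make that audit concrete is a short check run before transfer. The column names below (`beta`, `logLik`) are illustrative assumptions, not the documented export schema; the point is that a plain CSV can be opened and verified line by line before it leaves the site:

```r
# Sketch of a pre-transfer audit (column names are assumptions for
# illustration): confirm the file holds only grid values and summaries.
profilePath <- file.path(tempdir(), "medusa_profile_site_A.csv")
mockProfile <- data.frame(beta = seq(-3, 3, by = 0.01), logLik = 0)
write.csv(mockProfile, profilePath, row.names = FALSE)

audit <- read.csv(profilePath)
stopifnot(
  nrow(audit) == 601,                                      # one row per grid point
  !any(grepl("person", names(audit), ignore.case = TRUE))  # no person-level columns
)
```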
Coordinator Pre-flight Checklist
Before pooling, verify:
- Every site used the same `instruments.csv` file.
- Every site used the same `betaGrid` values.
- Every imported profile has the expected site identifier and a non-zero case count.
- If per-SNP summaries were requested, each file uses the same SNP IDs and allele coding.
Handling Different OMOP Versions
Sites running OMOP CDM v5.3 vs v5.4 can participate together. The SQL templates in Medusa use only core CDM tables that are consistent across versions. If a site uses non-standard table names, these can be configured via function parameters.
Troubleshooting
“No persons have genotype data”
- Verify the VARIANT_OCCURRENCE table exists and has data in the specified schema
- Check that `person_id` values in VARIANT_OCCURRENCE overlap with the cohort
- Ensure `rs_id` values match the instrument table’s `snp_id` values
“Profile likelihood is flat”
- Instruments may be too weak at this site
- Check F-statistics in diagnostics
- Verify that genotype data is coded correctly (0/1/2)
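A quick way to inspect instrument strength from per-SNP exposure summaries is the approximate F-statistic, (beta_ZX / se_ZX)^2, with the conventional weak-instrument threshold of 10. The numbers below are illustrative only:

```r
# Per-SNP instrument strength from exposure-association summaries
# (illustrative values; the rule of thumb flags F < 10 as weak).
snpStrength <- data.frame(
  snp_id  = c("rs2228145", "rs4129267"),
  beta_ZX = c(0.30, 0.05),
  se_ZX   = c(0.05, 0.04)
)
snpStrength$F    <- (snpStrength$beta_ZX / snpStrength$se_ZX)^2  # 36, 1.5625
snpStrength$weak <- snpStrength$F < 10                           # FALSE, TRUE
```

If most instruments at a site are weak, its profile will be nearly flat, and the pooled estimate will simply draw little information from that site rather than being biased by it.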