Methodology

How we make synthetic healthcare data you can actually trust.

This page is written for the skeptic. We build one coherent synthetic person — an archetype — then render their entire data trail from it. We'll show you exactly how those people and records are generated, which real-world factors the engine reproduces, how we measure fidelity like an eval, and — just as important — where the data falls short of the real thing.

The generation approach

The archetype is the primitive. Everything renders from it.

We never synthesize a record in isolation. The AI supplies clinical knowledge; a deterministic engine turns that knowledge into people, journeys, and finally records. The split matters: generation is creative, rendering is reproducible. And the people come first — one synthetic person is the single source from which every data domain is rendered.

1 · Generative

AI generates archetypes

A large language model, grounded in published clinical knowledge and prevalence data, drafts a library of clinically coherent patient archetypes: condition sets, comorbidity patterns, demographics. A diabetic with CHF and stage-3 CKD reads like a real one because the model knows those conditions travel together.

2 · Deterministic

The engine instantiates a population

A deterministic, seeded engine instantiates those archetypes into a full synthetic population, perturbing demographics, condition burden, and a per-member utilization propensity so that no two members are identical and the cohort spans the full range from well members to super-utilizers.

3 · Longitudinal

Each member lives a 36-month journey

Every member is walked month-by-month through enrollment, condition progression, acute episodes, post-acute pathways, and persistence. High-utilizers in the first half tend to stay high-utilizers in the second: the journey, not a per-row dice roll, decides what happens next.

4 · Reproducible

Journeys render any data domain

One journey is a render source, not a single file. Today it emits eligibility, medical claims, Rx fills, and revenue — with realistic coding, adjustment chains, paid-date lag, and Part D benefit phases. The same journey is what will render encounters, labs, and quality gaps. Same seed in, byte-identical output out: the rendering is fully reproducible.

To be precise about the division of labor: the AI generates the archetypes and the shape of the journeys. Everything downstream — instantiation, the month-by-month walk, and the rendering of each data domain — runs through a seeded, deterministic engine. Given the same seed and config snapshot, the output is byte-identical, every time.
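To make the reproducibility claim concrete, here is a minimal sketch of what "same seed in, byte-identical output out" means in practice. It is not the production engine: the function names, the fields, and the JSON serialization are all hypothetical, chosen only to show the pattern of a locally seeded generator plus a stable serialization, verified with a hash.

```python
import hashlib
import json
import random


def render_members(seed: int, n_members: int) -> bytes:
    """Render a toy population to bytes using only a locally seeded RNG."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    rows = []
    for member_id in range(n_members):
        rows.append({
            "member_id": member_id,
            "age": rng.randint(65, 95),  # hypothetical field
            "propensity": round(rng.lognormvariate(0.0, 0.6), 6),  # hypothetical field
        })
    # sort_keys and fixed separators keep the serialization itself stable.
    return json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")


def digest(seed: int, n_members: int = 100) -> str:
    """SHA-256 of the rendered bytes; equal digests mean byte-identical output."""
    return hashlib.sha256(render_members(seed, n_members)).hexdigest()
```

Comparing `digest(42)` across two runs is the kind of check a buyer could apply to any deterministic renderer: if the digests differ for the same seed and config, something in the pipeline is not deterministic.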

One archetype

Diabetic, CHF, stage-3 CKD · 71 · full-dual

A single synthetic person with one coherent clinical life and a 36-month journey.

Renders every data domain — all internally consistent

Eligibility & enrollment (live) · Medical claims (live) · Pharmacy (live) · Revenue & payment (live) · Encounters (soon) · Labs & results (soon) · Quality measures (soon)

The claim that bills the CHF admit, the eGFR that tracks the CKD, the A1c that tracks the diabetes, the open quality gap, and the risk-adjusted revenue all come from the same person — so they agree by construction, not by reconciliation.

Calibration · the Medicare Advantage release

Calibrated to published benchmarks — not real claims.

The sections that follow drill into our first line of business, Medicare Advantage: the same archetype engine, instantiated and calibrated for one population. The engine's aggregate outputs are tuned to match published statistics: average risk score, medical loss ratio, PMPM by service bucket, utilization per 1,000, and HCC prevalence. Every calibration target is anchored to an explicitly cited public source and checked automatically at release.

Read this part carefully. We calibrate to published aggregate statistics. The engine is not trained on real row-level claims, and no member's real record is anywhere in the pipeline. Learning patterns directly from licensed real claims is on the roadmap, and we say so plainly below.

Benchmark sources · Public & cited

CMS Advance & Final Rate Notices
Risk-score normalization, coding-intensity adjustment, county base rates, sequestration.
MedPAC March Report (MA chapter)
Utilization per 1,000 — IP admits, ER visits, PCP and specialist office visits, spending mix.
CMS HCC Model software & coefficient tables
V24 and V28 risk-score construction and the blended phase-in spread.
CMS Report to Congress (Risk Adjustment)
HCC prevalence targets for the top conditions across the population.
Public MA insurer 10-Ks
Medical loss ratio and PMPM economics — Humana, UnitedHealth, Elevance, CVS/Aetna, Centene.
CMS Part D Final Rule & PCUG
Benefit-phase structure, the IRA OOP cap, LIS cost-sharing, and MMR reason codes.

Factors we model · the Medicare Advantage release

Each release reproduces more of the messy truth.

The core of every version bump is factor coverage — the number of real-world phenomena the engine actually reproduces. Below is what v2 “Meridian” models for Medicare Advantage and how far it moved from v1. Each new factor is validated against clinical guidance or a published rate wherever one exists.

Acute episode library

v1 · ~50 pathways → v2 · ~1,000 pathways

Coherent claim bundles for acute events — CHF exacerbation, AMI, sepsis, hip fracture, stroke — that fire stochastically and emit an ER → IP → discharge-meds → follow-up sequence with consistent dates and diagnoses.

Validated against: clinical episode pathways & published event rates.

Member-level utilization persistence

v1 · Independent per month → v2 · Sticky propensity

A latent per-member propensity multiplier, drawn once at creation, governs every utilization draw — so high-utilizers persist. Targets H1→H2 spend correlation of 0.55–0.70 and top-10% stickiness of 40–55%.

Validated against: claims-persistence literature.
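The persistence mechanism can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's code: the lognormal shapes, the sigma values, and the function names are hypothetical. The point it shows is structural: a multiplier drawn once per member correlates first-half and second-half spend, while independent monthly draws do not.

```python
import random
import statistics


def simulate_half_year_spend(seed, n_members=2000, sticky=True):
    """Return (h1, h2) lists of per-member spend for two six-month halves."""
    rng = random.Random(seed)
    h1, h2 = [], []
    for _ in range(n_members):
        # Drawn once at member creation; sigma=0.8 is an illustrative choice.
        propensity = rng.lognormvariate(0.0, 0.8) if sticky else 1.0
        # Every monthly draw is scaled by the same latent propensity.
        months = [propensity * rng.lognormvariate(0.0, 1.0) for _ in range(12)]
        h1.append(sum(months[:6]))
        h2.append(sum(months[6:]))
    return h1, h2


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```

Running `pearson(*simulate_half_year_spend(seed=1, sticky=True))` against the `sticky=False` variant shows the difference between the v1 and v2 behavior described above; the exact value depends on the assumed distributions and would need tuning to land in any target band.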

5/50 cost concentration & a well cohort

v1 · Flat curve → v2 · Real right-skew

A carved-out well cohort (no HCCs, minimal care) and a ~1% super-utilizer tail produce a realistic concentration curve: top 1% ≈ 25–30% of spend, top 5% ≈ 50%, and 3–6% of members with $0 medical spend over three years.

Validated against: the 5/50 rule (MEPS / MedPAC).
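The concentration figures above are easy to check against any member-level spend file. The helpers below are a sketch with hypothetical names; they assume you have already summed spend to one number per member.

```python
def top_share(spend, top_fraction):
    """Share of total spend attributable to the highest-spending fraction of members."""
    ordered = sorted(spend, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))  # at least one member
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


def zero_spend_rate(spend):
    """Fraction of members with exactly zero spend over the period."""
    return sum(1 for s in spend if s == 0) / len(spend)
```

For example, `top_share(member_spend, 0.05)` is the number to compare with the "top 5% ≈ 50%" target, and `zero_spend_rate(member_spend)` with the 3–6% zero-spend target.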

Monthly seasonality

v1 · Uniform months → v2 · Q1 peak, Q4 trough

Month-of-year multipliers on ER, IP, surgery, and office visits — a Q1 flu/respiratory peak (~1.10–1.15) and a December trough (~0.90–0.92) — so service dates carry the winter shape real claims have.

Validated against: seasonal utilization patterns.
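One way to picture month-of-year multipliers is as a lookup table applied to a flat monthly rate. In the sketch below only the Q1 and December values echo the ranges quoted above; every other month's value, the normalization step, and the function names are assumptions made for illustration.

```python
# Hypothetical multipliers: Q1 and December follow the quoted ranges,
# the remaining months are placeholders.
RAW_MULTIPLIERS = {
    1: 1.15, 2: 1.12, 3: 1.10, 4: 1.00, 5: 0.98, 6: 0.96,
    7: 0.95, 8: 0.95, 9: 0.97, 10: 0.99, 11: 0.97, 12: 0.91,
}


def normalized_multipliers(raw):
    """Rescale so the 12 multipliers average to 1.0, leaving annual volume unchanged."""
    mean = sum(raw.values()) / len(raw)
    return {month: value / mean for month, value in raw.items()}


def expected_monthly_visits(annual_visits_per_1000, month, multipliers):
    """Flat monthly rate (annual / 12) scaled by that month's multiplier."""
    return annual_visits_per_1000 / 12 * multipliers[month]
```

Normalizing is a design choice: it lets the seasonal shape be tuned without disturbing a separately calibrated annual utilization target.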

Dual-vs-non-dual MLR

v1 · Inverted → v2 · Duals higher

A social-determinants cost multiplier on full-dual members corrects v1's inverted loss ratios. Dual MLR now runs 2–4 points above non-dual, matching real MA disclosures.

Validated against: public MA insurer segment disclosures.
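The dual versus non-dual comparison reduces to two segment-level loss ratios and their difference in percentage points. A minimal sketch, assuming member records with hypothetical `is_dual`, `claims`, and `revenue` fields:

```python
def segment_mlr(members, is_dual):
    """Medical loss ratio for one segment: total claims over total revenue."""
    segment = [m for m in members if m["is_dual"] == is_dual]
    revenue = sum(m["revenue"] for m in segment)
    claims = sum(m["claims"] for m in segment)
    return claims / revenue


def dual_mlr_gap_pp(members):
    """Dual MLR minus non-dual MLR, in percentage points."""
    return (segment_mlr(members, True) - segment_mlr(members, False)) * 100
```

A positive result means duals run a higher loss ratio than non-duals; a negative one would indicate the inverted relationship described for v1.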

HCC-driven journeys & comorbidity

v1 · Independent HCCs → v2 · Correlated triads

Conditions are sampled with a pairwise correlation structure (CHF↔CKD, diabetes↔CKD, diabetes↔vascular) and distinct diagnosis pools per HCC, targeting realistic comorbidity — DM+CKD ≈ 10%, the DM+CKD+CHF triad ≈ 3–4%.

Validated against CMS Report to Congress prevalence.
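To illustrate why correlated sampling matters, the sketch below uses a one-factor Gaussian copula: a shared latent "frailty" term pushes all three conditions up or down together. This is a stand-in technique chosen for brevity, not a description of the engine's actual correlation structure, and the prevalences and loading are hypothetical.

```python
import random
from statistics import NormalDist

# Hypothetical marginal prevalences and factor loading, for illustration only.
PREVALENCE = {"diabetes": 0.25, "ckd": 0.20, "chf": 0.12}
LOADING = 0.6  # implied pairwise latent correlation is LOADING ** 2

# A condition is present when its latent score falls below this threshold,
# which preserves each marginal prevalence exactly in expectation.
THRESHOLD = {name: NormalDist().inv_cdf(p) for name, p in PREVALENCE.items()}


def sample_conditions(rng):
    """Draw one member's condition flags with positive pairwise correlation."""
    shared = rng.gauss(0.0, 1.0)  # latent factor common to all conditions
    flags = {}
    for name in PREVALENCE:
        own = rng.gauss(0.0, 1.0)  # condition-specific noise
        latent = LOADING * shared + (1 - LOADING ** 2) ** 0.5 * own
        flags[name] = latent < THRESHOLD[name]
    return flags


def prevalence(population, *names):
    """Share of members who have every named condition."""
    return sum(all(m[n] for n in names) for m in population) / len(population)
```

With independent draws the joint rate of two conditions would be the product of their marginals; comparing `prevalence(pop, "diabetes", "ckd")` with that product shows how much co-occurrence the shared factor adds.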

Rx refill chains & Part D phases

v1 · Random cost-sharing → v2 · Phase-aware

NDC-level fills with refill chains and a running per-member OOP accumulator that flips cost-sharing across deductible → initial → catastrophic phases, with LIS caps and the IRA's $2,000 annual out-of-pocket cap that took effect in 2025.

Validated against CMS Part D Final Rule.
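The accumulator logic can be sketched as follows. The dollar parameters are the 2025 defined standard benefit values as we understand them and should be treated as assumptions; the sketch covers a non-LIS member only and ignores the finer points of true out-of-pocket accounting. Function names are hypothetical.

```python
DEDUCTIBLE = 590.00         # assumed 2025 standard deductible
OOP_CAP = 2000.00           # IRA annual out-of-pocket cap
INITIAL_COINSURANCE = 0.25  # assumed member share in the initial phase


def phase(oop_so_far):
    """Benefit phase implied by the member's accumulated out-of-pocket spend."""
    if oop_so_far < DEDUCTIBLE:
        return "deductible"
    if oop_so_far < OOP_CAP:
        return "initial"
    return "catastrophic"


def member_cost_share(drug_cost, oop_so_far):
    """Member liability for one fill, splitting a fill that straddles phases."""
    remaining, oop, paid = drug_cost, oop_so_far, 0.0
    if oop < DEDUCTIBLE and remaining > 0:
        chunk = min(remaining, DEDUCTIBLE - oop)  # member pays 100% here
        paid, oop, remaining = paid + chunk, oop + chunk, remaining - chunk
    if oop < OOP_CAP and remaining > 0:
        # Coinsurance, truncated at the cap; any cost beyond it is catastrophic ($0).
        paid += min(remaining * INITIAL_COINSURANCE, OOP_CAP - oop)
    return round(paid, 2)


def run_year(fill_costs):
    """Walk a member's fills in order, returning (phase at fill start, member pays)."""
    oop, history = 0.0, []
    for cost in fill_costs:
        pays = member_cost_share(cost, oop)
        history.append((phase(oop), pays))
        oop += pays
    return history
```

The key property is that cost-sharing on each fill depends on everything the member has already paid that year, which is why independent per-row cost-sharing cannot reproduce benefit phases.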

Paid-date lag & run-out

v1 · Tight ~20-day lag → v2 · Full IBNR triangle

Bucket-specific paid-date development — IP and SNF lag longer than professional and Rx — producing a lag triangle smooth enough for chain-ladder IBNR, with a genuine 6–24 month run-out tail.

Validated against MA paid-development patterns.
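For readers who want to test the lag triangle, a basic volume-weighted chain-ladder fits in a few lines. This is the textbook method in its simplest form, with hypothetical function names; it assumes a cumulative paid triangle where each row is a service period and later rows have fewer observed lags.

```python
def development_factors(triangle):
    """Volume-weighted age-to-age factors from a cumulative paid triangle."""
    n_lags = len(triangle[0])  # the oldest service period is fully developed
    factors = []
    for lag in range(n_lags - 1):
        # Only rows that have been observed at both this lag and the next.
        rows = [row for row in triangle if len(row) > lag + 1]
        factors.append(sum(r[lag + 1] for r in rows) / sum(r[lag] for r in rows))
    return factors


def ultimate(row, factors):
    """Project one service period's latest cumulative paid to ultimate."""
    value = row[-1]
    for factor in factors[len(row) - 1:]:
        value *= factor
    return value


def ibnr(triangle):
    """Total incurred-but-not-reported: projected ultimate minus paid to date."""
    factors = development_factors(triangle)
    return sum(ultimate(row, factors) - row[-1] for row in triangle)
```

A triangle that is too tight or too noisy produces unstable factors, which is the practical reason paid-date development has to be modeled rather than drawn independently per claim.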

Adjustments, denials & reversals

v1 · Coarse status set → v2 · 9+ statuses, linked chains

A full claim-process enumeration — partial pays, denials, pends, adjustments, reversals, reprocessing — with every non-original row linked to its parent so adjustment chains reconcile to net plan liability.

Validated against Claim-adjudication process modeling.
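Reconciling an adjustment chain means following each row's parent link back to the original claim and summing signed paid amounts. The sketch below assumes a hypothetical row layout (`claim_id`, `parent_id`, `paid_amount`) in which reversals carry negative amounts; the delivered schema may name or sign these fields differently.

```python
from collections import defaultdict


def net_liability(rows):
    """Net paid per original claim, summing every linked adjustment row."""
    parent_of = {row["claim_id"]: row["parent_id"] for row in rows}

    def original_of(claim_id):
        seen = set()
        while parent_of[claim_id] is not None:
            if claim_id in seen:
                raise ValueError(f"cycle in adjustment chain at {claim_id}")
            seen.add(claim_id)
            claim_id = parent_of[claim_id]
            if claim_id not in parent_of:
                raise ValueError(f"orphan row: parent {claim_id} not found")
        return claim_id

    totals = defaultdict(float)
    for row in rows:
        totals[original_of(row["claim_id"])] += row["paid_amount"]
    return dict(totals)
```

The orphan and cycle checks double as data-quality tests: a file in which every non-original row links to a parent should never trigger either error.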

The fidelity framework

We treat fidelity like an eval.

Synthetic healthcare data is only worth buying if it matches reality — so we measure it the way you'd measure a model. Every release runs an automated credibility audit that compares dozens of metrics against their benchmarks and rolls them up into a composite fidelity score from 0 to 100. The audit exits non-zero if any headline metric drifts more than its tolerance, so a regression can't ship silently. The scorecard below is the Medicare Advantage release; every line of business gets its own audit against its own benchmarks.

Every purchased dataset carries its own full audit in the report bundle — target, actual, delta, and source citation for each metric — so you can re-derive the score against the exact files you received.
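The gating behavior, a non-zero exit when a headline metric falls outside its band, can be sketched as below. The class and function names are hypothetical, the bands reuse the illustrative scorecard ranges from this page, and the sketch deliberately does not attempt the composite 0 to 100 score, whose weighting is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricCheck:
    name: str
    actual: float
    low: float   # lower edge of the accepted band
    high: float  # upper edge of the accepted band

    def passed(self) -> bool:
        return self.low <= self.actual <= self.high


def audit_exit_code(checks):
    """Return 0 if every metric is inside its band, 1 otherwise."""
    failures = [c for c in checks if not c.passed()]
    for c in failures:
        print(f"FAIL {c.name}: {c.actual} outside [{c.low}, {c.high}]")
    return 1 if failures else 0


# Illustrative values taken from the example scorecard on this page.
SCORECARD = [
    MetricCheck("avg_risk_score", 1.01, 1.00, 1.10),
    MetricCheck("medical_loss_ratio_pct", 88.0, 85.0, 92.0),
    MetricCheck("ip_admits_per_1000", 274.0, 250.0, 300.0),
    MetricCheck("er_visits_per_1000", 580.0, 550.0, 650.0),
]
```

In a release pipeline the return value would typically be passed to `sys.exit`, so that a failing check stops the build instead of producing a warning someone has to notice.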

Example scorecard · v2 “Meridian” · 88 / 100

Metric · Ours · Benchmark
Avg risk score · 1.01 · 1.00–1.10
Medical loss ratio · 88% · 85–92%
IP admits / 1,000 · 274 · 250–300
ER visits / 1,000 · 580 · 550–650
Dual vs non-dual MLR · +6 pp · dual higher
Q1 seasonality index · 1.07 · ~1.10

Figures are illustrative. Each dataset's real audit ships in its report bundle.

Honest limitations

Synthetic is not real. Here's where it shows.

We'd rather you trust the data because we told you its edges than discover them mid-project. These are the limitations we hold ourselves accountable to disclosing.

Tail events are approximations
Rare-disease cohorts and extreme catastrophic claims are modeled to plausible rates, not drawn from observed reality. Use them for shape, not for pricing a specific rare-disease bundle.
Provider networks are stylized
Provider attribution and referral patterns are simplified. Network-adequacy, steerage, and true referral graphs are not yet faithfully reproduced.
Calibrated to aggregates, not your book
Because we calibrate to published national statistics, the data won't capture a specific plan's geography, benefit quirks, or contract idiosyncrasies.
Not for production decisions
Not for regulatory filings, rate setting, bid pricing, claims adjudication, or clinical decisions. It is built for analytics, model development, benchmarking, and product testing.
One line of business, four domains — today
Shipping now is Medicare Advantage across eligibility, medical claims, Rx, and revenue. Encounters, labs, quality measures, and other lines of business are expanding: they run on the same archetype engine but are not yet generally available. Don't plan around a render target until it's marked live.

Available now vs expanding

One engine, every render target — shipped honestly.

The architecture renders any data domain across any line of business from the same archetypes. We label exactly what that means in practice today, so you never plan around something that isn't live yet.

Data domains · 4 live

Eligibility & enrollment
Member-month enrollment, demographics, plan/program, and benefit status — the spine every other domain links to.
Available now
Medical claims
Line-level institutional + professional claims: diagnoses, procedures, settings, allowed/paid, and realistic adjustment chains.
Available now
Pharmacy
NCPDP-grade drug fills with refill chains, benefit phases, formulary tiers, and net-of-rebate economics.
Available now
Revenue & payment
Payer-side revenue the way a plan receives it — capitation, risk scores, and the factors behind every dollar.
Available now
Encounters
Encounter-level utilization independent of billing — the visit-and-service record health systems and value-based programs run on.
Expanding
Labs & results
Ordered tests with realistic result values trended to each member's conditions — A1c that tracks the diabetic, eGFR that tracks the CKD.
Expanding
Quality measures
Measure-ready numerators, denominators, and gaps (HEDIS-style / Stars) rendered from each member's actual care.
Expanding

Lines of business · MA live

Medicare Advantage
Our first line, shipping today: HCC risk adjustment, MMR revenue, Part D, dual/LIS dynamics.
Available now
Medicare FFS
Traditional fee-for-service Medicare — the benchmark population behind most ACO and VBC work.
Expanding
Commercial / employer
Working-age commercial populations — different age mix, benefit design, and cost curve.
Expanding
Medicaid
Managed Medicaid with its own eligibility churn, demographics, and program structure.
Expanding
ACA / exchange
Marketplace populations with HHS-HCC risk adjustment and metal-tier benefit design.
Expanding

Expanding render targets reuse the same archetypes and the same deterministic engine — and each ships only once it has its own calibration to published benchmarks and its own fidelity audit. None are generally available until marked live.

Roadmap

Where the fidelity goes next.

Beyond widening the render targets above, the biggest item addresses the calibration limitation directly: today we calibrate to published aggregates. Next, we learn patterns directly from licensed real, row-level data.

Training on real, row-level data (v3)

Roadmap

Train on licensed real claims to derive distributions and pathways directly from observed data — moving from calibrated-to-aggregates toward learned-from-reality. No real record appears in today's pipeline; this is the planned v3 shift.

New domains: encounters, labs, quality

Roadmap

Render encounter-level utilization, ordered labs with condition-trended results, and HEDIS-style quality gaps from the same archetypes — each calibrated and audited before it goes live.

New lines of business

Roadmap

Medicare FFS, commercial, Medicaid, and ACA / exchange populations on the same reproducible engine — each with its own age mix, benefit design, risk model, and cost curve.

SNP cohort modeling

Roadmap

Explicit D-SNP, C-SNP, and I-SNP structures with their distinct utilization, revenue, and plan/contract characteristics.

Provider continuity & referral networks

Roadmap

Stable attributed PCPs and specialist panels, referral graphs, and provider-level continuity to support real VBC and network analysis.

Readmission & post-acute pathways

Roadmap

30-day readmission, SNF, and home-health pathways calibrated to clinical literature targets (CHF ~22%, AMI ~17%, ortho SNF use ~60–80%).

Don't take our word for the fidelity.

Download the 1,000-member sample — full schema, full credibility audit, no signup. Run your own checks against the numbers on this page.