Methodology
How we make synthetic healthcare data you can actually trust.
This page is written for the skeptic. We build one coherent synthetic person — an archetype — then render their entire data trail from it. We'll show you exactly how those people and records are generated, which real-world factors the engine reproduces, how we measure fidelity like an eval, and — just as important — where the data falls short of the real thing.
The generation approach
The archetype is the primitive. Everything renders from it.
We never synthesize a record in isolation. The AI supplies clinical knowledge; a deterministic engine turns that knowledge into people, journeys, and finally records. The split matters: generation is creative, rendering is reproducible. And the people come first — one synthetic person is the single source from which every data domain is rendered.
AI generates archetypes
A large language model, grounded in published clinical knowledge and prevalence data, drafts a library of clinically-coherent patient archetypes — condition sets, comorbidity patterns, demographics. A diabetic with CHF and stage-3 CKD reads like a real one because the model knows those conditions travel together.
The engine instantiates a population
A deterministic, seeded engine instantiates those archetypes into a full synthetic population, perturbing demographics, condition burden, and a per-member utilization propensity so that no two members are identical and the cohort spans the full range from well members to super-utilizers.
Each member lives a 36-month journey
Every member is walked month-by-month through enrollment, condition progression, acute episodes, post-acute pathways, and persistence. High-utilizers in the first half stay high-utilizers in the second — the journey, not a per-row dice roll, decides what happens next.
Journeys render any data domain
One journey is a render source, not a single file. Today it emits eligibility, medical claims, Rx fills, and revenue — with realistic coding, adjustment chains, paid-date lag, and Part D benefit phases. The same journey is what will render encounters, labs, and quality gaps. Same seed in, byte-identical output out: the rendering is fully reproducible.
To be precise about the division of labor: the AI generates the archetypes and the shape of the journeys. Everything downstream — instantiation, the month-by-month walk, and the rendering of each data domain — runs through a seeded, deterministic engine. Given the same seed and config snapshot, the output is byte-identical, every time.
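The reproducibility claim is one a reader can check mechanically. The sketch below is not the engine; it is a minimal illustration, assuming a hypothetical `render_population` function and made-up column names, of how a privately seeded generator produces output whose hash is identical across runs with the same seed.

```python
import hashlib
import random


def render_population(seed: int, n_members: int = 5) -> bytes:
    """Hypothetical stand-in for a seeded renderer: seed in, CSV bytes out."""
    rng = random.Random(seed)  # private RNG, so no global state leaks in
    lines = ["member_id,age,propensity"]
    for i in range(n_members):
        age = rng.randint(65, 95)
        propensity = round(rng.lognormvariate(0.0, 0.6), 4)
        lines.append(f"M{i:05d},{age},{propensity}")
    return ("\n".join(lines) + "\n").encode("utf-8")


def digest(data: bytes) -> str:
    """SHA-256 of the rendered bytes, used to compare two runs."""
    return hashlib.sha256(data).hexdigest()


same_a = digest(render_population(seed=42))
same_b = digest(render_population(seed=42))
other = digest(render_population(seed=43))
```

Comparing `same_a` with `same_b` is the byte-identity check; `other` shows that a different seed yields a different file. A real check would also pin the config snapshot, which this sketch leaves out.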
A single synthetic person with one coherent clinical life and a 36-month journey.
The claim that bills the CHF admit, the eGFR that tracks the CKD, the A1c that tracks the diabetes, the open quality gap, and the risk-adjusted revenue all come from the same person — so they agree by construction, not by reconciliation.
Calibration · the Medicare Advantage release
Calibrated to published benchmarks — not real claims.
The sections that follow drill into our first line of business, Medicare Advantage — the same archetype engine, instantiated and calibrated for one population. The engine's aggregate outputs are tuned to match published statistics: average risk score, medical loss ratio, PMPM by service bucket, utilization per 1,000, and HCC prevalence. Every calibration target is anchored to an explicitly-cited public source and checked automatically at release.
Read this part carefully. We calibrate to published aggregate statistics. We are not trained on real row-level claims; no member's real record is anywhere in the pipeline. Learning patterns directly from licensed real claims is on the roadmap, and we say so plainly below.
- CMS Advance & Final Rate Notices: Risk-score normalization, coding-intensity adjustment, county base rates, sequestration.
- MedPAC March Report (MA chapter): Utilization per 1,000 (IP admits, ER visits, PCP and specialist office visits) and spending mix.
- CMS HCC Model software & coefficient tables: V24 and V28 risk-score construction and the blended phase-in spread.
- CMS Report to Congress (Risk Adjustment): HCC prevalence targets for the top conditions across the population.
- Public MA insurer 10-Ks: Medical loss ratio and PMPM economics for Humana, UnitedHealth, Elevance, CVS/Aetna, and Centene.
- CMS Part D Final Rule & PCUG: Benefit-phase structure, the IRA OOP cap, LIS cost-sharing, and MMR reason codes.
Factors we model · the Medicare Advantage release
Each release reproduces more of the messy truth.
The core of every version bump is factor coverage — the number of real-world phenomena the engine actually reproduces. Below is what v2 “Meridian” models for Medicare Advantage and how far it moved from v1. Each new factor is validated against clinical guidance or a published rate wherever one exists.
Acute episode library
Coherent claim bundles for acute events — CHF exacerbation, AMI, sepsis, hip fracture, stroke — that fire stochastically and emit an ER → IP → discharge-meds → follow-up sequence with consistent dates and diagnoses.
Member-level utilization persistence
A latent per-member propensity multiplier, drawn once at creation, governs every utilization draw — so high-utilizers persist. Targets H1→H2 spend correlation of 0.55–0.70 and top-10% stickiness of 40–55%.
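To see why a propensity drawn once produces persistence, the sketch below compares two toy simulations: one where each member keeps a single latent multiplier for all 36 months, and one where the multiplier is redrawn every month (the "per-row dice roll"). All distributions and parameters here are illustrative assumptions, not the engine's calibrated values, and the sketch makes no attempt to land inside the 0.55-0.70 target band.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def h1_h2_correlation(seed, n_members=2000, persistent=True):
    """Correlation of first-half vs second-half spend across members."""
    rng = random.Random(seed)
    h1, h2 = [], []
    for _ in range(n_members):
        latent = rng.lognormvariate(0.0, 0.6)  # assumed shape; drawn once at creation
        halves = []
        for _half in range(2):
            total = 0.0
            for _month in range(18):
                # Persistent: reuse the latent draw. Otherwise: redraw every month.
                p = latent if persistent else rng.lognormvariate(0.0, 0.6)
                total += 500.0 * p * rng.lognormvariate(0.0, 1.0)  # noisy monthly spend
            halves.append(total)
        h1.append(halves[0])
        h2.append(halves[1])
    return pearson(h1, h2)


corr_persistent = h1_h2_correlation(seed=7, persistent=True)
corr_per_row = h1_h2_correlation(seed=7, persistent=False)
```

In the persistent run the correlation is strongly positive, while the per-row run hovers near zero. The width of the latent distribution relative to the monthly noise is the lever a calibration would tune to hit a target band.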
5/50 cost concentration & a well cohort
A carved-out well cohort (no HCCs, minimal care) and a ~1% super-utilizer tail produce a realistic concentration curve: top 1% ≈ 25–30% of spend, top 5% ≈ 50%, and 3–6% of members with $0 medical spend over three years.
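The concentration figures above are straightforward to recompute on any member-level spend file. The helper names below are hypothetical, and the ten-member toy list exists only to make the arithmetic easy to follow; it is not sample output.

```python
def top_share(spend, fraction):
    """Share of total spend contributed by the top `fraction` of members."""
    ranked = sorted(spend, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # at least one member in the top group
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


def zero_spend_rate(spend):
    """Fraction of members with exactly $0 spend over the window."""
    return sum(1 for s in spend if s == 0) / len(spend)


# Toy three-year totals for ten members (illustrative only).
toy = [100, 50, 25, 25, 0, 0, 0, 0, 0, 0]
share_top_10pct = top_share(toy, 0.10)   # 100 / 200
share_top_20pct = top_share(toy, 0.20)   # 150 / 200
zero_rate = zero_spend_rate(toy)         # 6 / 10
```

On a real download, `top_share(spend, 0.01)` and `top_share(spend, 0.05)` are the two numbers to compare against the stated targets.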
Monthly seasonality
Month-of-year multipliers on ER, IP, surgery, and office visits — a Q1 flu/respiratory peak (~1.10–1.15) and a December trough (~0.90–0.92) — so service dates carry the winter shape real claims have.
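A minimal sketch of how month-of-year multipliers can be applied to a flat annual rate. Only the Q1 peak and December trough ranges come from the text above; every other value in the table is an assumption, chosen so that the twelve multipliers average to 1.0 and the annual total is preserved.

```python
# Hypothetical multipliers. Jan-Mar and Dec follow the ranges quoted above;
# the remaining months are placeholders chosen so the table sums to 12.
SEASONALITY = {
    1: 1.13, 2: 1.12, 3: 1.10, 4: 0.99, 5: 0.97, 6: 0.96,
    7: 0.94, 8: 0.94, 9: 0.97, 10: 0.99, 11: 0.98, 12: 0.91,
}


def seasonal_rate(annual_rate_per_1000, month):
    """Expected events per 1,000 members in one calendar month."""
    return annual_rate_per_1000 / 12.0 * SEASONALITY[month]


annual = 1200.0  # illustrative annual ER visits per 1,000, not a benchmark
january = seasonal_rate(annual, 1)
december = seasonal_rate(annual, 12)
year_total = sum(seasonal_rate(annual, m) for m in range(1, 13))
```

Because the multipliers average to 1.0, `year_total` recovers the annual rate, so seasonality reshapes service dates without shifting the calibrated utilization per 1,000.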
Dual-vs-non-dual MLR
A social-determinants cost multiplier on full-dual members corrects v1's inverted loss ratios. Dual MLR now runs 2–4 points above non-dual, matching real MA disclosures.
HCC-driven journeys & comorbidity
Conditions are sampled with a pairwise correlation structure (CHF↔CKD, diabetes↔CKD, diabetes↔vascular) and distinct diagnosis pools per HCC, targeting realistic comorbidity — DM+CKD ≈ 10%, the DM+CKD+CHF triad ≈ 3–4%.
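One simple way to get conditions that travel together is to sample the second condition conditionally on the first. This is a deliberately simplified stand-in for a pairwise correlation structure, and the three probabilities are assumptions picked only so that the joint DM+CKD rate lands near 10%.

```python
import random

# Assumed, illustrative probabilities (not calibration targets from any source).
P_DM = 0.27
P_CKD_GIVEN_DM = 0.37
P_CKD_GIVEN_NO_DM = 0.15


def sample_conditions(rng):
    """Draw (has_dm, has_ckd) with CKD more likely when diabetes is present."""
    dm = rng.random() < P_DM
    ckd = rng.random() < (P_CKD_GIVEN_DM if dm else P_CKD_GIVEN_NO_DM)
    return dm, ckd


def prevalence(seed, n=20000):
    """Empirical marginal and joint prevalence over n simulated members."""
    rng = random.Random(seed)
    members = [sample_conditions(rng) for _ in range(n)]
    p_dm = sum(dm for dm, _ in members) / n
    p_ckd = sum(ckd for _, ckd in members) / n
    p_both = sum(dm and ckd for dm, ckd in members) / n
    return p_dm, p_ckd, p_both


p_dm, p_ckd, p_both = prevalence(seed=11)
independent_baseline = p_dm * p_ckd  # what the joint rate would be with no correlation
```

The useful comparison is `p_both` against `independent_baseline`: if conditions were sampled independently the two would be roughly equal, and the comorbidity rate would come out too low.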
Rx refill chains & Part D phases
NDC-level fills with refill chains and a running per-member OOP accumulator that flips cost-sharing across deductible → initial → catastrophic phases, with LIS caps and the IRA's $2,000 out-of-pocket cap that took effect in 2025.
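The accumulator logic can be sketched in a few lines. This is a simplified illustration, assuming a three-phase benefit with the parameter values shown (treat them as placeholders to verify against the CMS benefit parameters for the plan year), no LIS adjustments, and member payments as the only thing that counts toward the cap. The function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartDParams:
    # Assumed illustrative values; confirm against CMS for the plan year.
    deductible: float = 590.0
    initial_coinsurance: float = 0.25
    oop_cap: float = 2000.0


def phase_of(oop, params):
    """Benefit phase implied by the member's accumulated out-of-pocket spend."""
    if oop < params.deductible:
        return "deductible"
    if oop < params.oop_cap:
        return "initial"
    return "catastrophic"


def member_cost_share(drug_cost, oop_so_far, params=PartDParams()):
    """Return (member_pays, phase_at_start) for one fill.

    A single fill may straddle phases, so the cost is split across them.
    """
    phase_at_start = phase_of(oop_so_far, params)
    remaining, oop, paid = drug_cost, oop_so_far, 0.0
    if oop < params.deductible:  # member pays 100% until the deductible is met
        in_deductible = min(remaining, params.deductible - oop)
        paid += in_deductible
        oop += in_deductible
        remaining -= in_deductible
    if remaining > 0 and oop < params.oop_cap:  # coinsurance, limited by the cap
        paid += min(remaining * params.initial_coinsurance, params.oop_cap - oop)
    return round(paid, 2), phase_at_start


# Walk twelve identical fills through the accumulator.
oop_running = 0.0
for _ in range(12):
    paid, _phase = member_cost_share(900.0, oop_running)
    oop_running += paid
```

After enough fills `oop_running` stops growing at the cap, which is the behavior the phase flip is meant to guarantee.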
Paid-date lag & run-out
Bucket-specific paid-date development — IP and SNF lag longer than professional and Rx — producing a lag triangle smooth enough for chain-ladder IBNR, with a genuine 6–24 month run-out tail.
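For readers who want to test the lag triangle, a basic volume-weighted chain-ladder calculation looks like the sketch below. The triangle values are toy numbers chosen for easy arithmetic, and the function names are hypothetical; a production IBNR method would add tail factors and credibility adjustments that are omitted here.

```python
def development_factors(triangle):
    """Volume-weighted age-to-age factors from a cumulative paid triangle.

    Each row is one incurred period; each column is cumulative paid at that lag.
    """
    n_lags = max(len(row) for row in triangle)
    factors = []
    for lag in range(n_lags - 1):
        rows = [r for r in triangle if len(r) > lag + 1]  # rows observed at both lags
        factors.append(sum(r[lag + 1] for r in rows) / sum(r[lag] for r in rows))
    return factors


def ibnr(triangle):
    """Projected ultimate minus paid-to-date, summed over incurred periods."""
    factors = development_factors(triangle)
    total = 0.0
    for row in triangle:
        ultimate = row[-1]
        for f in factors[len(row) - 1:]:  # apply only the factors not yet observed
            ultimate *= f
        total += ultimate - row[-1]
    return total


# Toy cumulative triangle: oldest incurred month first.
toy_triangle = [
    [100.0, 150.0, 165.0],
    [110.0, 165.0],
    [120.0],
]
toy_factors = development_factors(toy_triangle)  # 315/210 and 165/150
toy_ibnr = ibnr(toy_triangle)
```

Here the factors work out to 1.5 and 1.1, and the projected reserve is about 94.5 (16.5 for the middle row plus 78 for the newest). Running the same functions per service bucket is one way to confirm that IP and SNF develop more slowly than professional and Rx.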
Adjustments, denials & reversals
A full claim-process enumeration — partial pays, denials, pends, adjustments, reversals, reprocessing — with every non-original row linked to its parent so adjustment chains reconcile to net plan liability.
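Parent links are what make reconciliation possible: every row can be traced to its original claim, and the paid amounts along the chain sum to the net liability. The sketch below assumes hypothetical field names (`claim_id`, `parent_id`, `paid`) and signed paid amounts; the delivered schema may differ.

```python
from collections import defaultdict

# Toy rows: one claim that is reversed and reprocessed, one untouched claim.
rows = [
    {"claim_id": "C1", "parent_id": None, "type": "original", "paid": 100.0},
    {"claim_id": "C1-R", "parent_id": "C1", "type": "reversal", "paid": -100.0},
    {"claim_id": "C1-A", "parent_id": "C1-R", "type": "reprocess", "paid": 80.0},
    {"claim_id": "C2", "parent_id": None, "type": "original", "paid": 40.0},
]


def root_of(claim_id, parents):
    """Follow parent links up to the original claim, guarding against cycles."""
    seen = set()
    while parents[claim_id] is not None:  # KeyError here flags an unlinked row
        if claim_id in seen:
            raise ValueError(f"cycle in adjustment chain at {claim_id}")
        seen.add(claim_id)
        claim_id = parents[claim_id]
    return claim_id


def net_liability(rows):
    """Sum signed paid amounts per original claim."""
    parents = {r["claim_id"]: r["parent_id"] for r in rows}
    net = defaultdict(float)
    for r in rows:
        net[root_of(r["claim_id"], parents)] += r["paid"]
    return dict(net)


net = net_liability(rows)
```

In this toy data `C1` nets to 80.0 after the reversal and reprocessing, and `C2` stays at 40.0. A row whose parent is missing raises a `KeyError`, which is a cheap integrity check on the linkage.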
The fidelity framework
We treat fidelity like an eval.
Synthetic healthcare data is only worth buying if it matches reality — so we measure it the way you'd measure a model. Every release runs an automated credibility audit that compares dozens of metrics against their benchmarks and rolls them up into a composite fidelity score from 0 to 100. The audit exits non-zero if any headline metric drifts more than its tolerance, so a regression can't ship silently. The scorecard below is the Medicare Advantage release; every line of business gets its own audit against its own benchmarks.
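The gating behavior can be sketched as follows. The metric names, targets, actuals, and tolerances are placeholders rather than real benchmark values, and the roll-up shown (percent of metrics within tolerance) is an assumed stand-in, since the actual composite formula is not described on this page.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    target: float
    actual: float
    tolerance: float      # maximum allowed absolute drift from target
    headline: bool = False

    @property
    def delta(self):
        return self.actual - self.target

    @property
    def within(self):
        return abs(self.delta) <= self.tolerance


def composite_score(metrics):
    """Hypothetical roll-up: percent of metrics inside tolerance, 0 to 100."""
    return 100.0 * sum(m.within for m in metrics) / len(metrics)


def audit_exit_code(metrics):
    """Non-zero if any headline metric drifts beyond its tolerance."""
    return 1 if any(m.headline and not m.within for m in metrics) else 0


# Placeholder values for illustration only.
metrics = [
    Metric("avg_risk_score", target=1.00, actual=1.02, tolerance=0.05, headline=True),
    Metric("mlr", target=0.86, actual=0.93, tolerance=0.03, headline=True),
    Metric("er_visits_per_1000", target=500.0, actual=510.0, tolerance=25.0),
]
score = composite_score(metrics)
exit_code = audit_exit_code(metrics)
# A release script would finish with: sys.exit(exit_code)
```

With these placeholders the second headline metric is out of tolerance, so `exit_code` is 1 and the release would be blocked even though two of three metrics pass.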
Every purchased dataset carries its own full audit in the report bundle — target, actual, delta, and source citation for each metric — so you can re-derive the score against the exact files you received.
Figures are illustrative. Each dataset's real audit ships in its report bundle.
Honest limitations
Synthetic is not real. Here's where it shows.
We'd rather you trust the data because we told you its edges than discover them mid-project. These are the limitations we hold ourselves accountable to disclosing.
- Tail events are approximations
Rare-disease cohorts and extreme catastrophic claims are modeled to plausible rates, not drawn from observed reality. Use them for shape, not for pricing a specific rare-disease bundle.
- Provider networks are stylized
Provider attribution and referral patterns are simplified. Network-adequacy, steerage, and true referral graphs are not yet faithfully reproduced.
- Calibrated to aggregates, not your book
Because we calibrate to published national statistics, the data won't capture a specific plan's geography, benefit quirks, or contract idiosyncrasies.
- Not for production decisions
Not for regulatory filings, rate setting, bid pricing, claims adjudication, or clinical decisions. It is built for analytics, model development, benchmarking, and product testing.
- One line of business, four domains — today
Shipping now is Medicare Advantage across eligibility, medical claims, Rx, and revenue. Encounters, labs, quality measures, and other lines of business are still expanding: they run on the same archetype engine but are not yet generally available. Don't plan around a render target until it's marked live.
Available now vs expanding
One engine, every render target — shipped honestly.
The architecture renders any data domain across any line of business from the same archetypes. We label exactly what that means in practice today, so you never plan around something that isn't live yet.
- Available now · Eligibility & enrollment: Member-month enrollment, demographics, plan/program, and benefit status. This is the spine every other domain links to.
- Available now · Medical claims: Line-level institutional + professional claims with diagnoses, procedures, settings, allowed/paid, and realistic adjustment chains.
- Available now · Pharmacy: NCPDP-grade drug fills with refill chains, benefit phases, formulary tiers, and net-of-rebate economics.
- Available now · Revenue & payment: Payer-side revenue the way a plan receives it, including capitation, risk scores, and the factors behind every dollar.
- Expanding · Encounters: Encounter-level utilization independent of billing, the visit-and-service record health systems and value-based programs run on.
- Expanding · Labs & results: Ordered tests with realistic result values trended to each member's conditions, such as A1c that tracks the diabetic and eGFR that tracks the CKD.
- Expanding · Quality measures: Measure-ready numerators, denominators, and gaps (HEDIS-style / Stars) rendered from each member's actual care.
- Available now · Medicare Advantage: Our first line, shipping today, with HCC risk adjustment, MMR revenue, Part D, and dual/LIS dynamics.
- Expanding · Medicare FFS: Traditional fee-for-service Medicare, the benchmark population behind most ACO and VBC work.
- Expanding · Commercial / employer: Working-age commercial populations with a different age mix, benefit design, and cost curve.
- Expanding · Medicaid: Managed Medicaid with its own eligibility churn, demographics, and program structure.
- Expanding · ACA / exchange: Marketplace populations with HHS-HCC risk adjustment and metal-tier benefit design.
Expanding render targets reuse the same archetypes and the same deterministic engine — and each ships only once it has its own calibration to published benchmarks and its own fidelity audit. None are generally available until marked live.
Roadmap
Where the fidelity goes next.
Beyond widening the render targets above, the biggest item addresses the calibration limitation directly. Today we calibrate to published aggregates; next, we plan to learn patterns directly from licensed real, row-level data.
Training on real, row-level data (v3)
Roadmap · Train on licensed real claims to derive distributions and pathways directly from observed data, moving from calibrated-to-aggregates toward learned-from-reality. No real record appears in today's pipeline; this is the planned v3 shift.
New domains: encounters, labs, quality
Roadmap · Render encounter-level utilization, ordered labs with condition-trended results, and HEDIS-style quality gaps from the same archetypes, each calibrated and audited before it goes live.
New lines of business
Roadmap · Medicare FFS, commercial, Medicaid, and ACA / exchange populations on the same reproducible engine, each with its own age mix, benefit design, risk model, and cost curve.
SNP cohort modeling
Roadmap · Explicit D-SNP, C-SNP, and I-SNP structures with their distinct utilization, revenue, and plan/contract characteristics.
Provider continuity & referral networks
Roadmap · Stable attributed PCPs and specialist panels, referral graphs, and provider-level continuity to support real VBC and network analysis.
Readmission & post-acute pathways
Roadmap · 30-day readmission, SNF, and home-health pathways calibrated to clinical literature targets (CHF ~22%, AMI ~17%, ortho SNF use ~60–80%).
Don't take our word for the fidelity.
Download the 1,000-member sample — full schema, full credibility audit, no signup. Run your own checks against the numbers on this page.