Row-level provenance in real-world evidence studies with lineager

Real-world evidence (RWE) studies are increasingly used to support regulatory and payer decisions, but they face a documentation problem that randomised trials have largely solved. In a clinical trial, the analysis population is defined prospectively, the exclusion criteria are pre-specified, and the ADaM derivation is auditable. In a retrospective RWE study, the cohort definition is operationalised in code, and the documentation is usually a flow chart produced at the end.

The gap matters. NICE, ICER, and FDA's RWE programme all ask how closely the operational cohort matches the pre-specified eligibility criteria. When a reviewer asks "why was patient PT-0038291 excluded?" the expected answer is not "let me check the code." lineager makes the answer immediate and complete.

The new-user design problem

The new-user active comparator design is widely regarded as the methodological gold standard for comparative effectiveness research. But it involves a series of eligibility criteria (minimum lookback, washout period, prevalent-user exclusion, contraindication screen), each of which excludes a subset of the initial patient pool. A study comparing GLP-1 receptor agonists to DPP-4 inhibitors in type 2 diabetes might start with 25,000 patients and arrive at about 6,000 after applying all criteria. The question is not just how many were excluded at each step, but which ones, and whether the exclusion logic faithfully implements the statistical analysis plan (SAP).

Tagging the source data

lg_start(
  analyst  = "Ndoh Penn",
  study    = "T2DM-GLP1-RWE-001",
  purpose  = "New-user active comparator cohort derivation",
  sap_ref  = "SAP v1.2, Section 3 — Cohort Definition"
)

claims <- lg_tag(claims_raw, name = "CLAIMS_RAW", id_var = "PATID")
# 25,000 patients tagged with lineage IDs

Documenting every exclusion criterion

Each eligibility criterion is a filter with a mandatory reason. The reason is not a code comment — it is a structured log entry tied to the specific patient records it affects:

# Criterion 1: Age 40–89
cohort <- lg_filter(
  claims,
  AGE_INDEX >= 40L & AGE_INDEX <= 89L,
  reason = "Eligible age range 40–89 years at index date per SAP Section 3.1.
            Excludes paediatric patients and those with extreme age where
            T2DM prevalence and drug indication differ substantially."
)

# Criterion 2: Confirmed T2DM
cohort <- lg_filter(
  cohort,
  T2DM_DX == TRUE,
  reason = "Require confirmed T2DM diagnosis (ICD-10 E11.x) in 12 months
            prior to index date per SAP Section 3.1.2."
)

# Criterion 3: 12-month continuous enrolment lookback
cohort <- lg_filter(
  cohort,
  LOOKBACK_MO >= 12L,
  reason = "Require >= 12 months continuous enrolment prior to index date
            per SAP Section 3.2. Ensures adequate capture of prior drug use
            and comorbidities for PS model and new-user ascertainment."
)

# Criterion 4: New user — no prior use of index drug class
cohort <- lg_filter(
  cohort,
  PRIOR_USE == FALSE,
  reason = "Exclude prevalent users: no dispensing of index drug class in
            12-month lookback per SAP Section 3.3 (new-user design).
            Prevents depletion of susceptibles bias."
)

# Criterion 5: No insulin at index
cohort <- lg_filter(
  cohort,
  INSULIN_USE == FALSE,
  reason = "Exclude patients on insulin at index date per SAP Section 3.4.
            Insulin represents a different disease stage; differential
            indication for GLP-1 RA versus DPP-4i in this setting."
)

# Criterion 6: No end-stage renal disease
cohort <- lg_filter(
  cohort,
  ESRD == FALSE,
  reason = "Exclude ESRD (CKD Stage 5 or RRT) per SAP Section 3.5.
            DPP-4i requires dose adjustment in severe CKD — introduces
            differential measurement error in comparator arm."
)

# Criterion 7: 12 months follow-up available
cohort <- lg_filter(
  cohort,
  FOLLOW_MO >= 12L,
  reason = "Require >= 12 months follow-up for primary 12-month MACE outcome
            per SAP Section 3.6."
)
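
To make the idea of "a filter with a mandatory reason" concrete, here is a minimal base-R sketch of the pattern on five made-up patients. The helper filter_with_reason() and the registry layout are hypothetical and exist only for illustration; they are not lineager functions and say nothing about how lineager is implemented internally.

```r
# Hypothetical sketch of a reason-logging filter (not lineager's implementation).
registry <- new.env()
registry$steps <- list()

filter_with_reason <- function(data, keep, reason, id_var = "PATID") {
  # The reason is mandatory: fail loudly if it is missing or empty
  stopifnot(is.character(reason), length(reason) == 1L, nzchar(reason))
  keep <- !is.na(keep) & keep  # treat a missing criterion value as not eligible
  step <- list(
    reason       = reason,
    excluded_ids = data[[id_var]][!keep],  # which patients, not just how many
    n_remaining  = sum(keep)
  )
  registry$steps[[length(registry$steps) + 1L]] <- step
  data[keep, , drop = FALSE]
}

# Toy data: five made-up patients
toy <- data.frame(
  PATID     = c("P1", "P2", "P3", "P4", "P5"),
  AGE_INDEX = c(35L, 57L, 62L, 91L, 48L),
  PRIOR_USE = c(FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

cohort <- filter_with_reason(
  toy, toy$AGE_INDEX >= 40L & toy$AGE_INDEX <= 89L,
  reason = "Age 40-89 at index"
)
cohort <- filter_with_reason(
  cohort, !cohort$PRIOR_USE,
  reason = "New user (no prior use)"
)

registry$steps[[1]]$excluded_ids  # patients removed by the age criterion
registry$steps[[2]]$excluded_ids  # patients removed by the new-user criterion
cohort$PATID                      # patients remaining after both criteria
```

The design point is that the reason and the affected patient IDs are captured in the same call that performs the filtering, so the log cannot drift from the code.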

The exclusion registry

After applying all criteria, lg_exclusions() returns the complete derivation record:

lg_exclusions()

# step  reason                              n_excluded  n_remaining
# 1     Age 40-89                           3,841       21,159
# 2     T2DM confirmed                      2,548       18,611
# 3     12m enrolment lookback              1,923       16,688
# 4     New user (no prior use)             4,709       11,979
# 5     No insulin at index                 2,156        9,823
# 6     No ESRD                               394        9,429
# 7     12m follow-up available             3,287        6,142

This is a CONSORT-style attrition table, generated automatically from the derivation rather than maintained separately. The numbers are always consistent with the code because they are produced by the code.
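
As a quick worked check, the counts in the table can be verified arithmetically: each n_remaining should equal the starting pool minus the cumulative exclusions. The sketch below copies the counts from the output above and assumes the 25,000-patient starting pool stated earlier; the variable names are illustrative.

```r
# Counts copied from the exclusion table above
n_start     <- 25000L
n_excluded  <- c(3841L, 2548L, 1923L, 4709L, 2156L, 394L, 3287L)
n_remaining <- c(21159L, 18611L, 16688L, 11979L, 9823L, 9429L, 6142L)

# Each n_remaining must equal the starting pool minus cumulative exclusions
stopifnot(identical(n_start - cumsum(n_excluded), n_remaining))

# Percentage of the initial pool retained in the final cohort
pct_retained <- round(100 * n_remaining[length(n_remaining)] / n_start, 1)
pct_retained  # 24.6
```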

Tracing any patient

When a reviewer questions a specific patient's inclusion or exclusion:

lg_trace("PT-0038291")

# Patient: PT-0038291
# ─────────────────────────────────────────────────────────────
# CLAIMS_RAW (source)     → tagged at row 38291
# After age criterion     → retained (AGE_INDEX = 57)
# After T2DM criterion    → retained (T2DM_DX = TRUE)
# After lookback criterion → retained (LOOKBACK_MO = 18)
# After new-user criterion → EXCLUDED at step 4
#   Reason: No prior use of index drug class in 12-month lookback
#   PRIOR_USE = TRUE for this patient
# ─────────────────────────────────────────────────────────────

The answer is immediate, precise, and does not require re-running the derivation or examining the source data.
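
Conceptually, a trace of this kind is a lookup against a per-patient exclusion log. The following base-R sketch shows the idea on made-up records; the exclusion_log layout, the patient IDs, and the trace_patient() helper are hypothetical, and lineager's internal storage and output format may differ.

```r
# Hypothetical exclusion log: one row per excluded patient (made-up values)
exclusion_log <- data.frame(
  PATID  = c("PT-0000001", "PT-0000002", "PT-0000003"),
  step   = c(1L, 4L, 4L),
  reason = c("Age 40-89", "New user (no prior use)", "New user (no prior use)"),
  stringsAsFactors = FALSE
)

trace_patient <- function(id, log) {
  hit <- log[log$PATID == id, , drop = FALSE]
  if (nrow(hit) == 0L) {
    # No log entry: the patient was not excluded at any logged step
    return(sprintf("%s: no exclusion logged", id))
  }
  sprintf("%s: EXCLUDED at step %d (%s)", id, hit$step[1], hit$reason[1])
}

trace_patient("PT-0000002", exclusion_log)
trace_patient("PT-0000009", exclusion_log)
```

Because the lookup reads only the log, it needs neither the source data nor a re-run of the derivation.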

Why this matters for RWE reviewers

NICE's real-world evidence framework, ICER's value assessment methodology, and FDA's RWE programme all ask the same core question: does the operational cohort match the pre-specified protocol? Aggregate flow charts answer this at the population level. Row-level provenance answers it at the patient level.

When a sensitivity analysis re-runs the derivation with a different lookback window, lineager shows exactly which patients changed status — not just the count difference. When an independent analyst replicates the study, they can verify not just that the final N matches, but that the same patients are included and excluded at each step.
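
The patient-level comparison described here amounts to a set difference on patient IDs. A minimal base-R sketch on made-up data follows; the 6-month sensitivity window and all values are assumptions chosen for illustration, not taken from the study.

```r
# Made-up patients with differing lengths of continuous enrolment (months)
toy <- data.frame(
  PATID       = c("P1", "P2", "P3", "P4", "P5"),
  LOOKBACK_MO = c(6L, 12L, 9L, 18L, 24L),
  stringsAsFactors = FALSE
)

ids_primary     <- toy$PATID[toy$LOOKBACK_MO >= 12L]  # primary: 12-month lookback
ids_sensitivity <- toy$PATID[toy$LOOKBACK_MO >= 6L]   # sensitivity: 6-month lookback (assumed)

gained <- setdiff(ids_sensitivity, ids_primary)  # enter the cohort under the shorter window
lost   <- setdiff(ids_primary, ids_sensitivity)  # leave the cohort under the shorter window

gained
lost
```

The same comparison applies to independent replication: two analysts' cohorts agree at the patient level only if both set differences are empty at every step.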

The distinction between a reproducible RWE study and a defensible one is provenance at the row level. lineager makes this the default rather than the exception.

Full R code for the complete cohort derivation, including propensity score (PS) model derivation, outcome analysis, and attrition waterfall chart, is in the lineager pkgdown article.

Questions or corrections — hello@reprostats.org.