Traceable ADaM derivation: every exclusion documented with lineager

Every ADaM derivation involves exclusions. Screen failures, missing baselines, protocol deviations, out-of-window assessments — they all remove rows from the analytical dataset. The standard documentation is aggregate: "47 subjects were excluded from the per-protocol population for the reasons listed in Table 14.1.3." Which 47? Why exactly? Can you trace subject DUP301-0082 from the ADLB source record through to the ADEFF analysis record and show every step in between?

In most trials, the answer is: eventually, with effort, after checking with the programmer. lineager makes it immediate, automatic, and self-documenting.

The idea: lineage IDs that travel with rows

The core concept is simple. When you tag a source dataset with lg_tag(), every row receives a unique lineage ID (.__lid__) built from the dataset name, the USUBJID, and a row number:

lg_start(analyst = "Ndoh Penn", study = "DERM-DUP-301",
         purpose = "ADEFF derivation — primary efficacy analysis")

adlb_tagged <- lg_tag(adlb, name = "ADLB", id_var = "USUBJID")

# Every row now carries:
# .__lid__ = "ADLB:DUP301-0001:row001"

This ID survives every subsequent operation. When you filter, the excluded rows are logged. When you join, both sides of the join are tracked. When you derive, the derivation description is recorded. At any point, lg_trace("DUP301-0082") returns that subject's complete journey.
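
To make the ID format concrete, here is a minimal base-R sketch of how an ID shaped like the one above could be assembled. It is not lineager's implementation: make_lid() is a hypothetical helper, the toy data are invented for illustration, and numbering rows by their position in the source dataset is an assumption.

```r
# Hypothetical helper, not part of lineager: builds IDs shaped like
# "ADLB:DUP301-0001:row001" from a dataset name, a subject ID, and the
# row's position in the dataset (the numbering scheme is an assumption).
make_lid <- function(data, name, id_var) {
  sprintf("%s:%s:row%03d", name, data[[id_var]], seq_len(nrow(data)))
}

# Toy stand-in for ADLB (values are illustrative only)
adlb_toy <- data.frame(
  USUBJID = c("DUP301-0001", "DUP301-0001", "DUP301-0002"),
  AVISITN = c(0L, 1L, 0L),
  AVAL    = c(30.0, 22.5, 41.2)
)
adlb_toy[[".__lid__"]] <- make_lid(adlb_toy, "ADLB", "USUBJID")

# The ID is an ordinary column, so subsetting keeps it attached to each row
kept <- adlb_toy[adlb_toy$AVISITN == 0L, ]
kept[[".__lid__"]]
# "ADLB:DUP301-0001:row001" "ADLB:DUP301-0002:row003"
```

Because the ID records the original row position, a surviving row can always be matched back to its source record, even after earlier rows have been dropped.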

Building ADEFF step by step

The derivation follows the standard ADaM ADEFF pattern: start from ADLB, apply population flags, merge ADSL covariates, derive baseline and change from baseline, flag responders. Each filter uses lg_filter() instead of dplyr::filter(), and each derivation uses lg_derive() instead of mutate():

# Step 1: Safety population
adlb_saf <- lg_filter(
  adlb_tagged,
  USUBJID %in% adsl$USUBJID[adsl$SAFFL == "Y"],
  reason = "Restrict to safety analysis set (SAFFL = Y) per SAP Section 4.1"
)

# Step 2: ITT population
adlb_itt <- lg_filter(
  adlb_saf,
  USUBJID %in% adsl$USUBJID[adsl$ITTFL == "Y"],
  reason = "Restrict to ITT population (ITTFL = Y) per SAP Section 4.2"
)

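# Step 3: Join ADSL for treatment and covariates
# (adsl_core is not defined in this excerpt; it is assumed to hold the ADSL columns needed downstream)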
adeff <- lg_join(
  adlb_itt, adsl_core, by = "USUBJID", type = "left",
  description = "Left join ADLB to ADSL for treatment and baseline covariates"
)

# Step 4: Derive baseline
adeff <- lg_derive(
  adeff,
  BASE = AVAL[AVISITN == 0L][1L],
  .by  = "USUBJID",
  description = "Derive baseline EASI score as AVAL at Visit 0"
)

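# Step 5: Exclude missing baseline
# paste() keeps the reason on one line; a string literal broken across
# source lines would store the newline and indentation in the registry.
adeff <- lg_filter(
  adeff,
  !is.na(BASE),
  reason = paste(
    "Exclude subjects with missing baseline EASI: required for",
    "change-from-baseline analysis per SAP Section 5.1"
  )
)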

# Step 6: Derive CHG and PCHG
adeff <- lg_derive(
  adeff,
  CHG  = AVAL - BASE,
  PCHG = (AVAL - BASE) / BASE * 100,
  description = "Derive CHG and PCHG per CDISC ADaM IG Section 3.2.7"
)

# Step 7: EASI-75 responder flag
adeff_final <- lg_derive(
  adeff,
  EASI75FL = ifelse(AVISITN > 0 & !is.na(PCHG),
                    ifelse(PCHG <= -75, "Y", "N"), NA_character_),
  description = "EASI-75 responder flag: Y if percent change <= -75%"
)

The syntax is almost identical to standard dplyr. The difference is what gets recorded.

What the exclusion registry looks like

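After the derivation, lg_exclusions() returns a data frame with one row per filter step. Steps are numbered by filter, not by the step comments in the code, and the output below comes from the full derivation, which applies two further filters (visit selection and missing post-baseline assessments) beyond the three shown above: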

lg_exclusions()

# step  reason                                          n_excluded  n_remaining
# 1     Restrict to safety set (SAFFL = Y)              6           1914
# 2     Restrict to ITT population (ITTFL = Y)          18          1896
# 3     Exclude subjects with missing baseline EASI     12          1884
# 4     Retain baseline and post-baseline visits only   320         1564
# 5     Exclude missing post-baseline EASI assessments  94          1470

Each row is an audit record, not a manually maintained count. If the derivation code changes, the registry updates automatically.
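
The bookkeeping behind a registry like this can be sketched in a few lines of base R. This is an illustration of the idea only, not lineager's implementation: logged_filter() and the registry environment are hypothetical names, and the toy data are invented.

```r
# Hypothetical sketch, not lineager's implementation: a filter that records
# what it removed in a registry held in an environment.
registry <- new.env()
registry$log <- data.frame(
  step = integer(), reason = character(),
  n_excluded = integer(), n_remaining = integer()
)

logged_filter <- function(data, keep, reason) {
  keep <- !is.na(keep) & keep   # NA conditions drop the row, as in dplyr::filter()
  out  <- data[keep, , drop = FALSE]
  # Append one audit record per call; counts come from the data, not from hand
  registry$log <- rbind(registry$log, data.frame(
    step        = nrow(registry$log) + 1L,
    reason      = reason,
    n_excluded  = sum(!keep),
    n_remaining = nrow(out)
  ))
  out
}

# Toy data (illustrative only)
toy <- data.frame(
  USUBJID = c("S1", "S1", "S2", "S3"),
  SAFFL   = c("Y", "Y", "N", "Y"),
  BASE    = c(30, 30, 25, NA)
)
step1 <- logged_filter(toy,   toy$SAFFL == "Y",   "Restrict to safety set")
step2 <- logged_filter(step1, !is.na(step1$BASE), "Exclude missing baseline")

registry$log   # two rows: 1 excluded / 3 remaining, then 1 excluded / 2 remaining
```

Because the counts are computed at the moment the filter runs, rerunning changed code regenerates the registry rather than leaving a stale table behind.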

CONSORT table from derivation objects

The CONSORT-style disposition table — the one that appears in Table 14.1.1 of most clinical study reports — is derived directly from the lineage objects, not from a separate count script:

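library(dplyr)  # for filter()

# adsl_tagged is not defined in this excerpt; it is assumed to be ADSL
# tagged with lg_tag() in the same way as ADLB.
lg_disposition(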
  groups = list(
    Randomised     = adsl_tagged,
    "Safety set"   = adsl_tagged |> filter(SAFFL  == "Y"),
    "ITT set"      = adsl_tagged |> filter(ITTFL  == "Y"),
    "PP set"       = adsl_tagged |> filter(PPROTFL == "Y"),
    "Week 16 EASI" = adeff_final |> filter(ANL01FL == "Y")
  ),
  group_var = "TRT01P"
)

The numbers in this table match the exclusion registry by construction. You cannot have a CONSORT table that disagrees with the derivation code.

Tracing a subject

For any subject, at any time:

lg_trace("DUP301-0082")

# Subject: DUP301-0082
# ─────────────────────────────────────────────────────────
# ADLB (source)         → 6 records tagged
# After SAFFL filter    → 6 records retained
# After ITTFL filter    → 6 records retained
# After baseline filter → 6 records retained (BASE = 38.4)
# After visit filter    → 5 records (baseline + 4 post-BL)
# ADEFF (final)         → 5 records, CHG at Week 16: -28.1
#                         EASI75FL at Week 16: N
# ─────────────────────────────────────────────────────────
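
The Week 16 values in this trace can be checked by hand against the Step 6 and Step 7 formulas. The snippet below is a plain base-R check that uses only the two numbers printed in the trace; the variable names are chosen for illustration.

```r
# Values taken from the trace for DUP301-0082
base <- 38.4    # baseline EASI
chg  <- -28.1   # change from baseline at Week 16

pchg     <- chg / base * 100                # percent change, as in Step 6
easi75fl <- ifelse(pchg <= -75, "Y", "N")   # responder rule, as in Step 7

round(pchg, 1)   # -73.2
easi75fl         # "N"
```

A reduction of about 73.2% falls just short of the 75% threshold, which is why the trace reports EASI75FL = N for this subject.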

And for a subject who was excluded:

lg_trace("DUP301-0019")

# Subject: DUP301-0019
# ─────────────────────────────────────────────────────────
# ADLB (source)          → 6 records tagged
# After SAFFL filter     → 6 records retained
# After ITTFL filter     → EXCLUDED at step 2
#   Reason: Restrict to ITT population (ITTFL = Y) per SAP Section 4.2
#   ITTFL = N for this subject
# ─────────────────────────────────────────────────────────
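
One way to picture how such a trace locates the exclusion point is to keep the dataset from each step and look for the first one in which the subject has no rows. The sketch below assumes exactly that; first_exclusion() is a hypothetical helper and the toy data are invented, so it says nothing about how lg_trace() works internally.

```r
# Hypothetical sketch: name the first step at which a subject has no rows left.
first_exclusion <- function(steps, subject, id_var = "USUBJID") {
  # Count the subject's rows in each step's dataset
  n <- vapply(steps, function(d) sum(d[[id_var]] == subject), integer(1))
  gone <- which(n == 0L)
  if (length(gone) == 0L) NA_character_ else names(steps)[gone[1L]]
}

# Toy data (illustrative only)
src <- data.frame(
  USUBJID = c("A", "A", "B", "B"),
  ITTFL   = c("Y", "Y", "N", "N")
)
steps <- list(
  "ADLB (source)"      = src,
  "After ITTFL filter" = src[src$ITTFL == "Y", ]
)

first_exclusion(steps, "B")   # "After ITTFL filter"
first_exclusion(steps, "A")   # NA: subject A is retained throughout
```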

The provenance report

lg_report(
  output  = "DERM-DUP-301_ADEFF_provenance_v1.html",
  title   = "ADEFF Provenance Report — DERM-DUP-301",
  version = "1.0",
  date    = as.Date("2026-06-26"),
  author  = "Ndoh Penn"
)

The report is a self-contained HTML document containing the full exclusion registry, CONSORT table, pipeline summary, and analyst information. It can be attached to the submission package, shared with QC programmers, or archived alongside the analysis dataset.

What changes for the programmer

Almost nothing. Replace filter() with lg_filter() and add a reason argument. Replace mutate() with lg_derive() and add a description. Replace the join with lg_join() and do the same. The analytical logic is identical. What changes is that the decisions embedded in the logic are now on permanent record, automatically aggregated, and traceable at the subject level.

The full Quarto article, with the complete R code, the data simulation, and rendered outputs, is available in the lineager pkgdown documentation.

Questions or corrections — hello@reprostats.org.