Row-Level Data Provenance and Exclusion Tracking
You build an analysis dataset. You filter it, join it, derive new variables, and produce results. Somewhere along the way rows disappear. Later, someone asks: “Which records were excluded, why, and what happened to record 01-042 between source and analysis?”
Without lineager, that answer requires manual reconstruction. With lineager, it is a single function call.
lineager tags every row of every dataset with a unique lineage identifier that survives filters, joins, and derivations. Every row removal must carry a documented reason. At any point, lg_trace() returns any row’s complete journey across the entire pipeline. lg_lineage() visualises the full pipeline graph. lg_report() compiles everything into a structured provenance document.
lineager is general-purpose: clinical data, machine learning, financial modelling, epidemiology — any pipeline where row-level accountability matters. CDISC-specific features (domain codes, population flags, SDTM-to-ADaM mapping, Reviewer’s Guide output) are available as optional enrichment for pharmaceutical users.
Quick start
```r
library(lineager)

lg_start(study_id = "TRIAL-001", analysis_id = "primary")

# Tag source data
adsl <- lg_tag(haven::read_sas("sdtm/dm.sas7bdat"),
               dataset_id = "DM", domain = "DM")

# Derive variables with documented descriptions
adsl <- lg_derive(adsl,
  RANDFL = ifelse(ARMCD != "SCRNFAIL", "Y", "N"),
  SAFFL  = ifelse(ARMCD != "SCRNFAIL" & EXOCCUR == "Y", "Y", "N"),
  description = "RANDFL: not screen failure. SAFFL: randomised AND dosed."
)

# Filter with mandatory exclusion reasons
adsl_safety <- lg_filter(
  adsl,
  SAFFL == "Y",
  reason      = "Not in safety population (SAFFL != 'Y')",
  reason_code = "NOT_SAFETY",
  population  = "SAFFL"
)

# Trace any subject across the pipeline
lg_trace("01-042")

# Exclusion registry and disposition table
lg_exclusions()
lg_disposition(by = "reason")

# Visualise the pipeline
lin <- lg_lineage()
lg_plot(lin)

# Generate provenance report
lg_report(
  output  = "outputs/provenance.html",
  title   = "Data Provenance Report",
  sponsor = "Example Pharma Ltd",
  author  = "Your name"
)

lg_end()
```
Key functions
| Function | Purpose |
|---|---|
| lg_start() / lg_end() | Session lifecycle |
| lg_tag() | Tag a dataset with row-level lineage IDs |
| lg_filter() | Filter with mandatory exclusion reason |
| lg_derive() | Derive variables with documented description |
| lg_join() | Tracked join with bilateral row-ID tracing |
| lg_population() | Register a population or cohort definition |
| lg_spec() | Document a source-to-analysis variable derivation |
| lg_trace() | Trace a row’s complete lineage journey |
| lg_exclusions() | Retrieve the full exclusion registry |
| lg_disposition() | Grouped exclusion summary table |
| lg_operations() | Full pipeline operation log |
| lg_lineage() | Build a pipeline lineage graph |
| lg_plot() | Render the lineage graph inline or export |
| lg_report() | Generate a structured HTML provenance report |
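
To make the idea behind lg_filter(), lg_exclusions(), and lg_disposition() concrete, here is a minimal base R sketch of a filter that refuses to drop rows without a reason and records what it dropped. It is not lineager's implementation: filter_with_reason(), the registry environment, and the toy data are hypothetical names used only for illustration.

```r
# Illustrative only: a toy exclusion registry in base R, not lineager's internals.
registry <- new.env()
assign("exclusions",
       data.frame(id = character(), reason_code = character(),
                  reason = character(), stringsAsFactors = FALSE),
       envir = registry)

# Hypothetical helper: keep rows where `keep` is TRUE, log the rest with a reason.
filter_with_reason <- function(data, keep, reason, reason_code) {
  if (missing(reason) || !is.character(reason) ||
      length(reason) != 1 || !nzchar(reason)) {
    stop("A documented exclusion reason is required.")
  }
  keep <- !is.na(keep) & keep               # rows with NA conditions are dropped
  dropped <- data[!keep, , drop = FALSE]
  if (nrow(dropped) > 0) {
    logged <- data.frame(id = dropped$id, reason_code = reason_code,
                         reason = reason, stringsAsFactors = FALSE)
    assign("exclusions", rbind(registry$exclusions, logged), envir = registry)
  }
  data[keep, , drop = FALSE]
}

subjects <- data.frame(
  id    = c("01-041", "01-042", "01-043"),
  SAFFL = c("Y", "N", "Y"),
  stringsAsFactors = FALSE
)

safety <- filter_with_reason(
  subjects, subjects$SAFFL == "Y",
  reason      = "Not in safety population (SAFFL != 'Y')",
  reason_code = "NOT_SAFETY"
)

registry$exclusions                      # one row, for subject 01-042
table(registry$exclusions$reason_code)   # exclusion counts per reason code
```

The point of the design is that the reason is checked before any row is removed, so an undocumented exclusion fails loudly instead of silently shrinking the dataset.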
The lineage ID
Every row carries a .__lid__ column. For CDISC datasets with USUBJID:
```
DM_0001_01-042    # row 1 from DM domain, subject 01-042
ADLB_0047_01-042  # row 47 from ADLB, same subject
```
For general datasets:
```
patients_000001   # row 1 from the patients dataset
```
This ID persists through lg_filter(), lg_derive(), and lg_join(), forming the traceable thread from any output row back to its source.
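
The ID shapes shown in this section can be reproduced with a few lines of base R. This is a sketch inferred from the sample IDs only: make_lid() is a hypothetical helper, and the zero-padding widths (4 digits when a subject ID is present, 6 otherwise) are assumptions read off the examples, not documented lineager behaviour.

```r
# Illustrative only: builds IDs shaped like the examples in this section.
# make_lid() is a hypothetical helper, not part of lineager.
make_lid <- function(dataset_id, row, usubjid = NULL) {
  if (is.null(usubjid)) {
    sprintf("%s_%06d", dataset_id, as.integer(row))
  } else {
    sprintf("%s_%04d_%s", dataset_id, as.integer(row), usubjid)
  }
}

dm <- data.frame(
  USUBJID = c("01-041", "01-042", "01-043"),
  ARMCD   = c("A", "SCRNFAIL", "B"),
  stringsAsFactors = FALSE
)

# Assign one ID per source row, before any filtering happens.
dm[[".__lid__"]] <- make_lid("DM", seq_len(nrow(dm)), dm$USUBJID)

# The ID is an ordinary column, so subsetting keeps it attached to the
# surviving rows; row 3 keeps its source row number after row 2 is dropped.
randomised <- dm[dm$ARMCD != "SCRNFAIL", ]
randomised[[".__lid__"]]
```

Because the ID is assigned once at the source and never recomputed, a row's identifier still points at its original position after later steps have reordered or removed its neighbours.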
CDISC features
For pharmaceutical and clinical users, lineager additionally supports:
- domain argument in lg_tag() for CDISC domain codes
- lg_population() for SAFFL, ITTFL, PPROTFL flag documentation
- lg_spec() for SDTM-to-ADaM variable derivation mapping
- lg_report() output aligned with CDISC Reviewer’s Guide requirements
None of these are required for general use.
Integration with regulog
lineager and regulog are complementary. Use regulog for a tamper-evident session-level audit trail (who ran what, when, and why), and lineager for row-level data provenance within that session. The lg_report() output can be referenced in the regulog audit trail via log_action().
