When building data pipelines, it’s easy to lose track of what happens at each step. How many rows were dropped by that filter? Did the join introduce duplicates? Which columns have missing values now?
tidyaudit’s audit trail captures metadata-only snapshots at each step of a pipe (row counts, column counts, NA totals, numeric summaries) without storing the data itself. This gives you a lightweight, structured record of your pipeline’s behavior. Taps also accept custom functions, so a trail can capture domain-specific diagnostics.
Start by creating a trail object and inserting
audit_tap() calls into your pipeline. Each tap records a
snapshot and passes the data through unchanged.
library(dplyr)
library(tidyaudit)

# Sample data
orders <- data.frame(
  id = 1:20,
  customer = rep(c("Alice", "Bob", "Carol", "Dan", "Eve"), 4),
  amount = c(150, 200, 50, 300, 75, 120, 400, 90, 250, 60,
             180, 210, 45, 320, 85, 130, 380, 95, 270, 55),
  status = rep(c("complete", "pending", "complete", "cancelled", "complete"), 4)
)

trail <- audit_trail("order_pipeline")

result <- orders |>
  audit_tap(trail, "raw") |>
  filter(status == "complete") |>
  audit_tap(trail, "complete_only") |>
  mutate(tax = amount * 0.1) |>
  audit_tap(trail, "with_tax")

Now print the trail to see the snapshot timeline:
print(trail)
#>
#> ── Audit Trail: "order_pipeline" ───────────────────────────────────────────────
#> Created: 2026-02-22 09:37:05
#> Snapshots: 3
#>
#> # Label Rows Cols NAs Type
#> ─ ───────────── ──── ──── ─── ────
#> 1 raw 20 4 0 tap
#> 2 complete_only 12 4 0 tap
#> 3 with_tax 12 5 0 tap
#>
#> Changes:
#> raw → complete_only: -8 rows, = cols, = NAs
#> complete_only → with_tax: = rows, +1 cols, = NAs

The timeline shows row counts, column counts, NA totals, and change summaries between consecutive steps. You can see exactly how many rows each filter removed and when columns were added.
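To make the idea of a metadata-only snapshot concrete, here is a minimal base R sketch. The `snapshot()` helper is hypothetical and is not part of tidyaudit; it only illustrates the kind of summary a tap could record without keeping the data, and how a row delta between two steps falls out of it.

```r
# Hypothetical helper (not tidyaudit's implementation): summarise a data
# frame without keeping a copy of the data itself.
snapshot <- function(data) {
  list(
    rows = nrow(data),
    cols = ncol(data),
    nas  = sum(is.na(data))
  )
}

df <- data.frame(
  id     = 1:5,
  status = c("complete", "pending", "complete", "cancelled", "complete"),
  amount = c(150, 200, 50, 300, 75)
)

before <- snapshot(df)
after  <- snapshot(df[df$status == "complete", ])

# Row change between the two steps, analogous to a "Changes:" line
delta <- after$rows - before$rows
delta
#> [1] -2
```

Because each snapshot is just a handful of numbers, keeping one per step costs almost nothing, even for large data.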
Plain audit_tap() records what the data looks like, but
it can’t tell you why it changed. Operation-aware taps such as
left_join_tap() and filter_tap() perform
the operation and record enriched diagnostics in the same step.
Replace dplyr::left_join() + audit_tap()
with left_join_tap() to capture match rates, relationship
type, and duplicate key information:
customers <- data.frame(
  customer = c("Alice", "Bob", "Carol", "Dan"),
  region = c("East", "West", "East", "North")
)

trail2 <- audit_trail("join_pipeline")

result2 <- orders |>
  audit_tap(trail2, "raw") |>
  left_join_tap(customers, by = "customer",
                .trail = trail2, .label = "with_region")
print(trail2)
#>
#> ── Audit Trail: "join_pipeline" ────────────────────────────────────────────────
#> Created: 2026-02-22 09:37:05
#> Snapshots: 2
#>
#> # Label Rows Cols NAs Type
#> ─ ─────────── ──── ──── ─── ────────────────────────────────────
#> 1 raw 20 4 0 tap
#> 2 with_region 20 5 4 left_join (many-to-one, 80% matched)
#>
#> Changes:
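#> raw → with_region: = rows, +1 cols, +4 NAs

The Type column now shows the join type, relationship,
and match rate, all without leaving the pipe. The four new NAs come from
Eve's orders: Eve has no row in customers, so region is missing for her.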
All six dplyr join types are supported: left_join_tap(),
right_join_tap(), inner_join_tap(),
full_join_tap(), anti_join_tap(),
semi_join_tap().
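The join diagnostics can be cross-checked by hand. The following base R sketch recomputes the match rate, the uniqueness of the right-hand key, and the NA count for the same two tables. It uses `merge()` rather than tidyaudit, and it is only an illustration of what the numbers mean, not a description of how the package computes them.

```r
orders <- data.frame(
  id       = 1:20,
  customer = rep(c("Alice", "Bob", "Carol", "Dan", "Eve"), 4)
)
customers <- data.frame(
  customer = c("Alice", "Bob", "Carol", "Dan"),
  region   = c("East", "West", "East", "North")
)

# Share of left-hand rows whose key appears in the right-hand table
match_rate <- mean(orders$customer %in% customers$customer)

# A unique right-hand key means each left row matches at most one right
# row, which is what makes the relationship many-to-one
right_key_unique <- anyDuplicated(customers$customer) == 0

# Base R left join; unmatched rows get NA in the new column
joined <- merge(orders, customers, by = "customer", all.x = TRUE)
new_nas <- sum(is.na(joined$region))

match_rate
#> [1] 0.8
new_nas
#> [1] 4
```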
filter_tap() keeps matching rows (like
dplyr::filter()) while recording how many rows were
dropped:
trail3 <- audit_trail("filter_pipeline")

result3 <- orders |>
  audit_tap(trail3, "raw") |>
  filter_tap(status == "complete",
             .trail = trail3, .label = "complete_only") |>
  filter_tap(amount > 100,
             .trail = trail3, .label = "high_value",
             .stat = amount)
#> ℹ filter_tap: status == "complete"
#> Dropped 8 of 20 rows (40.0%)
#> ℹ filter_tap: amount > 100
#> Dropped 8 of 12 rows (66.7%)
#> Stat amount: dropped 555 of 1,135
print(trail3)
#>
#> ── Audit Trail: "filter_pipeline" ──────────────────────────────────────────────
#> Created: 2026-02-22 09:37:05
#> Snapshots: 3
#>
#> # Label Rows Cols NAs Type
#> ─ ───────────── ──── ──── ─── ──────────────────────────────
#> 1 raw 20 4 0 tap
#> 2 complete_only 12 4 0 filter (dropped 8 rows, 40%)
#> 3 high_value 4 4 0 filter (dropped 8 rows, 66.7%)
#>
#> Changes:
#> raw → complete_only: -8 rows, = cols, = NAs
#> complete_only → high_value: -8 rows, = cols, = NAs

The .stat argument tracks a numeric column through the
filter, reporting how much of the total was dropped. Here the second filter
removed 555 of the 1,135 in amount that remained after the first. This is useful for
financial pipelines where you want to know the monetary impact of each
filter.
filter_out_tap() works the same way but drops matching
rows (the inverse).
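The arithmetic behind the second filter's message can be reproduced in a few lines of base R. The vector below holds the twelve amount values that survive the status == "complete" filter in the sample data; everything else is plain R and does not use tidyaudit.

```r
# Amounts of the 12 completed orders from the sample data
amount <- c(150, 50, 75, 120, 90, 60, 180, 45, 85, 130, 95, 55)
keep   <- amount > 100

rows_dropped <- sum(!keep)           # rows removed by the filter
stat_dropped <- sum(amount[!keep])   # amount carried by those rows
stat_total   <- sum(amount)          # amount before the filter
pct_rows     <- round(100 * rows_dropped / length(amount), 1)

c(rows_dropped, stat_dropped, stat_total, pct_rows)
#> [1]    8.0  555.0 1135.0   66.7
```

Row share and value share can differ a lot: two thirds of the rows are dropped here, but they carry just under half of the total amount.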
audit_diff() gives you a detailed before/after
comparison between any two snapshots in the trail:
audit_diff(trail3, "raw", "high_value")
#>
#> ── Audit Diff: "raw" → "high_value" ──
#>
#> Metric Before After Delta
#> ────── ────── ───── ─────
#> Rows 20 4 -16
#> Cols 4 4 =
#> NAs 0 0 =
#>
#> ℹ No columns added or removed
#>
#> Numeric shifts (common columns):
#> Column Mean before Mean after Shift
#> ────── ─────────── ────────── ──────
#> id 10.50 8.5 -2
#> amount 173.25 145.0 -28.25

This shows row/column/NA deltas, columns added or removed, and the shift in the mean of each numeric column present in both snapshots.
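The numeric shifts in the diff are differences of column means. As a rough cross-check in base R (an illustration only, not the package's internals), the same figures can be derived directly from the sample data:

```r
orders <- data.frame(
  id     = 1:20,
  amount = c(150, 200, 50, 300, 75, 120, 400, 90, 250, 60,
             180, 210, 45, 320, 85, 130, 380, 95, 270, 55),
  status = rep(c("complete", "pending", "complete", "cancelled", "complete"), 4)
)

before <- orders
after  <- subset(orders, status == "complete" & amount > 100)

# Numeric columns that exist in both states
numeric_cols <- names(before)[sapply(before, is.numeric)]
common       <- intersect(numeric_cols, names(after))

# Mean after minus mean before, per column
shift <- colMeans(after[common]) - colMeans(before[common])
shift
#>     id amount 
#>  -2.00 -28.25
```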
audit_report() prints the complete trail summary plus
all consecutive diffs in one call:
audit_report(trail3)
#> ── Audit Report: "filter_pipeline" ─────────────────────────────────────────────
#> Created: 2026-02-22 09:37:05
#> Total snapshots: 3
#>
#> ── Audit Trail: "filter_pipeline" ──────────────────────────────────────────────
#> Created: 2026-02-22 09:37:05
#> Snapshots: 3
#>
#> # Label Rows Cols NAs Type
#> ─ ───────────── ──── ──── ─── ──────────────────────────────
#> 1 raw 20 4 0 tap
#> 2 complete_only 12 4 0 filter (dropped 8 rows, 40%)
#> 3 high_value 4 4 0 filter (dropped 8 rows, 66.7%)
#>
#> Changes:
#> raw → complete_only: -8 rows, = cols, = NAs
#> complete_only → high_value: -8 rows, = cols, = NAs
#>
#> ── Detailed Diffs ──────────────────────────────────────────────────────────────
#>
#> ── Audit Diff: "raw" → "complete_only" ──
#>
#> Metric Before After Delta
#> ────── ────── ───── ─────
#> Rows 20 12 -8
#> Cols 4 4 =
#> NAs 0 0 =
#>
#> ℹ No columns added or removed
#>
#> Numeric shifts (common columns):
#> Column Mean before Mean after Shift
#> ────── ─────────── ────────── ──────
#> id 10.50 10.50 0
#> amount 173.25 94.58 -78.67
#>
#> ── Audit Diff: "complete_only" → "high_value" ──
#>
#> Metric Before After Delta
#> ────── ────── ───── ─────
#> Rows 12 4 -8
#> Cols 4 4 =
#> NAs 0 0 =
#>
#> ℹ No columns added or removed
#>
#> Numeric shifts (common columns):
#> Column Mean before Mean after Shift
#> ────── ─────────── ────────── ──────
#> id 10.50 8.5 -2
#> amount 94.58 145.0 +50.42
#>
#> ── Final Snapshot Profile ──────────────────────────────────────────────────────
#>
#> high_value (4 rows x 4 cols)
#> Column types: 2 character, 1 integer, 1 numeric
#> ✔ No missing values
#>
#> Numeric summary:
#> Column Min Mean Median Max
#> ────── ─── ───── ────── ───
#> id 1 8.5 8.5 16
#> amount 120 145.0 140.0 180
#>
#> ────────────────────────────────────────────────────────────────────────────────

Pass a named list of functions via .fns to compute
custom diagnostics at any tap:
trail4 <- audit_trail("custom_example")

result4 <- orders |>
  audit_tap(trail4, "raw", .fns = list(
    mean_amount = ~mean(.x$amount),
    n_customers = ~length(unique(.x$customer))
  ))
audit_report(trail4)
#> ── Audit Report: "custom_example" ──────────────────────────────────────────────
#> Created: 2026-02-22 09:37:06
#> Total snapshots: 1
#>
#> ── Audit Trail: "custom_example" ───────────────────────────────────────────────
#> Created: 2026-02-22 09:37:06
#> Snapshots: 1
#>
#> # Label Rows Cols NAs Type
#> ─ ───── ──── ──── ─── ────
#> 1 raw 20 4 0 tap
#>
#> ── Custom Diagnostics ──────────────────────────────────────────────────────────
#>
#> raw:
#> mean_amount: 173.25
#> n_customers: 5
#>
#> ── Final Snapshot Profile ──────────────────────────────────────────────────────
#>
#> raw (20 rows x 4 cols)
#> Column types: 2 character, 1 integer, 1 numeric
#> ✔ No missing values
#>
#> Numeric summary:
#> Column Min Mean Median Max
#> ────── ─── ────── ────── ───
#> id 1 10.50 10.5 20
#> amount 45 173.25 140.0 400
#>
#> ────────────────────────────────────────────────────────────────────────────────

All tap functions work without a trail. When
.trail = NULL (the default), a tap called with no diagnostic arguments
acts as a plain filter. With .stat or
.warn_threshold, it runs diagnostics and prints
results without recording to a trail.

# Plain filter, no diagnostics
orders |> filter_tap(amount > 100) |> nrow()
#> [1] 12
# Diagnostics without a trail
orders |> filter_tap(amount > 100, .stat = amount) |> invisible()
#> filter_keep(.data, amount > 100)
#> Dropped 8 of 20 rows (40.00%).
#> Dropped 555 of 3,465 for amount (16.02%).

This makes it easy to add quick diagnostics to any pipeline without setting up a full trail.
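The pattern behind a trail-free tap is simple: report something as a side effect, then hand the data on. Here is a minimal base R sketch of that pattern. The `filter_report()` function is a hypothetical stand-in written for this example, not a tidyaudit function, and its message format is made up.

```r
# Hypothetical helper: report how many rows a condition drops,
# then return the kept rows so the pipeline can continue.
filter_report <- function(data, keep) {
  dropped <- sum(!keep)
  message(sprintf("Dropped %d of %d rows (%.1f%%)",
                  dropped, nrow(data), 100 * dropped / nrow(data)))
  data[keep, , drop = FALSE]
}

df <- data.frame(id = 1:5, amount = c(150, 200, 50, 300, 75))

kept <- filter_report(df, df$amount > 100)
#> Dropped 2 of 5 rows (40.0%)

nrow(kept)
#> [1] 3
```

Since the function returns the filtered data frame, it can sit in the middle of a pipe just like the verb it wraps.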