panelbuild provides tools for auditing, validating, and
preparing panel datasets before statistical analysis.
Panel datasets often contain duplicate unit-time observations, missing time periods, irregular gaps, and imbalance. These issues can affect fixed effects models, difference-in-differences designs, event studies, and other panel-data methods.
The goal of panelbuild is to help users identify these
issues before estimation.
panelbuild includes a small example dataset called
example_panel.
data(example_panel)
example_panel
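#>   id year outcome treatment
#> 1  1 2020      10         0
#> 2  1 2021      12         1
#> 3  1 2021      13         1
#> 4  2 2020      20         0
#> 5  2 2022      25         1
#> 6  3 2020      30         0
#> 7  3 2021      31         0
#> 8  3 2022      32         1
#> 9  3 2023      33         1
The dataset intentionally includes a duplicate unit-time observation (id 1 appears twice in 2021), a gap within a unit's time series (id 2 has no row for 2021), and units observed over different sets of years, so the panel is unbalanced.
This makes it useful for demonstrating panel-data diagnostics.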
The main function is audit_panel().
audit_panel(example_panel, id = id, time = year)
#> Panel audit
#>
#> Data: example_panel
#> Unit variable: id
#> Time variable: year
#>
#> Units: 3
#> Time periods: 4
#> Observed rows: 9
#> Observed id-time cells: 8
#> Expected id-time cells: 12
#> Missing id-time cells: 4
#> Duplicate id-time cells: 1
#> Balanced panel: No
This gives a quick overview of the panel structure, including whether the panel is balanced and whether there are missing or duplicate unit-time cells.
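
The counts in the audit can be reproduced by hand, which helps when interpreting them. The following base R sketch is not panelbuild's implementation: it retypes the example data from the printed output above (under the hypothetical name `panel_df`) and assumes that "balanced" means every unit is observed in every period that appears in the data.

```r
# Hand-typed copy of the printed example data (not loaded from the package).
panel_df <- data.frame(
  id        = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  year      = c(2020, 2021, 2021, 2020, 2022, 2020, 2021, 2022, 2023),
  outcome   = c(10, 12, 13, 20, 25, 30, 31, 32, 33),
  treatment = c(0, 1, 1, 0, 1, 0, 0, 1, 1)
)

n_units   <- length(unique(panel_df$id))
n_periods <- length(unique(panel_df$year))

# One row per distinct id-year combination.
cells          <- unique(panel_df[, c("id", "year")])
observed_cells <- nrow(cells)
expected_cells <- n_units * n_periods
missing_cells  <- expected_cells - observed_cells

# Number of id-year cells that hold more than one row.
rows_per_cell   <- table(paste(panel_df$id, panel_df$year, sep = "_"))
duplicate_cells <- sum(rows_per_cell > 1)

# Assumed definition of balance: no id-year cell is missing.
is_balanced <- observed_cells == expected_cells

c(units = n_units, periods = n_periods, observed = observed_cells,
  expected = expected_cells, missing = missing_cells,
  duplicates = duplicate_cells)
```

With three units and four distinct years, the full grid has 12 cells; 8 are observed, which leaves the 4 missing cells reported by the audit.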
Duplicate unit-time observations are a common problem in panel datasets.
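
As an illustration of what a duplicate check involves, the base R sketch below lists every row that shares its id-year cell with another row. It is a generic approach, not a panelbuild function, and `panel_df` and `n_in_cell` are hypothetical names; the data are retyped from the printed example.

```r
# Hand-typed copy of the printed example data (not loaded from the package).
panel_df <- data.frame(
  id        = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  year      = c(2020, 2021, 2021, 2020, 2022, 2020, 2021, 2022, 2023),
  outcome   = c(10, 12, 13, 20, 25, 30, 31, 32, 33),
  treatment = c(0, 1, 1, 0, 1, 0, 0, 1, 1)
)

# Count how many rows share each id-year cell; ave() returns one value per row.
panel_df$n_in_cell <- ave(
  seq_len(nrow(panel_df)), panel_df$id, panel_df$year,
  FUN = length
)

# Keep every row that belongs to a cell with more than one observation.
duplicates <- panel_df[panel_df$n_in_cell > 1, ]
duplicates
```

Listing all rows in a duplicated cell, rather than only the second and later ones, makes it easier to see whether the duplicates agree (a repeated record) or conflict (here, outcomes 12 and 13).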
gap_summary() identifies missing time periods by panel
unit.
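
To show the kind of information a per-unit gap summary contains, here is a base R sketch that lists, for each unit, the years with no row. It does not call gap_summary() and makes no claim about that function's output format. It assumes a period counts as missing when it appears anywhere in the data but not for the unit in question; `panel_df` and `missing_by_unit` are hypothetical names.

```r
# Hand-typed copy of the printed example data (not loaded from the package).
panel_df <- data.frame(
  id        = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  year      = c(2020, 2021, 2021, 2020, 2022, 2020, 2021, 2022, 2023),
  outcome   = c(10, 12, 13, 20, 25, 30, 31, 32, 33),
  treatment = c(0, 1, 1, 0, 1, 0, 0, 1, 1)
)

# Reference set of periods: every year observed for at least one unit.
all_years <- sort(unique(panel_df$year))

# For each unit, the reference years that have no row for that unit.
missing_by_unit <- lapply(
  split(panel_df$year, panel_df$id),
  function(years) setdiff(all_years, years)
)
missing_by_unit
```

Under this definition, id 1 lacks 2022 and 2023, id 2 lacks 2021 and 2023, and id 3 is complete. A stricter definition that counts only gaps between a unit's first and last observed year would flag just 2021 for id 2.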
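flag_panel_issues() adds diagnostic columns to the data without dropping any rows. As the output below shows, these include a row identifier (panelbuild_row_id), the number of rows in each id-time cell (panelbuild_id_time_n), and a logical duplicate indicator (panelbuild_duplicate_cell).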
flag_panel_issues(example_panel, id = id, time = year)
#> # A tibble: 9 × 7
#>      id  year outcome treatment panelbuild_row_id panelbuild_id_time_n
#>   <dbl> <dbl>   <dbl>     <dbl>             <int>                <int>
#> 1     1  2020      10         0                 1                    1
#> 2     1  2021      12         1                 2                    2
#> 3     1  2021      13         1                 3                    2
#> 4     2  2020      20         0                 4                    1
#> 5     2  2022      25         1                 5                    1
#> 6     3  2020      30         0                 6                    1
#> 7     3  2021      31         0                 7                    1
#> 8     3  2022      32         1                 8                    1
#> 9     3  2023      33         1                 9                    1
#> # ℹ 1 more variable: panelbuild_duplicate_cell <lgl>
complete_panel() creates a complete unit-time grid. It does not impute missing outcome values.
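
To make "complete grid without imputation" concrete, the base R sketch below builds every id-year combination and left-joins the observed rows onto it, so cells with no data carry NA rather than a filled-in value. This is a generic technique, not panelbuild's implementation. The names `panel_df`, `panel_unique`, `grid`, `completed`, and `observed` are hypothetical, and the sketch keeps only the first row per id-year cell so that each cell is unique before joining.

```r
# Hand-typed copy of the printed example data (not loaded from the package).
panel_df <- data.frame(
  id        = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  year      = c(2020, 2021, 2021, 2020, 2022, 2020, 2021, 2022, 2023),
  outcome   = c(10, 12, 13, 20, 25, 30, 31, 32, 33),
  treatment = c(0, 1, 1, 0, 1, 0, 0, 1, 1)
)

# Keep the first row in each id-year cell so every cell is unique.
panel_unique <- panel_df[!duplicated(panel_df[, c("id", "year")]), ]
panel_unique$observed <- TRUE

# Every combination of observed unit and observed year.
grid <- expand.grid(
  id   = sort(unique(panel_unique$id)),
  year = sort(unique(panel_unique$year))
)

# Left join: grid cells without a matching row get NA in all data columns.
completed <- merge(grid, panel_unique, by = c("id", "year"), all.x = TRUE)
completed <- completed[order(completed$id, completed$year), ]

# Distinguish rows added by the grid from rows that were in the data.
completed$observed <- !is.na(completed$observed)
completed
```

Tracking an explicit `observed` indicator, instead of inferring it from NA outcomes, keeps genuinely missing measurements distinguishable from cells that were never recorded at all.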
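Because complete_panel() requires unique unit-time cells, we first remove duplicate id-time observations from the example dataset. dplyr::distinct() keeps the first row in each id-year cell, so the second 2021 row for id 1 (outcome 13) is dropped. In real data, decide deliberately which duplicate to keep rather than relying on row order.
example_panel_unique <- example_panel |>
  dplyr::distinct(id, year, .keep_all = TRUE)
complete_panel(example_panel_unique, id = id, time = year)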
#> # A tibble: 12 × 7
#>       id  year outcome treatment panelbuild_original_row panelbuild_completed_…¹
#>    <dbl> <dbl>   <dbl>     <dbl> <lgl>                   <lgl>
#>  1     1  2020      10         0 TRUE                    FALSE
#>  2     1  2021      12         1 TRUE                    FALSE
#>  3     1  2022      NA        NA FALSE                   TRUE
#>  4     1  2023      NA        NA FALSE                   TRUE
#>  5     2  2020      20         0 TRUE                    FALSE
#>  6     2  2021      NA        NA FALSE                   TRUE
#>  7     2  2022      25         1 TRUE                    FALSE
#>  8     2  2023      NA        NA FALSE                   TRUE
#>  9     3  2020      30         0 TRUE                    FALSE
#> 10     3  2021      31         0 TRUE                    FALSE
#> 11     3  2022      32         1 TRUE                    FALSE
#> 12     3  2023      33         1 TRUE                    FALSE
#> # ℹ abbreviated name: ¹panelbuild_completed_cell
#> # ℹ 1 more variable: panelbuild_audit_action <chr>
A typical panelbuild workflow is to audit the panel with audit_panel(), review duplicate cells and gaps (gap_summary()), flag problem rows with flag_panel_issues(), resolve any duplicates, and then build the complete unit-time grid with complete_panel().
panelbuild is designed to provide a transparent and
reproducible workflow for panel-data quality assurance.
Use it before fitting panel models, difference-in-differences designs, event studies, or other longitudinal-data analyses.