This vignette is a decision guide for choosing and checking weights
in cat2cat().
Read it when you want to answer one of these questions:
If you only need the basic two-period workflow, go back to Get Started. If you need multi-period, panel, aggregated, or regression workflows, continue to Advanced Workflows.
library(cat2cat)
library(dplyr)
library(tidyr)
library(e1071)
library(randomForest)
data(occup, package = "cat2cat")
data(trans, package = "cat2cat")
occup_2008 <- occup[occup$year == 2008, ]
occup_2010 <- occup[occup$year == 2010, ]
occup_2012 <- occup[occup$year == 2012, ]
cat2cat offers several ways to assign probability
weights to replicated observations. Each method encodes a different
distributional assumption about how ambiguous
observations split across candidate categories. When a downstream
estimand depends on the mapped category, this is the identifying
assumption for that estimand, so always check sensitivity.
Naive weights (wei_naive_c2c) are
always computed. Each replicated observation gets uniform probability
\(1/k\) where \(k\) is the number of candidate
categories.
Frequency-based weights (wei_freq_c2c)
are the default. They use category counts from the base period.
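As a minimal sketch of the difference between the two baseline weights, the base R snippet below builds both for one observation with three candidate categories. The counts are made up, and the snippet assumes that frequency weights are simply base-period counts normalised within the candidate set; it does not call the package itself.

```r
# Toy example: one old-period observation maps to three candidate categories.
candidates <- c("A", "B", "C")

# Hypothetical base-period counts for those candidates.
base_counts <- c(A = 60, B = 30, C = 10)

# Naive weights: uniform 1/k over the k candidates.
k <- length(candidates)
wei_naive <- setNames(rep(1 / k, k), candidates)

# Frequency weights (assumed form): counts normalised to sum to 1.
wei_freq <- base_counts[candidates] / sum(base_counts[candidates])

print(round(wei_naive, 3))
print(wei_freq)
```

Both vectors sum to 1, so each observation still contributes one unit in total; only the split across candidates changes.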
ML weights (wei_knn_c2c,
wei_lda_c2c, wei_rf_c2c,
wei_nb_c2c) use individual features to predict category
membership.
cat2cat_ml_run()
Available ML methods:
- knn: k-nearest neighbours; the main tuning parameter is k.
- rf: random forest; may need ntree tuning.
- lda: linear discriminant analysis.
- nb: naive Bayes via e1071. Fast, useful after numeric/logical/factor preprocessing. Assumes conditional independence of features.

ML features must be numeric, logical, or factor columns. Factor columns are one-hot encoded automatically using levels observed in the training data and the target period. Character columns are not encoded automatically; convert them to factors first if they represent categories.
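One way to prepare character features is sketched below in base R. The data frame and its column names are hypothetical; the only point is the conversion of every character column to a factor before it is listed in features.

```r
# Hypothetical data frame with a character column that encodes a category.
df <- data.frame(
  age = c(34, 51, 28),
  region = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor before using it as a feature.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

str(df)
```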
You can run multiple methods at once and compare or combine them:
occup_2_mix <- cat2cat(
  data = list(
    old = occup_2008, new = occup_2010,
    cat_var = "code", time_var = "year"
  ),
  mappings = list(trans = trans, direction = "backward"),
  ml = list(
    data = occup_2010,
    cat_var = "code",
    method = c("knn", "rf", "lda", "nb"),
    features = c("age", "sex", "edu", "exp", "parttime", "salary"),
    args = list(k = 10, ntree = 50),
    on_fail = "na"
  )
)
Correlations between weight methods:
occup_2_mix$old %>%
select(wei_knn_c2c, wei_rf_c2c, wei_lda_c2c, wei_nb_c2c, wei_freq_c2c, wei_naive_c2c) %>%
cor(use = "pairwise.complete.obs")
#> wei_knn_c2c wei_rf_c2c wei_lda_c2c wei_nb_c2c wei_freq_c2c
#> wei_knn_c2c 1.0000000 0.8655665 0.8327864 0.6196138 0.8989887
#> wei_rf_c2c 0.8655665 1.0000000 0.8825387 0.6519528 0.8755385
#> wei_lda_c2c 0.8327864 0.8825387 1.0000000 0.6592195 0.8667809
#> wei_nb_c2c 0.6196138 0.6519528 0.6592195 1.0000000 0.6107475
#> wei_freq_c2c 0.8989887 0.8755385 0.8667809 0.6107475 1.0000000
#> wei_naive_c2c 0.4908619 0.4754159 0.4811839 0.5594270 0.5449029
#> wei_naive_c2c
#> wei_knn_c2c 0.4908619
#> wei_rf_c2c 0.4754159
#> wei_lda_c2c 0.4811839
#> wei_nb_c2c 0.5594270
#> wei_freq_c2c 0.5449029
#> wei_naive_c2c 1.0000000
on_fail and fail_warn
Sometimes ML probabilities cannot be produced for a subset of
replicated rows (for example incomplete target features or
method-specific prediction failures). cat2cat() exposes
explicit policy controls in ml:
- on_fail = "freq" (default): failed ML rows are filled with wei_freq_c2c.
- on_fail = "naive": failed ML rows are filled with wei_naive_c2c.
- on_fail = "na": failed ML rows are kept as NA.
- on_fail = "error": stop immediately when failed rows are detected.
- fail_warn = TRUE (default): warn with affected rows/observations per method.
- fail_warn = FALSE: suppress these warnings.

Important: this failure accounting is specific to
cat2cat() and the constructed weight columns
(wei_*_c2c). It is different from
cat2cat_ml_run() “SKIPPED GROUPS”, which reports mapping
groups that were not evaluated in holdout diagnostics (single category,
too few observations, or method fit/predict error for that group).
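The four on_fail policies can be illustrated with a row-level sketch in base R. The helper apply_on_fail() is hypothetical and is not part of cat2cat; it only mirrors the documented behaviour on toy vectors and ignores anything the package may do beyond filling failed rows.

```r
# Toy weights for four replicated rows; NA marks rows where ML failed.
wei_ml    <- c(0.7, NA, 0.2, NA)
wei_freq  <- c(0.6, 0.4, 0.3, 0.7)
wei_naive <- c(0.5, 0.5, 0.5, 0.5)

# Hypothetical helper mirroring the four documented policies.
apply_on_fail <- function(ml, freq, naive, on_fail = "freq") {
  failed <- is.na(ml)
  switch(on_fail,
    freq  = ifelse(failed, freq, ml),
    naive = ifelse(failed, naive, ml),
    na    = ml,
    error = {
      if (any(failed)) stop(sum(failed), " replicated rows have no ML weight")
      ml
    },
    stop("unknown on_fail policy: ", on_fail)
  )
}

apply_on_fail(wei_ml, wei_freq, wei_naive, "freq")
apply_on_fail(wei_ml, wei_freq, wei_naive, "naive")
```

Keeping "na" available is useful for diagnosis, because the failed rows stay visible instead of being blended into a fallback.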
ml_setup <- list(
  data = bind_rows(occup_2010, occup_2012),
  cat_var = "code",
  method = c("knn", "rf", "lda"),
  features = c("age", "sex", "edu", "exp", "parttime", "salary"),
  args = list(k = 10, ntree = 50),
  on_fail = "freq", # default policy
  fail_warn = TRUE # default reporting
)
# strict mode for QA pipelines
ml_strict <- ml_setup
ml_strict$on_fail <- "error"
# diagnostic mode to inspect failures directly
ml_diag <- ml_setup
ml_diag$on_fail <- "na"
ml_diag$fail_warn <- FALSE
Ensemble weights with cross_c2c() and pruning with prune_c2c():
Different weight methods affect regression coefficients when you filter to a specific occupation group and combine both periods. This is the proper sensitivity analysis: subjects from the base period (new, no replication) plus subjects from the target period (old, weighted by probability of belonging to this group).
Run backward mapping with all ML methods:
result_all <- cat2cat(
  data = list(old = occup_2008, new = occup_2010,
              cat_var = "code", time_var = "year"),
  mappings = list(trans = trans, direction = "backward"),
  ml = list(
    data = occup_2010, cat_var = "code",
    method = c("knn", "rf", "lda", "nb"),
    features = c("age", "sex", "edu", "exp", "parttime", "salary"),
    args = list(k = 10, ntree = 50)
  )
)
Weighted counts per group - compare how weight methods redistribute observations:
weight_cols <- c("wei_naive_c2c", "wei_freq_c2c", "wei_knn_c2c", "wei_rf_c2c", "wei_lda_c2c", "wei_nb_c2c")
# Pick groups with high replication
top_groups <- result_all$old %>%
  filter(rep_c2c > 1) %>%
  count(g_new_c2c, sort = TRUE) %>%
  head(6) %>%
  pull(g_new_c2c)
# Weighted counts from OLD period (replicated)
old_counts <- lapply(weight_cols, function(wcol) {
  result_all$old %>%
    filter(g_new_c2c %in% top_groups) %>%
    group_by(g_new_c2c) %>%
    summarise(n = sum(.data[[wcol]]), .groups = "drop")
}) %>%
  setNames(gsub("wei_|_c2c", "", weight_cols)) %>%
  bind_rows(.id = "method") %>%
  tidyr::pivot_wider(names_from = method, values_from = n)
# Counts from NEW period (no replication, exact)
new_counts <- result_all$new %>%
  filter(code %in% top_groups) %>%
  count(code, name = "new_period") %>%
  rename(g_new_c2c = code)
# Combine for comparison
left_join(old_counts, new_counts, by = "g_new_c2c")
#> # A tibble: 6 × 8
#> g_new_c2c naive freq knn rf lda nb new_period
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 232002 23.1 21.9 29.2 21.4 29.7 21.9 30
#> 2 232003 23.1 19.7 23.9 19.3 25.1 19.7 27
#> 3 232004 23.1 5.10 5 4.66 8.05 5.10 7
#> 4 232005 23.1 3.65 5.1 4.4 7.49 3.65 5
#> 5 232006 23.1 16.8 15.2 14.0 18.5 16.8 23
#> 6 232007 23.1 2.92 2.5 1.9 3.74 2.92 4
The new_period column shows the actual counts in 2010.
The other columns show how the 2008 observations are redistributed under
each weight method. naive assigns uniform probability (\(1/k\) for \(k\)
candidates), freq uses base period frequencies, and ML
methods (knn, rf, lda,
nb) use predicted probabilities.
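To make the idea of a weighted count concrete, here is a small base R sketch on made-up replicated rows. The column names follow the ones used in this vignette, but the data are hypothetical and the snippet does not depend on the package.

```r
# Toy replicated old-period data: two observations, each split over candidates.
old <- data.frame(
  id        = c(1, 1, 2, 2, 2),
  g_new_c2c = c("A", "B", "A", "B", "C"),
  wei_naive = c(1 / 2, 1 / 2, 1 / 3, 1 / 3, 1 / 3),
  wei_freq  = c(0.8, 0.2, 0.6, 0.3, 0.1)
)

# Weighted count per category = sum of the weights assigned to it.
counts <- sapply(c("wei_naive", "wei_freq"), function(w) {
  tapply(old[[w]], old$g_new_c2c, sum)
})
print(round(counts, 3))

# Each method redistributes the same total: two observations.
colSums(counts)
```

The column totals are identical across methods; what differs is how much of each observation lands in each category.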
Pick a specific group for regression analysis:
# New-period counts per category (no replication, so plain tally)
new_counts_all <- result_all$new %>%
  count(code, name = "n_new") %>%
  rename(g_new_c2c = code)
# Old-period weighted counts, joined to new-period counts
group_sizes <- result_all$old %>%
  group_by(g_new_c2c) %>%
  summarise(n_old = sum(wei_freq_c2c), .groups = "drop") %>%
  left_join(new_counts_all, by = "g_new_c2c") %>%
  filter(n_old >= 10, n_new >= 10) %>%
  arrange(desc(n_old))
# Pick a group for regression analysis
target_group <- group_sizes$g_new_c2c[1]
cat("Analysing occupation group:", target_group, "\n")
#> Analysing occupation group: 222101
Regression within a single occupation group - combine both periods and compare coefficients:
# Subset old period to target group (with weights)
old_subset <- result_all$old %>%
  filter(g_new_c2c == target_group)
# Subset new period to target group (no replication, weight = 1)
new_subset <- result_all$new %>%
  filter(code == target_group) %>%
  mutate(
    wei_naive_c2c = 1, wei_freq_c2c = 1, wei_knn_c2c = 1,
    wei_rf_c2c = 1, wei_lda_c2c = 1, wei_nb_c2c = 1
  )
# Combine both periods
d <- bind_rows(old_subset, new_subset)
# Compare all regression coefficients across weight methods
f <- I(log(salary)) ~ age + sex + factor(edu) + exp + parttime
coefs <- sapply(weight_cols, function(wcol) {
  d$w <- d$multiplier * d[[wcol]]
  coef(lm(f, data = d, weights = w))
})
colnames(coefs) <- gsub("wei_|_c2c", "", weight_cols)
round(coefs, 4)
#> naive freq knn rf lda nb
#> (Intercept) 9.3225 9.2194 9.2118 9.2085 9.2229 9.2194
#> age -0.0090 -0.0083 -0.0081 -0.0079 -0.0085 -0.0083
#> sexTRUE -0.0042 -0.1153 -0.1421 -0.1323 -0.1253 -0.1153
#> factor(edu)2 -0.1317 -0.1131 -0.1157 -0.1119 -0.1108 -0.1131
#> factor(edu)3 -0.1036 -0.1065 -0.1090 -0.1055 -0.1033 -0.1065
#> factor(edu)4 -0.1333 -0.1450 -0.1472 -0.1451 -0.1428 -0.1450
#> factor(edu)5 -0.1884 -0.1370 -0.1326 -0.1396 -0.1303 -0.1370
#> exp 0.0138 0.0131 0.0128 0.0128 0.0132 0.0131
#> parttime 1.3797 1.4348 1.4411 1.4311 1.4343 1.4348
All coefficients can vary because weight methods change which old-period subjects contribute to this occupation group.
Note: Pruning discards probability information and should be used only after analysis with full weights. Prefer prune_c2c(method = "nonzero") to remove impossible candidates while preserving the probability distribution. More aggressive pruning (highest1) is appropriate only for descriptive tables or when you need exactly one category per observation.
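The contrast between the two pruning styles can be sketched on toy rows in base R. This is an illustration of the idea only, not the package's implementation: it assumes that "nonzero" drops zero-weight rows, and that "highest1" keeps one row per observation with weight 1. The tie-breaking rule (first row wins) is an assumption of this sketch.

```r
# Toy replicated rows for two observations (ids) with probability weights.
old <- data.frame(
  id        = c(1, 1, 1, 2, 2),
  g_new_c2c = c("A", "B", "C", "A", "B"),
  wei       = c(0.7, 0.3, 0.0, 0.5, 0.5)
)

# "nonzero"-style pruning: drop impossible candidates, keep the distribution.
nonzero <- old[old$wei > 0, ]

# "highest1"-style pruning: keep exactly one row per id (first row on ties)
# and give it weight 1, which discards the remaining probability information.
highest1 <- do.call(rbind, lapply(split(old, old$id), function(g) {
  best <- g[which.max(g$wei), ]
  best$wei <- 1
  best
}))

print(nonzero)
print(highest1)
```

For observation 2 the two candidates are equally likely, yet the aggressive version assigns it entirely to one of them, which is why it suits descriptive tables better than estimation.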
# Compare regression coefficients under different pruning strategies
prune_methods <- c("nonzero", "highest", "highest1")
prune_coefs <- sapply(prune_methods, function(pm) {
  old_pruned <- result_all$old %>%
    prune_c2c(method = pm) %>%
    filter(g_new_c2c == target_group)
  d <- bind_rows(old_pruned, new_subset)
  d$w <- d$multiplier * d$wei_freq_c2c
  coef(lm(f, data = d, weights = w))
})
round(prune_coefs, 4)
#> nonzero highest highest1
#> (Intercept) 9.2194 9.2143 9.2143
#> age -0.0083 -0.0083 -0.0083
#> sexTRUE -0.1153 -0.1200 -0.1200
#> factor(edu)2 -0.1131 -0.1122 -0.1122
#> factor(edu)3 -0.1065 -0.1068 -0.1068
#> factor(edu)4 -0.1450 -0.1454 -0.1454
#> factor(edu)5 -0.1370 -0.1337 -0.1337
#> exp 0.0131 0.0131 0.0131
#> parttime 1.4348 1.4384 1.4384
cross_c2c() creates a weighted average of multiple
weight columns. Vary the mix:
configs <- list(
  equal = c(1, 1) / 2,
  freq_heavy = c(3, 1) / 4,
  ml_heavy = c(1, 3) / 4
)
ens_coefs <- sapply(names(configs), function(nm) {
  old_ens <- result_all$old %>%
    cross_c2c(c("wei_freq_c2c", "wei_knn_c2c"), configs[[nm]]) %>%
    filter(g_new_c2c == target_group)
  new_ens <- new_subset %>% mutate(wei_cross_c2c = 1)
  d <- bind_rows(old_ens, new_ens)
  d$w <- d$multiplier * d$wei_cross_c2c
  coef(lm(f, data = d, weights = w))
})
round(ens_coefs, 4)
#> equal freq_heavy ml_heavy
#> (Intercept) 9.2155 9.2175 9.2136
#> age -0.0082 -0.0083 -0.0082
#> sexTRUE -0.1287 -0.1220 -0.1354
#> factor(edu)2 -0.1144 -0.1138 -0.1151
#> factor(edu)3 -0.1078 -0.1072 -0.1084
#> factor(edu)4 -0.1462 -0.1456 -0.1467
#> factor(edu)5 -0.1348 -0.1359 -0.1337
#> exp 0.0130 0.0131 0.0129
#> parttime 1.4381 1.4365 1.4396
When regression coefficients are stable across weight methods, pruning strategies, and ensemble compositions, report with confidence. When they diverge, the mapping introduces uncertainty - report the range or investigate the source.
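For intuition about the ensemble mixes used above, the weighted average of two weight columns can be sketched in base R. mix_weights() is a hypothetical helper written for this illustration, and the weights are toy values for a single observation.

```r
# Toy weights for one observation split over three candidate categories.
wei_freq <- c(0.6, 0.3, 0.1)
wei_knn  <- c(0.2, 0.5, 0.3)

# Hypothetical helper: convex combination of weight columns.
mix_weights <- function(cols, mix) {
  stopifnot(length(cols) == length(mix), abs(sum(mix) - 1) < 1e-8)
  Reduce(`+`, Map(`*`, cols, mix))
}

equal      <- mix_weights(list(wei_freq, wei_knn), c(1, 1) / 2)
freq_heavy <- mix_weights(list(wei_freq, wei_knn), c(3, 1) / 4)

print(equal)       # approximately 0.4, 0.4, 0.2
print(freq_heavy)  # approximately 0.5, 0.35, 0.15
```

Because the mixing proportions sum to 1 and each input sums to 1 within the observation, the combined weights also sum to 1.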
The ml argument in cat2cat() adds ML-based
probability weights, but ML is not guaranteed to improve over simpler
baselines. cat2cat_ml_run() provides per-group holdout
(single train/test split) diagnostics that show whether ML beats those
baselines before you commit to a method.
What cat2cat_ml_run() is doing
For each mapping group (set of candidate categories linked by the
transition table) cat2cat_ml_run():
- Selects the rows of ml$data whose category belongs to the group.
- Splits them into train (1 - test_prop) and test (test_prop) sets.

Groups with fewer than 5 observations or only one candidate category
are skipped. Also note that cat2cat_ml_run() does not use
on_fail; it is a diagnostic tool and reports skipped groups
instead of applying row-level fallback weights.
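The per-group split and the skip rules can be sketched in base R as follows. split_group() is a hypothetical helper, the data are made up, and the value test_prop = 0.2 is an assumption of this sketch rather than a documented default.

```r
set.seed(1)

# Toy stand-in for ml$data: a category column plus one feature.
ml_data <- data.frame(
  code = rep(c("A", "B", "C", "Z"), each = 10),
  age  = rep(seq(25, 52, by = 3), times = 4)
)

# Hypothetical helper: split the rows of one mapping group into train/test.
split_group <- function(data, group_codes, test_prop = 0.2, min_n = 5) {
  rows <- data[data$code %in% group_codes, , drop = FALSE]
  # Skip groups that are too small or have a single observed category.
  if (nrow(rows) < min_n || length(unique(rows$code)) < 2) return(NULL)
  n_test <- max(1, round(test_prop * nrow(rows)))
  test_idx <- sample(nrow(rows), n_test)
  list(train = rows[-test_idx, , drop = FALSE],
       test  = rows[test_idx, , drop = FALSE])
}

parts <- split_group(ml_data, c("A", "B", "C"))
sapply(parts, nrow)

# A single-category group is skipped.
is.null(split_group(ml_data, "Z"))
```

Skipped groups return nothing here, which matches the idea that the diagnostic reports them separately instead of substituting fallback weights.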
cv_knn <- cat2cat_ml_run(
  mappings = list(trans = trans, direction = "backward"),
  ml = list(
    data = bind_rows(occup_2010, occup_2012),
    cat_var = "code",
    method = "knn",
    features = c("age", "sex", "edu", "exp", "parttime", "salary"),
    args = list(k = 10)
  )
)
print(cv_knn)
#> === cat2cat ML Cross-Validation Results ===
#>
#> ACCURACY (higher is better):
#> naive (1/k): 0.1805
#> freq (most common): 0.5324
#> knn: accuracy = 0.5228
#>
#> BRIER SCORE (lower is better, range 0-1):
#> naive: 0.4097
#> freq: 0.3026
#> knn: brier = 0.3249
#>
#> MEAN P(TRUE CLASS) (higher is better):
#> naive: 0.1805
#> freq: 0.4191
#> knn: mean P(true) = 0.4492
#>
#> ACCURACY: ML vs BASELINES (percent of groups where ML wins):
#> knn > naive: 89.4%
#> knn > freq: 24.9%
#>
#> SKIPPED GROUPS (single category or <5 observations):
#> knn: 32.6%
The print() summary reports:
- Accuracy: naive (1/k) is the random-guess baseline, freq is the majority-class baseline, and each ML line reports top-class accuracy for that method.
- Brier score: relevant because cat2cat ultimately uses probability weights, not just hard classifications.
- Mean P(true class): relevant for cat2cat, because it measures the quality of the probability weights themselves.
- ML vs baselines: the percent of groups where the method beats naive or beats freq on accuracy. This is a win-rate summary, not an average accuracy gap.

So for output like:
- knn > naive: 87.7%
- knn > freq: 18.0%
- knn: accuracy = 0.5108 vs freq (most common): 0.5366

the right reading is: kNN clearly beats the naive baseline, but it
does not beat the frequency baseline on top-class
accuracy overall. In that case, wei_freq_c2c remains the
default choice if your only goal is classification accuracy.
At the same time, if kNN has a slightly lower Brier score and a
higher mean P(true class) than freq, then it may still be
producing better-calibrated probability weights even though its top
prediction is less often correct. That distinction matters in
cat2cat, because the mapped weights are probabilities
distributed across candidate categories rather than single-class
assignments.
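The three metrics can be computed by hand on a toy holdout to see how they differ. The sketch below uses base R and made-up probabilities. It assumes a Brier score defined as half the summed squared error per row, which keeps it in the 0-1 range shown in the printed summary; the package's exact formula is not confirmed here.

```r
# Toy holdout for one mapping group with three candidate categories.
classes <- c("A", "B", "C")
truth   <- c("A", "A", "B", "C")

# Predicted probabilities per test row (rows sum to 1).
prob <- matrix(
  c(0.5, 0.3, 0.2,
    0.4, 0.5, 0.1,
    0.3, 0.6, 0.1,
    0.2, 0.3, 0.5),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, classes)
)

# One-hot encoding of the true classes.
onehot <- outer(truth, classes, "==") * 1

# Top-class accuracy: is the most probable class the true one?
accuracy <- mean(classes[max.col(prob, ties.method = "first")] == truth)

# Brier score, assumed here to be half the squared error so it lies in [0, 1].
brier <- mean(rowSums((prob - onehot)^2) / 2)

# Mean probability assigned to the true class.
mean_p_true <- mean(prob[cbind(seq_along(truth), match(truth, classes))])

c(accuracy = accuracy, brier = brier, mean_p_true = mean_p_true)
```

Accuracy only looks at the top prediction, while the other two use the whole probability vector, which is the quantity cat2cat carries forward as weights.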
cv_all <- cat2cat_ml_run(
  mappings = list(trans = trans, direction = "backward"),
  ml = list(
    data = bind_rows(occup_2010, occup_2012),
    cat_var = "code",
    method = c("knn", "lda", "rf", "nb"),
    features = c("age", "sex", "edu", "exp", "parttime", "salary"),
    args = list(k = 10, ntree = 50)
  )
)
print(cv_all)
#> === cat2cat ML Cross-Validation Results ===
#>
#> ACCURACY (higher is better):
#> naive (1/k): 0.1805
#> freq (most common): 0.5402
#> knn: accuracy = 0.5392
#> lda: accuracy = 0.5453
#> rf: accuracy = 0.5428
#> nb: accuracy = 0.3977
#>
#> BRIER SCORE (lower is better, range 0-1):
#> naive: 0.4097
#> freq: 0.2923
#> knn: brier = 0.3105
#> lda: brier = 0.3250
#> rf: brier = 0.3030
#> nb: brier = 0.4542
#>
#> MEAN P(TRUE CLASS) (higher is better):
#> naive: 0.1805
#> freq: 0.4268
#> knn: mean P(true) = 0.4578
#> lda: mean P(true) = 0.4784
#> rf: mean P(true) = 0.4709
#> nb: mean P(true) = 0.4149
#>
#> ACCURACY: ML vs BASELINES (percent of groups where ML wins):
#> knn > naive: 90.8%
#> lda > naive: 91.5%
#> rf > naive: 90.8%
#> nb > naive: 78.8%
#> knn > freq: 21.8%
#> lda > freq: 34.1%
#> rf > freq: 29.2%
#> nb > freq: 17.8%
#>
#> SKIPPED GROUPS (single category or <5 observations):
#> knn: 33.6%
#> lda: 46.3%
#> rf: 33.8%
#> nb: 34.1%
Interpretation tip for mixed outputs:
- A method can show few or no failed rows in cat2cat() but a non-zero skipped-group rate in cat2cat_ml_run(), because the two counts measure different things.

The returned object is a named list. Each element corresponds to one mapping group:
# Pick a group with multiple candidates
group_names <- names(cv_all)
example_group <- group_names[
  which(vapply(cv_all, function(g) !is.na(g$freq) && g$naive < 1, logical(1)))[1]
]
cv_all[[example_group]]
#> $naive
#> [1] 0.3333333
#>
#> $acc
#> knn lda rf nb
#> 1 NA 1 1
#>
#> $freq
#> [1] 1
#>
#> $brier
#> knn lda rf nb
#> 0.0000 NA 0.0025 NA
#>
#> $mean_prob
#> knn lda rf nb
#> 1.000 NA 0.975 NA
#>
#> $naive_brier
#> [1] 0.3333333
#>
#> $naive_mean_prob
#> [1] 0.3333333
#>
#> $freq_brier
#> [1] 0.00390625
#>
#> $freq_mean_prob
#> [1] 0.9375
Each group entry contains the group-level diagnostics behind the printed summary:
- $naive: \(1/k\) random-guess accuracy for that group.
- $freq: majority-class accuracy for that group.
- $acc: named numeric vector with ML accuracy by method.
- $naive_brier and $freq_brier: baseline Brier scores.
- $brier: named numeric vector with ML Brier scores by method.
- $naive_mean_prob and $freq_mean_prob: baseline mean P(true class).
- $mean_prob: named numeric vector with ML mean P(true class) by method.

Understanding model performance in context: This is multi-class classification, where each mapping group can have 3-10+ candidate categories. A naive random guess yields only ~18% accuracy (1/k where k is the number of candidates). Achieving 50%+ is a substantial improvement over random. Do not compare these numbers to binary classification benchmarks where 80%+ is typical. The key question is whether ML beats the frequency baseline, not whether it reaches some absolute threshold.
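Because the returned object is a plain named list, a win-rate summary like the one in the printed output can be recomputed from the group entries. The sketch below uses a toy list that only borrows the field names $naive, $freq, and $acc; the values are made up. win_rate() is a hypothetical helper, and leaving out groups with NA accuracy is an assumption of this sketch.

```r
# Toy list with the same field names as the group entries described above
# (values are made up for illustration).
cv_toy <- list(
  g1 = list(naive = 0.50, freq = 0.70, acc = c(knn = 0.80, rf = 0.60)),
  g2 = list(naive = 0.33, freq = 0.60, acc = c(knn = 0.50, rf = 0.65)),
  g3 = list(naive = 0.25, freq = 0.40, acc = c(knn = NA,   rf = 0.40))
)

# Share of evaluated groups where a method's accuracy beats a baseline.
win_rate <- function(cv, method, baseline) {
  wins <- vapply(cv, function(g) g$acc[[method]] > g[[baseline]], logical(1))
  mean(wins, na.rm = TRUE)  # groups with NA accuracy are left out
}

win_rate(cv_toy, "knn", "freq")
win_rate(cv_toy, "rf", "freq")
win_rate(cv_toy, "rf", "naive")
```

A tie with the baseline does not count as a win in this sketch, so a method that merely reproduces the frequency baseline scores 0% against it.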
| Scenario | Recommendation |
|---|---|
| ML model performance >> freq across most groups | ML weights add genuine signal; use them |
| ML model performance \(\approx\) freq | ML is no better than frequency; prefer wei_freq_c2c (simpler, faster) |
| ML model performance < freq for many groups | ML is adding noise; do not use ML weights |
| High skipped-group rate (>20%) | Features may have too many missing values, groups are too small, or method fitting is unstable |
Because the train/test split is random, results vary between runs.
For more stable estimates, pool more data into ml$data
(e.g. multiple survey waves) or run cat2cat_ml_run()
several times and average the summaries.
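The effect of averaging over repeated random splits can be seen with a small base R sketch. one_run() is a hypothetical stand-in for a single diagnostic run: it scores only a majority-class baseline on toy labels. In practice the body would be replaced by a call to cat2cat_ml_run() and extraction of the summary of interest.

```r
# Toy labels for one mapping group.
labels <- rep(c("A", "B", "C"), times = c(30, 15, 5))

# Hypothetical single holdout run: accuracy of the majority-class baseline.
one_run <- function(seed, test_prop = 0.2) {
  set.seed(seed)
  test_idx <- sample(length(labels), round(test_prop * length(labels)))
  majority <- names(which.max(table(labels[-test_idx])))
  mean(labels[test_idx] == majority)
}

runs <- vapply(1:20, one_run, numeric(1))

# Single runs vary; the average across splits is steadier.
round(c(min = min(runs), mean = mean(runs), max = max(runs)), 3)
```

Reporting the spread alongside the mean gives a rough sense of how much a single split can mislead.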
Caveat: high cat2cat_ml_run() model performance means the model discriminates well within mapping groups. It does not validate the mapping table itself. A perfect model with a wrong transition table will still produce wrong results.