Introduction to cosmic

Load the Seattle example

The demo data are distributed as a single data frame d in the package extdata directory.

path <- system.file("extdata", "dataSPD.RData", package = "cosmic")
if (!nzchar(path)) {
  path_candidates <- c(
    file.path(vignette_dir, "..", "inst", "extdata", "dataSPD.RData"),
    file.path("..", "inst", "extdata", "dataSPD.RData"),
    file.path("inst", "extdata", "dataSPD.RData")
  )
  path <- path_candidates[file.exists(path_candidates)][1]
}

if (is.na(path) || !nzchar(path)) {
  stop("Could not find 'dataSPD.RData' in the installed package or source tree.")
}

load(path)

Each row is an officer-incident record. id identifies the incident, idOff identifies the officer, and y is the officer’s highest observed force level on an ordinal scale from 1 to 4. Incidents with no within-incident variation in force have already been removed, since they do not contribute information to the conditional likelihood. The resulting dataset is already in the format expected by cosmic(): one row per officer within incident, with the outcome coded as consecutive integers starting at 1.

seattle_overview <- data.frame(
  quantity = c("Officer-incident rows", "Incidents", "Officers", "Force levels"),
  value = c(nrow(d),
            length(unique(d$id)),
            length(unique(d$idOff)),
            length(unique(d$y))))
seattle_overview
#>                quantity value
#> 1 Officer-incident rows 37016
#> 2             Incidents  4821
#> 3              Officers  1503
#> 4          Force levels     4

force_distribution <- data.frame(
  force_level = as.integer(names(table(d$y))),
  count = as.integer(table(d$y)))
force_distribution
#>   force_level count
#> 1           1 28807
#> 2           2  6243
#> 3           3  1885
#> 4           4    81

officers_per_incident <- table(table(d$id))
incident_size_distribution <- data.frame(
  officers_in_incident = as.integer(names(officers_per_incident)),
  number_of_incidents = as.integer(officers_per_incident)
)
incident_size_distribution
#>    officers_in_incident number_of_incidents
#> 1                     2                 190
#> 2                     3                 406
#> 3                     4                 581
#> 4                     5                 630
#> 5                     6                 523
#> 6                     7                 427
#> 7                     8                 378
#> 8                     9                 308
#> 9                    10                 258
#> 10                   11                 244
#> 11                   12                 183
#> 12                   13                 174
#> 13                   14                 127
#> 14                   15                 108
#> 15                   16                  99
#> 16                   17                  64
#> 17                   18                  70
#> 18                   19                  51

Fit the COSMIC model

To fit the conditional ordinal stereotype model, use the cosmic() function. This compiles the Stan program, prepares the Seattle data, and draws samples from the posterior distribution for the officer effects \(\boldsymbol\lambda\) and the ordinal score parameters \(\mathbf{s}\). The conditional likelihood calculation is parallelized across incidents, so threads controls within-chain parallelism. In this example it is usually better to let chains run sequentially (cores = 1) and use the available CPU budget inside each chain.

COSMIC fits models through cmdstanr, which is distributed through the Stan R-universe rather than CRAN. If cmdstanr or CmdStan is not already installed, run:

install.packages("cmdstanr",
                 repos = c("https://stan-dev.r-universe.dev",
                           getOption("repos")))
cmdstanr::install_cmdstan()

# 5-6 hours?
fit <- cosmic(d,
              incidentID = id,
              officerID  = idOff,
              y          = y,
              iter       = 10000,
              chains     = 4,
              cores      = 1,
              threads    = 8)

Check sampler diagnostics

After fitting, start by reviewing convergence and Hamiltonian Monte Carlo diagnostics. The most important questions are whether the chains mixed well, whether effective sample sizes are adequate, and whether the Hamiltonian Monte Carlo sampler encountered pathologies such as divergences.

diagnose(fit)

#> COSMIC fit diagnostics: OK 
#>   Max R-hat: 1.001 
#>   Min n_eff: 7172.822 
#>   Divergences: 0 
#>   Max treedepth saturations: 0 
#>   Low E-BFMI chains: none 
#> No major diagnostic problems detected.

Summarize officers relative to their peers

The next step is to summarize each officer relative to their peer group. The officer-specific parameters \(\lambda_i\) are not individually identified; only contrasts between officers are meaningful. For that reason, officer_summary() reports each officer relative to a set of peers for whom the data are informative enough to estimate differences in \(\lambda\) with useful precision.

The prior standard deviation on the \(\lambda_i\) is set with priorSD_lambda when calling cosmic(), with the default equal to 2. The prior variance for the difference of \(\lambda_i\) and \(\lambda_j\) is 2*priorSD_lambda^2. Officers \(i\) and \(j\) will be considered peers if their posterior variance, \(\mathrm{Var}(\lambda_i-\lambda_j\mid\mathrm{data}) < \mathtt{pct\_threshold} \times 2 \times \mathtt{priorSD\_lambda}^2\). pct_threshold is the percent reduction in variance in the difference between two officers’ values of \(\lambda\) needed in order for them to be considered peers, with the default set to 0.25. officer_summary() will also compute the probability that the officer is in the top or bottom flag_officer_tail_pct of the distribution of \(\lambda\) in their peer group, with the default set to 5%.

flag_officer_tail_pct <- 0.05
off_summary <- officer_summary(fit,
                               pct_threshold = 0.25,
                               pct_tail = flag_officer_tail_pct)

head(off_summary)

#>   idOffOrig idOff nPeers nInc force1 force2 force3 force4 pRankToppct
#> 1         1     1    426    2      1      1      0      0     0.25785
#> 2         2     2      0    1      1      0      0      0          NA
#> 3         3     3    471    2      1      1      0      0     0.16500
#> 4         4     4     74    5      5      0      0      0     0.00005
#> 5         5     5    846   33     33      0      0      0     0.00000
#> 6         6     6      0    2      2      0      0      0          NA
#>   pRankBotpct      lamMean    lam025     lam975
#> 1     0.10820  0.333570433 -2.377860  2.8710224
#> 2          NA           NA        NA         NA
#> 3     0.14960 -0.003180971 -2.683507  2.5165607
#> 4     0.95380 -3.242193464 -6.172846 -0.7727395
#> 5     0.95355 -3.510387239 -5.999311 -1.6184494
#> 6          NA           NA        NA         NA

The resulting table includes the number of peers for each officer, incident counts, force-category counts, and posterior summaries of each officer’s latent propensity relative to peers. lamMean reports the posterior mean for the difference between \(\lambda_i\) and the mean of the \(\lambda\)s in officer \(i\)’s peer group. lam025 and lam975 give the endpoints of the 95% credible interval. Some officers have no reported peers because their force counts are too sparse for precise estimation of their \(\lambda\), or because they are weakly connected to the part of the officer network where pairwise contrasts can be estimated well.

The value of \(\lambda\) can be interpreted as a relative risk of using a more serious force level. \[ \frac{P(Y=y_2|\lambda=\lambda_0)}{P(Y=y_1|\lambda=\lambda_0)} = \frac{P(Y=y_2|\lambda=0)}{P(Y=y_1|\lambda=0)} e^{\left(s_{y_2}-s_{y_1}\right)\lambda_0} \] Since \(s_0=0\) and \(s_1=1\), one way of expressing \(\lambda\) is \[ e^{\lambda_0} = \frac{\frac{P(Y=1|\lambda=\lambda_0)}{P(Y=0|\lambda=\lambda_0)}} {\frac{P(Y=1|\lambda=0)}{P(Y=0|\lambda=0)}} \] The relative risk ratio of using Level 1 versus Level 0 force for an officer with \(\lambda=\lambda_0\) compared to the average of their peer group.

Flag officers in the tails of the peer distribution

To focus attention on officers with strong posterior evidence of being unusually high or low relative to peers, filter the officer summary into an outlier report. outlier_report() keeps only those officers with posterior probability exceeding prob_outlier of being in either the top or bottom pct_tail of the peer-group distribution of \(\lambda\).

outliers <- outlier_report(off_summary, prob_outlier = 0.9)
head(outliers, 10)

#>    idOffOrig idOff nPeers nInc force1 force2 force3 force4 pRankToppct
#> 1         28    28   1170   37     12     24      1      0     0.99855
#> 2        836   836   1188   75     33     39      2      1     0.99735
#> 3       1450  1450    864    7      0      6      1      0     0.98625
#> 4       1006  1006   1042    9      1      8      0      0     0.98010
#> 5        182   182   1187   79     25     43     11      0     0.97370
#> 6       1341  1341   1153   29      8     12      9      0     0.96065
#> 7        517   517    104    5      5      0      0      0     0.00040
#> 8       1447  1447    165    7      7      0      0      0     0.00035
#> 9        603   603    179    4      4      0      0      0     0.00035
#> 10       428   428     42    5      5      0      0      0     0.00030
#>    pRankBotpct   lamMean    lam025     lam975
#> 1      0.00000  3.001081  2.278690  3.7535805
#> 2      0.00000  2.634561  2.119185  3.1536983
#> 3      0.00000  3.705152  1.734301  6.0632103
#> 4      0.00000  3.316528  1.781626  5.0991616
#> 5      0.00000  2.445398  1.907371  2.9878141
#> 6      0.00000  2.651643  1.778980  3.5635044
#> 7      0.92700 -2.827792 -5.829419 -0.4328217
#> 8      0.93600 -2.937077 -5.929507 -0.5594303
#> 9      0.95665 -3.219114 -6.088955 -0.7492189
#> 10     0.93415 -2.971142 -5.965520 -0.5442428

  outliers_display <- outliers
  prob_cols <- names(outliers_display) %in% c("pRankToppct", "pRankBotpct")
  tail_pct_label <- formatC(100 * flag_officer_tail_pct,
                            format = "fg", digits = 3)
  names(outliers_display)[names(outliers_display) == "pRankToppct"] <-
    paste0("Prob. top ", tail_pct_label, "%")
  names(outliers_display)[names(outliers_display) == "pRankBotpct"] <-
    paste0("Prob. bottom ", tail_pct_label, "%")
  outliers_display[prob_cols] <-
    lapply(outliers_display[prob_cols],
           function(x) sprintf("%.2f", x))

  digits <- rep(0, ncol(outliers_display))
  digits[seq.int(length(digits)-4,length(digits))] <- 2L

  knitr::kable(
    outliers_display,
    format = "html",
    digits = digits,
    caption = "Officers with strong posterior evidence of being in the tails of the peer distribution",
    align = rep("r", ncol(outliers_display))) |>
    kableExtra::kable_styling(
      bootstrap_options = c("striped", "hover", "condensed", "responsive"),
      full_width = FALSE,
      position = "center") |>
    kableExtra::row_spec(0, bold = TRUE, color = "white",
                         background = "#2C3E50") |>
    kableExtra::column_spec(1, bold = TRUE) |>
    kableExtra::scroll_box(width = "100%", height = "320px")

Officers with strong posterior evidence of being in the tails of the peer distribution
idOffOrig	idOff	nPeers	nInc	force1	force2	force3	force4	Prob. top 5%	Prob. bottom 5%	lamMean	lam025	lam975
28	28	1170	37	12	24	1	0	1.00	0.00	3.00	2.28	3.75
836	836	1188	75	33	39	2	1	1.00	0.00	2.63	2.12	3.15
1450	1450	864	7	0	6	1	0	0.99	0.00	3.71	1.73	6.06
1006	1006	1042	9	1	8	0	0	0.98	0.00	3.32	1.78	5.10
182	182	1187	79	25	43	11	0	0.97	0.00	2.45	1.91	2.99
1341	1341	1153	29	8	12	9	0	0.96	0.00	2.65	1.78	3.56
517	517	104	5	5	0	0	0	0.00	0.93	-2.83	-5.83	-0.43
1447	1447	165	7	7	0	0	0	0.00	0.94	-2.94	-5.93	-0.56
603	603	179	4	4	0	0	0	0.00	0.96	-3.22	-6.09	-0.75
428	428	42	5	5	0	0	0	0.00	0.93	-2.97	-5.97	-0.54
427	427	156	8	8	0	0	0	0.00	0.92	-2.83	-5.78	-0.47
1027	1027	51	7	7	0	0	0	0.00	0.93	-2.94	-5.94	-0.52
321	321	166	5	5	0	0	0	0.00	0.93	-2.90	-5.84	-0.51
372	372	112	15	8	4	3	0	0.00	0.96	-3.25	-6.16	-0.79
737	737	322	8	8	0	0	0	0.00	0.91	-2.83	-5.71	-0.50
265	265	115	8	8	0	0	0	0.00	0.93	-2.85	-5.86	-0.50
191	191	39	13	13	0	0	0	0.00	0.93	-2.94	-5.92	-0.59
31	31	289	3	3	0	0	0	0.00	0.95	-3.25	-6.07	-0.77
1201	1201	270	12	12	0	0	0	0.00	0.91	-2.79	-5.71	-0.48
19	19	182	10	10	0	0	0	0.00	0.92	-2.76	-5.73	-0.43
326	326	507	11	11	0	0	0	0.00	0.90	-2.99	-5.85	-0.75
1356	1356	185	10	10	0	0	0	0.00	0.91	-2.78	-5.74	-0.43
245	245	95	10	10	0	0	0	0.00	0.92	-2.82	-5.85	-0.47
528	528	357	6	6	0	0	0	0.00	0.92	-3.06	-5.84	-0.67
435	435	82	14	13	1	0	0	0.00	0.93	-2.86	-5.88	-0.51
1178	1178	238	9	9	0	0	0	0.00	0.94	-2.98	-5.94	-0.67
152	152	336	6	6	0	0	0	0.00	0.95	-3.24	-6.05	-0.86
4	4	74	5	5	0	0	0	0.00	0.95	-3.24	-6.17	-0.77
112	112	484	5	5	0	0	0	0.00	0.96	-3.49	-6.23	-1.06
420	420	694	17	17	0	0	0	0.00	0.90	-3.12	-5.80	-1.00
841	841	481	16	16	0	0	0	0.00	0.90	-2.95	-5.79	-0.69
328	328	739	32	32	0	0	0	0.00	0.90	-3.14	-5.84	-1.12
391	391	713	14	14	0	0	0	0.00	0.91	-3.18	-5.86	-1.09
21	21	770	19	19	0	0	0	0.00	0.92	-3.24	-5.86	-1.23
100	100	798	35	35	0	0	0	0.00	0.92	-3.23	-5.82	-1.29
57	57	314	12	12	0	0	0	0.00	0.92	-2.88	-5.78	-0.57
1407	1407	349	12	12	0	0	0	0.00	0.92	-2.95	-5.86	-0.66
429	429	315	11	11	0	0	0	0.00	0.93	-2.95	-5.89	-0.61
94	94	795	21	21	0	0	0	0.00	0.93	-3.34	-5.93	-1.34
1461	1461	22	10	10	0	0	0	0.00	0.93	-3.12	-6.15	-0.74
461	461	658	4	4	0	0	0	0.00	0.93	-3.37	-6.03	-1.13
92	92	819	21	21	0	0	0	0.00	0.94	-3.36	-5.90	-1.39
527	527	812	28	28	0	0	0	0.00	0.94	-3.35	-5.93	-1.42
597	597	677	18	18	0	0	0	0.00	0.94	-3.36	-6.02	-1.22
409	409	789	38	38	0	0	0	0.00	0.94	-3.37	-5.96	-1.39
632	632	193	6	6	0	0	0	0.00	0.94	-3.04	-5.96	-0.63
579	579	843	68	68	0	0	0	0.00	0.95	-3.42	-5.89	-1.53
373	373	722	11	11	0	0	0	0.00	0.95	-3.53	-6.15	-1.39
75	75	362	6	6	0	0	0	0.00	0.95	-3.31	-6.15	-0.92
5	5	846	33	33	0	0	0	0.00	0.95	-3.51	-6.00	-1.62
342	342	808	33	33	0	0	0	0.00	0.95	-3.47	-6.03	-1.51
237	237	829	27	27	0	0	0	0.00	0.96	-3.49	-6.03	-1.55
288	288	834	61	61	0	0	0	0.00	0.96	-3.52	-6.01	-1.61
287	287	884	61	60	1	0	0	0.00	0.96	-3.63	-6.03	-1.78
387	387	822	33	33	0	0	0	0.00	0.96	-3.51	-6.06	-1.60
189	189	848	41	41	0	0	0	0.00	0.96	-3.57	-6.10	-1.70
411	411	842	26	26	0	0	0	0.00	0.96	-3.58	-6.08	-1.65
271	271	855	49	49	0	0	0	0.00	0.97	-3.65	-6.15	-1.78
247	247	850	53	53	0	0	0	0.00	0.97	-3.68	-6.17	-1.80
629	629	800	24	24	0	0	0	0.00	0.97	-3.71	-6.21	-1.70
399	399	806	37	37	0	0	0	0.00	0.97	-3.72	-6.23	-1.73
175	175	749	18	18	0	0	0	0.00	0.97	-3.80	-6.37	-1.66
457	457	860	65	64	1	0	0	0.00	0.98	-3.78	-6.26	-1.90
636	636	838	39	39	0	0	0	0.00	0.99	-3.97	-6.47	-2.03

Optional posterior extraction

If you need custom summaries or plots, extract posterior draws directly. In this model, the score parameters are anchored so that the baseline level is 0 and the next score is 1, while Stan samples the positive increments in sDelta. Under the notation used below, the remaining free score parameters are reconstructed as

\[ s_2 = 1 + \mathtt{sDelta}_1, \qquad s_3 = 1 + \mathtt{sDelta}_1 + \mathtt{sDelta}_2. \]

sDelta_draws <- posterior(fit, pars = "sDelta")$sDelta

score_summary <- data.frame(
  parameter = c("s[2]", "s[3]"),
  mean = c(
    mean(1 + sDelta_draws[, 1]),
    mean(1 + sDelta_draws[, 1] + sDelta_draws[, 2])
  ),
  lower_95 = c(
    unname(stats::quantile(1 + sDelta_draws[, 1], probs = 0.025)),
    unname(stats::quantile(1 + sDelta_draws[, 1] + sDelta_draws[, 2],
                           probs = 0.025))
  ),
  upper_95 = c(
    unname(stats::quantile(1 + sDelta_draws[, 1], probs = 0.975)),
    unname(stats::quantile(1 + sDelta_draws[, 1] + sDelta_draws[, 2],
                           probs = 0.975))
  )
)

The table below summarizes the posterior distribution of the non-fixed score parameters. The expression \(\frac{1}{s_{y_2}-s_{y_1}}\log 2\) gives the amount that \(\lambda\) must increase to double the relative risk of force level \(y_2\) over \(y_1\).

Posterior summaries for the estimated score parameters
parameter	mean	lower_95	upper_95
s[2]	1.009	1.000	1.031
s[3]	1.086	1.007	1.267

The next plots show the joint and marginal posterior distributions for these score parameters.

ggplot(data.frame(x = sDelta_draws[,1], 
                  y = sDelta_draws[,2]),
       aes(x=x, y=y)) +
  geom_point(alpha = 0.05, size = 0.5) +
  geom_density_2d(color = "red") +
  labs(x = expression(s[2] - 1),
       y = expression(s[3] - s[2])) +
  theme_minimal()

hist(1+sDelta_draws[,1], xlab=expression(s[2]), main="")
abline(v=1+mean(sDelta_draws[,1]))

hist(1+sDelta_draws[,1]+sDelta_draws[,2], 
     xlab=expression(s[3]), main="")
abline(v=1+mean(sDelta_draws[,1])+mean(sDelta_draws[,2]))