# Load packages
library(dplyr)
library(ggplot2)
library(seqwrap)
library(edgeR)
library(gt)
# This vignette is pre-rendered by the package maintainer (not built on CRAN),
# so n_cores can be tuned to the rendering machine.
n_cores <- parallel::detectCores()
# Load data
metadata <- dungan_counts$metadata
counts <- dungan_counts$countdataFitting models with seqwrap
The seqwrap package is designed to iterate over multiple targets in a high-dimensional data set to fit a user-defined regression model using a user-selected model fitting function. Each target may represent, e.g., a gene, an exon, a protein, a methylation site, or any other outcome from a high-throughput omics platform. Each target is measured across a set of samples, with a metadata set used to describe study covariates. In the basic case, data and metadata are combined in each iteration to fit the selected model.
Dungan et al. [1] evaluated a senolytic treatment (dasatinib and quercetin) in mouse skeletal muscle subjected to mechanical overload through synergist ablation. The study included sham surgery and a vehicle control, allowing analysis of the effect of senolytic treatment and the interaction between mechanical overloading and senolytic treatment. RNA-seq was performed on 20 animals (5 per group). The data are available in the seqwrap package under dungan_counts. We will load the data and filter low-expression genes.
The metadata contains the sample ID corresponding to columns in the counts data set. We are using seq_sample_id as the column name here. The first rows of the prepared metadata are displayed in Table 1.
| seq_sample_id | treatment | surgery |
|---|---|---|
| OS14d5 | Senolytic | Overload |
| OS14d6 | Senolytic | Overload |
| OSS4 | Senolytic | Sham |
| OSS5 | Senolytic | Sham |
| OV14D5 | Vehicle | Overload |
| OV14D6 | Vehicle | Overload |
| OVS4 | Vehicle | Sham |
| OVS5 | Vehicle | Sham |
The count data matrix can be manipulated as a data frame, with the first column indicating the targets (in this case, gene symbols). Importantly, additional column names have names corresponding to the seq_sample_id variable in the metadata. As an example, the first rows of the count data frame are shown for the same samples as in the metadata data frame (see Table 2). seqwrap expects that targets are unique so before going further we should validate this in the data. When doing so we find that five targets have duplicate rows in the data set, for simplicity, we will remove them.
| gene_name | OS14d5 | OS14d6 | OSS4 | OSS5 | OV14D5 | OV14D6 | OVS4 | OVS5 |
|---|---|---|---|---|---|---|---|---|
| Gnai3 | 858 | 773 | 280 | 295 | 700 | 357 | 380 | 236 |
| Cdc45 | 63 | 71 | 45 | 43 | 61 | 39 | 53 | 54 |
| H19 | 63437 | 69156 | 4539 | 4694 | 75553 | 63084 | 3565 | 4981 |
| Narf | 1935 | 2033 | 2573 | 2415 | 2275 | 1807 | 2999 | 2543 |
| Cav2 | 730 | 701 | 491 | 400 | 584 | 441 | 548 | 320 |
| Klf6 | 1300 | 1232 | 356 | 327 | 820 | 458 | 530 | 518 |
| Scmh1 | 672 | 635 | 738 | 547 | 599 | 464 | 826 | 637 |
| Cox5a | 5384 | 6501 | 7794 | 6355 | 7189 | 5354 | 8651 | 8478 |
| Tbx2 | 91 | 103 | 59 | 55 | 131 | 107 | 59 | 35 |
| Tbx4 | 43 | 77 | 25 | 33 | 65 | 45 | 37 | 27 |
# Find targets with more than a unique row in the data
rm_targets <- counts |>
summarise(.by = gene_name,
n = n()) |>
filter(n > 1) |>
pull(gene_name)
# Filter out non-uniqe rows
counts <- counts |>
filter(!(gene_name %in% rm_targets))The data suggests a negative binomial model were the linear predictor can be used to assess the difference between senolytic treatment and control (treat; \(\beta_1g\)), mechanical overloading compared to sham surgery (mov; \(\beta_2g\)), and their interaction (\(\beta_3g\)) for gene g in 1 to 12273 and row i in 1 to 20 of the meta data (Equation 1). We want to control for differences in sequencing depth by including an offset (si). We will use the TMM method [2] to calculate the effective library size.
\[ \begin{align} y_{g[i]} &\sim \operatorname{NB2}(\mu_{g[i]}, \theta_g) \\ \operatorname{log}(\mu_{g[i]}) &= \beta_0 + \beta_{1g}\text{treat}_i + \beta_{2g}\text{mov}_i + \beta_{3g}(\text{treat}_i \times \text{mov}_i) + \operatorname{log}(s_i) \end{align} \tag{1}\]
Below, we are using the edgeR package to calculate the library size normalization factor. Effective library sizes are then expressed as relative to the median effective library and log transformed.
# Use EdgeR to calculate the TMM
y <- edgeR::DGEList(counts[,-1])
y <- edgeR::calcNormFactors(y)
# Combine the data into metadata
metadata <- metadata |>
inner_join(
y$samples |>
tibble::rownames_to_column("seq_sample_id")
) |>
mutate(
efflibsize = (lib.size * norm.factors) / median(lib.size * norm.factors),
efflibsize = log(efflibsize)
) Joining with `by = join_by(seq_sample_id)`
seqwrap_compose is used to combine the data, metadata and the selected modelling function with its arguments into a swobject. A negative binomial model can be fitted using e.g., glmmTMB from the glmmTMB package (glmmTMB::glmmTMB) or MASS::glm.nb, here we will use glmmTMB as it can be extended in subsequent modelling steps.
The specified glmmTMB function (modelfun slot) needs a formula and a family function, these are supplied in the arguments slot. The count data are supplied in the data slot and the metadata in the metadata slot. Notice that we use the namespace operator (::) to specify selected functions to avoid difficulties in parallel processing. The samplename indicates the column name in the metadata that gives sample names.
m1.1 <- seqwrap_compose(
modelfun = glmmTMB::glmmTMB,
data = counts,
metadata = metadata,
arguments = list(formula = y ~ treatment * surgery + offset(efflibsize),
family = glmmTMB::nbinom2),
samplename = "seq_sample_id"
)Next, we will run the model using seqwrap. We recommend using subset to specify a smaller number of targets when prototyping the model. Here, we will assess whether the model output matches expectations. By specifying return_models = TRUE, all model objects will be returned in the results object, which can be inspected if needed. Here we are using a single core for fitting the models, specified in the cores argument.
m1.1_temp <- seqwrap(
m1.1,
cores = 1,
subset = 1:24,
return_models = TRUE,
verbose = FALSE
)Following iterative fitting using seqwrap, we can summarize the models using seqwrap_summarise, which produces a list of two summaries. Below, we are saving the summaries in an object for later access. Running the seqwrap_summarise function also gives us an overview of the model and successfully summarized targets.
In the summaries slot, model parameter values are returned (see Table 3). These are produced using broom.mixed::tidy which is a general function to retrieve model parameters from many regression model packages in R.
In the evaluations slot, each model is evaluated using tests from the DHARMa package (see Table 4).
m1.1_temp_sum <- seqwrap_summarise(m1.1_temp, verbose = FALSE)| target | effect | component | term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|---|---|
| Axin2 | fixed | cond | (Intercept) | 5.438 | 0.048 | 113.314 | 0 |
| Axin2 | fixed | cond | treatmentSenolytic | −0.199 | 0.069 | −2.897 | 0.004 |
| Axin2 | fixed | cond | surgeryOverload | −0.27 | 0.069 | −3.939 | 8.177 × 10−5 |
| Axin2 | fixed | cond | treatmentSenolytic:surgeryOverload | 0.174 | 0.097 | 1.788 | 0.074 |
| Brat1 | fixed | cond | (Intercept) | 4.99 | 0.051 | 98.437 | 0 |
| Brat1 | fixed | cond | treatmentSenolytic | −0.048 | 0.071 | −0.67 | 0.503 |
| Brat1 | fixed | cond | surgeryOverload | 0.209 | 0.068 | 3.069 | 0.002 |
| Brat1 | fixed | cond | treatmentSenolytic:surgeryOverload | 0.003 | 0.096 | 0.03 | 0.976 |
| Cav2 | fixed | cond | (Intercept) | 6.072 | 0.047 | 128.043 | 0 |
| Cav2 | fixed | cond | treatmentSenolytic | 0.072 | 0.067 | 1.079 | 0.281 |
| target | uniformity | dispersion | outliers |
|---|---|---|---|
| Axin2 | 0.9670810 | 0.672 | 1 |
| Brat1 | 0.8040232 | 0.600 | 1 |
| Cav2 | 0.9781748 | 0.776 | 1 |
| Ccnd2 | 0.9163917 | 0.392 | 1 |
| Cdc45 | 0.9087045 | 0.720 | 1 |
| Cox5a | 0.4403266 | 0.624 | 1 |
| Dlat | 0.8195092 | 0.488 | 1 |
| Drp2 | 0.7819079 | 0.352 | 1 |
| Fer | 0.9533689 | 0.848 | 1 |
| Fgf6 | 0.9384286 | 0.456 | 1 |
Specifying summary and evaluation function
We can specify custom functions for evaluating the model object. seqwrap allows for two slots (summaries and evaluations). Each function should take the model object as an argument, and for compatibility with seqwrap_summarise, return a data frame. For our next step, we would like to retrieve the dispersion parameter from each model. We will place this parameter in the evaluation function. We can extract the dispersion parameter on the log scale from the glmmTMB model object. We will construct a function around the extraction and include the observed counts from the data:
dispersion_fun <- function(x) {
out <- data.frame(
dispersion = data.frame(summary(x$sdr))["betadisp",1],
dispersion.se = data.frame(summary(x$sdr))["betadisp",2],
log_mu = log(mean(x$frame$y))
)
return(out)
}Using our preliminary models, we can validate the function for one model.
dispersion_fun(m1.1_temp@models[[1]]) dispersion dispersion.se log_mu
1 5.043444 0.5786351 5.228967
The output is, as expected, a data frame with three columns. The dispersion parameter is expected to be on the log scale. To validate this we can directly access the estimated dispersion parameter from the model object using sigma(), which return the dispersion parameter on the natural scale.
log(sigma(m1.1_temp@models[[1]]))[1] 5.043444
Running the full model
When we are happy with preliminary performance on a subset of targets, we can initialize seqwrap to iterate over all targets. We are updating the previous model with the new evaluation function directly in seqwrap. Notice that we also remove the subset and set return_models to FALSE. Saving all model objects will be too memory-consuming.
We should expect some errors from modelling many thousands of targets. As seen in the note from seqwrap, some models are associated with warnings from the modelling algorithm and/or errors from the evaluation function.
m1.2_results <- seqwrap(
m1.1,
eval_fun = dispersion_fun,
cores = n_cores,
return_models = FALSE,
verbose = FALSE
)Errors can be inspected by accessing the errors slot. From the inspection below, warnings are at least to some extent related to model convergence.
warnings <- m1.2_results@errors|>
# filter the non-null elements
filter(purrr::map_lgl(warnings_fit, ~ !is.null(.x[[1]]))) |>
# save in object
pull(warnings_fit)
warnings[1]$`1010001B22Rik`
<simpleWarning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')>
Using target data for prior information in glmmTMB
A major benefit of canonical R packages for RNA-seq data modelling is the sharing of information between targets through shrinkage and empirical Bayes strategies. We suggest using priors in glmmTMB as convenient way of flexibly regularize models using an empirical Bayes strategy. In glmmTMB, the priors argument can be used to set priors on parameters. We will use the estimated parameters from the first model to set up the priors.
First, we need to extract the parameters. From the evaluation function we get the average log count and the dispersion parameter estimate. From the summaries, we will get the other parameters (summarized in Figure 1).
m1.2_sum <- seqwrap_summarise(m1.2_results, verbose = FALSE)ggplot2::geom_smooth`.
Except for the dispersion values, we can easily summarize parameter values by their mean and standard deviation. A more elaborate summary of the dispersion values could include an estimate of the mean (log) dispersion, conditional on the average log counts. As the dispersion prior will be slightly different for each target (based on the observed counts), we will combine priors into target-specific data frames and store them as a list. seqwrap will make target-specific list elements available in each iteration, making it possible to build the prior data frame for glmmTMB.
# Summarise parameters from the main output
param_sum <- m1.2_sum$summaries |>
summarise(.by = term,
m = mean(estimate),
s = sd(estimate))
# Fit a model to the log dispersion values
trend_model <- loess(dispersion ~ log_mu,
data = m1.2_sum$evaluations,
span = 0.7)
# Predict dispersion for each gene based on log raw counts
# and combine into a prior.
dispersion_prior <- data.frame(gene = counts[,1],
pred = round(
predict(trend_model,
newdata = data.frame(
log_mu = log(rowMeans(counts[,-1]))
)
), 3),
s = round(
trend_model$s, 3)
) |>
mutate(prior = paste0("normal(", pred, ",", s, ")"))
# Combine parameters in a data frame...
Priors_df <- data.frame(prior = paste0("normal(",
round(param_sum$m, 2),
",",
round(param_sum$s,2 ),
")"),
class = rep("fixef", 4),
coef = param_sum$term)
# ... extract a prediction for the mean dispersion parameter per target
# and combine with other parameters
Priors_list <- list()
for( j in 1:nrow( counts )) {
df <- bind_rows(Priors_df,
data.frame(
prior =
dispersion_prior[dispersion_prior$gene ==
counts[j,1],4],
class = "fixef_disp",
coef = "1"
)
)
Priors_list[[j]] <- df
}Again, we will combine data, metadata, and arguments using seqwrap_compose. We are now also adding the priors to the arguments. The variables used in the priors data frame are made available from the target-specific list referenced in targetdata = Prior_list. Using arguments = alist(... avoids evaluation of the list so that the data frame is not evaluated before being used in each iteration. Priors have been built corresponding what is expected by glmmTMB (see Priors in glmmTMB).
m1.3 <- seqwrap_compose(
modelfun = glmmTMB::glmmTMB,
data = counts,
metadata = metadata,
eval_fun = dispersion_fun,
arguments = alist(
formula = y ~ treatment * surgery + offset(efflibsize),
family = glmmTMB::nbinom2,
priors = data.frame(
prior = prior,
class = class,
coef = coef)),
targetdata = Priors_list
)
m1.3_temp <- seqwrap(
m1.3,
subset = 1:24,
return_models = TRUE,
verbose = FALSE
)
summary(m1.3_temp@models[[1]]) Family: nbinom2 ( log )
Formula: y ~ treatment * surgery + offset(efflibsize)
Data: list(y = c(194L, 186L, 217L, 183L, 168L, 180L, 160L, 158L, 183L,
198L, 183L, 183L, 168L, 200L, 176L, 254L, 232L, 186L, 169L, 154L
), treatment = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = c("Vehicle",
"Senolytic"), class = "factor"), surgery = structure(c(2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L), levels = c("Sham", "Overload"), class = "factor"), efflibsize = c(0.166155699457707,
0.196470816938507, 0.0682347925124099, 0.0387654466787985, 0.0392563629230484,
-0.0785846626255147, -0.225337792488665, -0.000892800687716758,
0.000892004305606193, -0.0490532038569703, 0.144822834668174,
-0.0984341203861923, -0.0169305794837407, 0.049199853171865,
0.100216656584421, 0.0604222716589449, -0.167592705247281, -0.28188788975239,
-0.182898895921145, -0.191409863872739))
AIC BIC logLik -2*log(L) df.resid
192.3 197.3 -91.2 182.3 15
Dispersion parameter for nbinom2 family (): 149
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.40834 0.04526 119.50 < 2e-16 ***
treatmentSenolytic -0.14000 0.05998 -2.33 0.019601 *
surgeryOverload -0.22861 0.06375 -3.59 0.000335 ***
treatmentSenolytic:surgeryOverload 0.09437 0.08213 1.15 0.250533
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Priors:
fixef((Intercept)) ~ normal(5.81, 1.51)
fixef(treatmentSenolytic) ~ normal(0.02, 0.14)
fixef(surgeryOverload) ~ normal(0.12, 0.59)
fixef(treatmentSenolytic:surgeryOverload) ~ normal(0, 0.16)
fixef_disp(1) ~ normal(5.37, 1.729)
Running the prototyping seems to work, we will proceed with the full data set.
m1.3_results <- seqwrap(
m1.3,
cores = n_cores,
return_models = FALSE,
verbose = FALSE
)
m1.3_sum <- seqwrap_summarise(m1.3_results, verbose = FALSE)Regularization increases the number models without convergence issues, in fact all 12273 targets have evaluations and summaries.
The summaries data frame is easily manipulated in subsequent analysis steps to, e.g., adjust p-values and create basic visualizations of the results (Figure 2).
The above example shows how to build models with seqwrap. Importantly, we make no claims about the validity of the final model but urge the user to assess assumptions in their data and models as part of a principled workflow.