Fitting models with seqwrap

Authors
Affiliations

Chidimma Echebiri

University of Inland Norway

Daniel Hammarström

University of Inland Norway

The seqwrap package is designed to iterate over multiple targets in a high-dimensional data set to fit a user-defined regression model using a user-selected model fitting function. Each target may represent, e.g., a gene, an exon, a protein, a methylation site, or any other outcome from a high-throughput omics platform. Each target is measured across a set of samples, with a metadata set used to describe study covariates. In the basic case, data and metadata are combined in each iteration to fit the selected model.

Dungan et al. [1] evaluated a senolytic treatment (dasatinib and quercetin) in mouse skeletal muscle subjected to mechanical overload through synergist ablation. The study included sham surgery and a vehicle control, allowing analysis of the effect of senolytic treatment and the interaction between mechanical overloading and senolytic treatment. RNA-seq was performed on 20 animals (5 per group). The data are available in the seqwrap package under dungan_counts. We will load the data and filter low-expression genes.

# Load packages
library(dplyr)
library(ggplot2)
library(seqwrap)
library(edgeR)
library(gt)

# This vignette is pre-rendered by the package maintainer (not built on CRAN),
# so n_cores can be tuned to the rendering machine.

n_cores <- parallel::detectCores()

# Load data 
metadata <- dungan_counts$metadata
counts <- dungan_counts$countdata

The metadata contains the sample ID corresponding to columns in the counts data set. We are using seq_sample_id as the column name here. The first rows of the prepared metadata are displayed in Table 1.

Table 1: Selected rows from the Dungan et al. metadata data frame.
seq_sample_id treatment surgery
OS14d5 Senolytic Overload
OS14d6 Senolytic Overload
OSS4 Senolytic Sham
OSS5 Senolytic Sham
OV14D5 Vehicle Overload
OV14D6 Vehicle Overload
OVS4 Vehicle Sham
OVS5 Vehicle Sham

The count data matrix can be manipulated as a data frame, with the first column indicating the targets (in this case, gene symbols). Importantly, additional column names have names corresponding to the seq_sample_id variable in the metadata. As an example, the first rows of the count data frame are shown for the same samples as in the metadata data frame (see Table 2). seqwrap expects that targets are unique so before going further we should validate this in the data. When doing so we find that five targets have duplicate rows in the data set, for simplicity, we will remove them.

Table 2: Selected rows from the Dungan et al. counts data frame.
gene_name OS14d5 OS14d6 OSS4 OSS5 OV14D5 OV14D6 OVS4 OVS5
Gnai3 858 773 280 295 700 357 380 236
Cdc45 63 71 45 43 61 39 53 54
H19 63437 69156 4539 4694 75553 63084 3565 4981
Narf 1935 2033 2573 2415 2275 1807 2999 2543
Cav2 730 701 491 400 584 441 548 320
Klf6 1300 1232 356 327 820 458 530 518
Scmh1 672 635 738 547 599 464 826 637
Cox5a 5384 6501 7794 6355 7189 5354 8651 8478
Tbx2 91 103 59 55 131 107 59 35
Tbx4 43 77 25 33 65 45 37 27
# Find targets with more than a unique row in the data
rm_targets <- counts |> 
  summarise(.by = gene_name, 
            n = n()) |> 
  filter(n > 1) |> 
  pull(gene_name)

# Filter out non-uniqe rows
counts <- counts |> 
  filter(!(gene_name %in% rm_targets))

The data suggests a negative binomial model were the linear predictor can be used to assess the difference between senolytic treatment and control (treat; \(\beta_1g\)), mechanical overloading compared to sham surgery (mov; \(\beta_2g\)), and their interaction (\(\beta_3g\)) for gene g in 1 to 12273 and row i in 1 to 20 of the meta data (Equation 1). We want to control for differences in sequencing depth by including an offset (si). We will use the TMM method [2] to calculate the effective library size.

\[ \begin{align} y_{g[i]} &\sim \operatorname{NB2}(\mu_{g[i]}, \theta_g) \\ \operatorname{log}(\mu_{g[i]}) &= \beta_0 + \beta_{1g}\text{treat}_i + \beta_{2g}\text{mov}_i + \beta_{3g}(\text{treat}_i \times \text{mov}_i) + \operatorname{log}(s_i) \end{align} \tag{1}\]

Below, we are using the edgeR package to calculate the library size normalization factor. Effective library sizes are then expressed as relative to the median effective library and log transformed.

# Use EdgeR to calculate the TMM
y <- edgeR::DGEList(counts[,-1])
y <- edgeR::calcNormFactors(y)

# Combine the data into metadata
metadata <- metadata |> 
  inner_join(
    y$samples |> 
  tibble::rownames_to_column("seq_sample_id")
  ) |> 
   mutate(
     efflibsize = (lib.size * norm.factors) / median(lib.size * norm.factors),
     efflibsize = log(efflibsize)
   ) 
Joining with `by = join_by(seq_sample_id)`

seqwrap_compose is used to combine the data, metadata and the selected modelling function with its arguments into a swobject. A negative binomial model can be fitted using e.g., glmmTMB from the glmmTMB package (glmmTMB::glmmTMB) or MASS::glm.nb, here we will use glmmTMB as it can be extended in subsequent modelling steps.

The specified glmmTMB function (modelfun slot) needs a formula and a family function, these are supplied in the arguments slot. The count data are supplied in the data slot and the metadata in the metadata slot. Notice that we use the namespace operator (::) to specify selected functions to avoid difficulties in parallel processing. The samplename indicates the column name in the metadata that gives sample names.

m1.1 <- seqwrap_compose(
  modelfun = glmmTMB::glmmTMB,
  data = counts, 
  metadata = metadata, 
  arguments = list(formula = y ~ treatment * surgery + offset(efflibsize), 
                family = glmmTMB::nbinom2),
  samplename = "seq_sample_id"
  )

Next, we will run the model using seqwrap. We recommend using subset to specify a smaller number of targets when prototyping the model. Here, we will assess whether the model output matches expectations. By specifying return_models = TRUE, all model objects will be returned in the results object, which can be inspected if needed. Here we are using a single core for fitting the models, specified in the cores argument.

m1.1_temp <- seqwrap(
  m1.1, 
  cores = 1, 
  subset = 1:24, 
  return_models = TRUE, 
  verbose = FALSE
)

Following iterative fitting using seqwrap, we can summarize the models using seqwrap_summarise, which produces a list of two summaries. Below, we are saving the summaries in an object for later access. Running the seqwrap_summarise function also gives us an overview of the model and successfully summarized targets.

In the summaries slot, model parameter values are returned (see Table 3). These are produced using broom.mixed::tidy which is a general function to retrieve model parameters from many regression model packages in R.

In the evaluations slot, each model is evaluated using tests from the DHARMa package (see Table 4).

m1.1_temp_sum <- seqwrap_summarise(m1.1_temp, verbose = FALSE)
Table 3: The first ten rows from summarised preliminary models (model summaries).
target effect component term estimate std.error statistic p.value
Axin2 fixed cond (Intercept)  5.438 0.048 113.314 0
Axin2 fixed cond treatmentSenolytic −0.199 0.069  −2.897 0.004
Axin2 fixed cond surgeryOverload −0.27  0.069  −3.939 8.177 × 10−5
Axin2 fixed cond treatmentSenolytic:surgeryOverload  0.174 0.097   1.788 0.074
Brat1 fixed cond (Intercept)  4.99  0.051  98.437 0
Brat1 fixed cond treatmentSenolytic −0.048 0.071  −0.67  0.503
Brat1 fixed cond surgeryOverload  0.209 0.068   3.069 0.002
Brat1 fixed cond treatmentSenolytic:surgeryOverload  0.003 0.096   0.03  0.976
Cav2 fixed cond (Intercept)  6.072 0.047 128.043 0
Cav2 fixed cond treatmentSenolytic  0.072 0.067   1.079 0.281
Table 4: The first ten rows from summarised preliminary models (model evaluations). Values are p-values from the DHARMa package (see here for details).
target uniformity dispersion outliers
Axin2 0.9670810 0.672 1
Brat1 0.8040232 0.600 1
Cav2 0.9781748 0.776 1
Ccnd2 0.9163917 0.392 1
Cdc45 0.9087045 0.720 1
Cox5a 0.4403266 0.624 1
Dlat 0.8195092 0.488 1
Drp2 0.7819079 0.352 1
Fer 0.9533689 0.848 1
Fgf6 0.9384286 0.456 1

Specifying summary and evaluation function

We can specify custom functions for evaluating the model object. seqwrap allows for two slots (summaries and evaluations). Each function should take the model object as an argument, and for compatibility with seqwrap_summarise, return a data frame. For our next step, we would like to retrieve the dispersion parameter from each model. We will place this parameter in the evaluation function. We can extract the dispersion parameter on the log scale from the glmmTMB model object. We will construct a function around the extraction and include the observed counts from the data:

dispersion_fun <- function(x) {
  
    out <- data.frame(
      dispersion =  data.frame(summary(x$sdr))["betadisp",1],
      dispersion.se = data.frame(summary(x$sdr))["betadisp",2],
      log_mu = log(mean(x$frame$y))
      )
  
    return(out)
  
}

Using our preliminary models, we can validate the function for one model.

dispersion_fun(m1.1_temp@models[[1]])
  dispersion dispersion.se   log_mu
1   5.043444     0.5786351 5.228967

The output is, as expected, a data frame with three columns. The dispersion parameter is expected to be on the log scale. To validate this we can directly access the estimated dispersion parameter from the model object using sigma(), which return the dispersion parameter on the natural scale.

log(sigma(m1.1_temp@models[[1]]))
[1] 5.043444

Running the full model

When we are happy with preliminary performance on a subset of targets, we can initialize seqwrap to iterate over all targets. We are updating the previous model with the new evaluation function directly in seqwrap. Notice that we also remove the subset and set return_models to FALSE. Saving all model objects will be too memory-consuming.

We should expect some errors from modelling many thousands of targets. As seen in the note from seqwrap, some models are associated with warnings from the modelling algorithm and/or errors from the evaluation function.

m1.2_results <- seqwrap(
  m1.1,
  eval_fun = dispersion_fun,
  cores = n_cores,
  return_models = FALSE,
  verbose = FALSE
)

Errors can be inspected by accessing the errors slot. From the inspection below, warnings are at least to some extent related to model convergence.

warnings <- m1.2_results@errors|> 
  # filter the non-null elements
  filter(purrr::map_lgl(warnings_fit, ~ !is.null(.x[[1]]))) |> 
  # save in object
  pull(warnings_fit)

warnings[1]
$`1010001B22Rik`
<simpleWarning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')>

Using target data for prior information in glmmTMB

A major benefit of canonical R packages for RNA-seq data modelling is the sharing of information between targets through shrinkage and empirical Bayes strategies. We suggest using priors in glmmTMB as convenient way of flexibly regularize models using an empirical Bayes strategy. In glmmTMB, the priors argument can be used to set priors on parameters. We will use the estimated parameters from the first model to set up the priors.

First, we need to extract the parameters. From the evaluation function we get the average log count and the dispersion parameter estimate. From the summaries, we will get the other parameters (summarized in Figure 1).

m1.2_sum <- seqwrap_summarise(m1.2_results, verbose = FALSE)
Figure 1: Estimates from modelling the Dungan data. The dispersion trend is visualized using the default ggplot2::geom_smooth`.

Except for the dispersion values, we can easily summarize parameter values by their mean and standard deviation. A more elaborate summary of the dispersion values could include an estimate of the mean (log) dispersion, conditional on the average log counts. As the dispersion prior will be slightly different for each target (based on the observed counts), we will combine priors into target-specific data frames and store them as a list. seqwrap will make target-specific list elements available in each iteration, making it possible to build the prior data frame for glmmTMB.

# Summarise parameters from the main output
param_sum <- m1.2_sum$summaries |> 
  summarise(.by = term, 
            m = mean(estimate), 
            s = sd(estimate)) 

# Fit a model to the log dispersion values
trend_model <- loess(dispersion ~ log_mu,
                         data = m1.2_sum$evaluations,
                         span = 0.7)


# Predict dispersion for each gene based on log raw counts
# and combine into a prior.
dispersion_prior <- data.frame(gene =  counts[,1],
                               pred = round(
                                 predict(trend_model,
                                         newdata = data.frame(
                                           log_mu =  log(rowMeans(counts[,-1]))
                                           )
                                   ), 3),
                                 s = round(
                                   trend_model$s, 3)
  ) |>
    mutate(prior = paste0("normal(", pred, ",", s, ")"))

# Combine parameters in a data frame...
Priors_df <- data.frame(prior = paste0("normal(",
                                       round(param_sum$m, 2),
                                       ",",
                                       round(param_sum$s,2 ), 
                                       ")"),
                        class = rep("fixef", 4),
                        coef = param_sum$term)


# ... extract a prediction for the mean dispersion parameter per target
# and combine with other parameters  
Priors_list <- list()
for( j in 1:nrow( counts )) {

    df <- bind_rows(Priors_df,
                    data.frame(
                      prior =
                        dispersion_prior[dispersion_prior$gene ==
                                           counts[j,1],4],
                      class = "fixef_disp",
                      coef = "1"
                    )
    )

    Priors_list[[j]] <- df

}

Again, we will combine data, metadata, and arguments using seqwrap_compose. We are now also adding the priors to the arguments. The variables used in the priors data frame are made available from the target-specific list referenced in targetdata = Prior_list. Using arguments = alist(... avoids evaluation of the list so that the data frame is not evaluated before being used in each iteration. Priors have been built corresponding what is expected by glmmTMB (see Priors in glmmTMB).

m1.3 <- seqwrap_compose(
  modelfun = glmmTMB::glmmTMB,
  data = counts, 
  metadata = metadata, 
  eval_fun = dispersion_fun, 
  arguments =  alist(
    formula = y ~ treatment * surgery + offset(efflibsize), 
    family = glmmTMB::nbinom2,
      priors = data.frame(
        prior = prior,
        class = class,
        coef = coef)), 
  targetdata = Priors_list
)


m1.3_temp <- seqwrap(
  m1.3, 
  subset = 1:24, 
  return_models = TRUE, 
  verbose = FALSE
)


summary(m1.3_temp@models[[1]])
 Family: nbinom2  ( log )
Formula:          y ~ treatment * surgery + offset(efflibsize)
Data: list(y = c(194L, 186L, 217L, 183L, 168L, 180L, 160L, 158L, 183L,  
198L, 183L, 183L, 168L, 200L, 176L, 254L, 232L, 186L, 169L, 154L 
), treatment = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,  
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = c("Vehicle",  
"Senolytic"), class = "factor"), surgery = structure(c(2L, 2L,  
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,  
1L, 1L), levels = c("Sham", "Overload"), class = "factor"), efflibsize = c(0.166155699457707,  
0.196470816938507, 0.0682347925124099, 0.0387654466787985, 0.0392563629230484,  
-0.0785846626255147, -0.225337792488665, -0.000892800687716758,  
0.000892004305606193, -0.0490532038569703, 0.144822834668174,  
-0.0984341203861923, -0.0169305794837407, 0.049199853171865,  
0.100216656584421, 0.0604222716589449, -0.167592705247281, -0.28188788975239,  
-0.182898895921145, -0.191409863872739))

      AIC       BIC    logLik -2*log(L)  df.resid 
    192.3     197.3     -91.2     182.3        15 


Dispersion parameter for nbinom2 family ():  149 

Conditional model:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         5.40834    0.04526  119.50  < 2e-16 ***
treatmentSenolytic                 -0.14000    0.05998   -2.33 0.019601 *  
surgeryOverload                    -0.22861    0.06375   -3.59 0.000335 ***
treatmentSenolytic:surgeryOverload  0.09437    0.08213    1.15 0.250533    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Priors:
fixef((Intercept)) ~ normal(5.81, 1.51) 
fixef(treatmentSenolytic) ~ normal(0.02, 0.14) 
fixef(surgeryOverload) ~ normal(0.12, 0.59) 
fixef(treatmentSenolytic:surgeryOverload) ~ normal(0, 0.16) 
fixef_disp(1) ~ normal(5.37, 1.729) 

Running the prototyping seems to work, we will proceed with the full data set.

m1.3_results <- seqwrap(
  m1.3,
  cores = n_cores,
  return_models = FALSE,
  verbose = FALSE
)

m1.3_sum <- seqwrap_summarise(m1.3_results, verbose = FALSE)

Regularization increases the number models without convergence issues, in fact all 12273 targets have evaluations and summaries.

The summaries data frame is easily manipulated in subsequent analysis steps to, e.g., adjust p-values and create basic visualizations of the results (Figure 2).

Figure 2: A visualization of results from the regularized model of the Dungan et al. data.

The above example shows how to build models with seqwrap. Importantly, we make no claims about the validity of the final model but urge the user to assess assumptions in their data and models as part of a principled workflow.

References

1. Dungan CM, Figueiredo VC, Wen Y, VonLehmden GL, Zdunek CJ, Thomas NT, et al. Senolytic treatment rescues blunted muscle hypertrophy in old mice. GeroScience. 2022;44:1925–40.
2. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology. 2010;11:R25.