Fitting models with seqwrap

Authors

Affiliations

Chidimma Echebiri

University of Inland Norway

Daniel Hammarström

University of Inland Norway

The seqwrap package is designed to iterate over multiple targets in a high-dimensional data set to fit a user-defined regression model using a user-selected model fitting function. Each target may represent, e.g., a gene, an exon, a protein, a methylation site, or any other outcome from a high-throughput omics platform. Each target is measured across a set of samples, with a metadata set used to describe study covariates. In the basic case, data and metadata are combined in each iteration to fit the selected model.

Dungan et al. [1] evaluated a senolytic treatment (dasatinib and quercetin) in mouse skeletal muscle subjected to mechanical overload through synergist ablation. The study included sham surgery and a vehicle control, allowing analysis of the effect of senolytic treatment and the interaction between mechanical overloading and senolytic treatment. RNA-seq was performed on 20 animals (5 per group). The data are available in the seqwrap package under dungan_counts. We will load the data and filter low-expression genes.

# Load packages
library(dplyr)
library(ggplot2)
library(seqwrap)
library(edgeR)
library(gt)

# This vignette is pre-rendered by the package maintainer (not built on CRAN),
# so n_cores can be tuned to the rendering machine.

n_cores <- parallel::detectCores()

# Load data 
metadata <- dungan_counts$metadata
counts <- dungan_counts$countdata

The metadata contains the sample ID corresponding to columns in the counts data set. We are using seq_sample_id as the column name here. The first rows of the prepared metadata are displayed in Table 1.

Table 1: Selected rows from the Dungan et al. metadata data frame.

seq_sample_id	treatment	surgery
OS14d5	Senolytic	Overload
OS14d6	Senolytic	Overload
OSS4	Senolytic	Sham
OSS5	Senolytic	Sham
OV14D5	Vehicle	Overload
OV14D6	Vehicle	Overload
OVS4	Vehicle	Sham
OVS5	Vehicle	Sham

The count data matrix can be manipulated as a data frame, with the first column indicating the targets (in this case, gene symbols). Importantly, additional column names have names corresponding to the seq_sample_id variable in the metadata. As an example, the first rows of the count data frame are shown for the same samples as in the metadata data frame (see Table 2). seqwrap expects that targets are unique so before going further we should validate this in the data. When doing so we find that five targets have duplicate rows in the data set, for simplicity, we will remove them.

Table 2: Selected rows from the Dungan et al. counts data frame.

gene_name	OS14d5	OS14d6	OSS4	OSS5	OV14D5	OV14D6	OVS4	OVS5
Gnai3	858	773	280	295	700	357	380	236
Cdc45	63	71	45	43	61	39	53	54
H19	63437	69156	4539	4694	75553	63084	3565	4981
Narf	1935	2033	2573	2415	2275	1807	2999	2543
Cav2	730	701	491	400	584	441	548	320
Klf6	1300	1232	356	327	820	458	530	518
Scmh1	672	635	738	547	599	464	826	637
Cox5a	5384	6501	7794	6355	7189	5354	8651	8478
Tbx2	91	103	59	55	131	107	59	35
Tbx4	43	77	25	33	65	45	37	27

# Find targets with more than a unique row in the data
rm_targets <- counts |> 
  summarise(.by = gene_name, 
            n = n()) |> 
  filter(n > 1) |> 
  pull(gene_name)

# Filter out non-uniqe rows
counts <- counts |> 
  filter(!(gene_name %in% rm_targets))

The data suggests a negative binomial model were the linear predictor can be used to assess the difference between senolytic treatment and control (treat; \(\beta_1g\)), mechanical overloading compared to sham surgery (mov; \(\beta_2g\)), and their interaction (\(\beta_3g\)) for gene g in 1 to 12273 and row i in 1 to 20 of the meta data (Equation 1). We want to control for differences in sequencing depth by including an offset (s_i). We will use the TMM method [2] to calculate the effective library size.

\[ \begin{align} y_{g[i]} &\sim \operatorname{NB2}(\mu_{g[i]}, \theta_g) \\ \operatorname{log}(\mu_{g[i]}) &= \beta_0 + \beta_{1g}\text{treat}_i + \beta_{2g}\text{mov}_i + \beta_{3g}(\text{treat}_i \times \text{mov}_i) + \operatorname{log}(s_i) \end{align} \tag{1}\]

Below, we are using the edgeR package to calculate the library size normalization factor. Effective library sizes are then expressed as relative to the median effective library and log transformed.

# Use EdgeR to calculate the TMM
y <- edgeR::DGEList(counts[,-1])
y <- edgeR::calcNormFactors(y)

# Combine the data into metadata
metadata <- metadata |> 
  inner_join(
    y$samples |> 
  tibble::rownames_to_column("seq_sample_id")
  ) |> 
   mutate(
     efflibsize = (lib.size * norm.factors) / median(lib.size * norm.factors),
     efflibsize = log(efflibsize)
   )

Joining with `by = join_by(seq_sample_id)`

seqwrap_compose is used to combine the data, metadata and the selected modelling function with its arguments into a swobject. A negative binomial model can be fitted using e.g., glmmTMB from the glmmTMB package (glmmTMB::glmmTMB) or MASS::glm.nb, here we will use glmmTMB as it can be extended in subsequent modelling steps.

The specified glmmTMB function (modelfun slot) needs a formula and a family function, these are supplied in the arguments slot. The count data are supplied in the data slot and the metadata in the metadata slot. Notice that we use the namespace operator (::) to specify selected functions to avoid difficulties in parallel processing. The samplename indicates the column name in the metadata that gives sample names.

m1.1 <- seqwrap_compose(
  modelfun = glmmTMB::glmmTMB,
  data = counts, 
  metadata = metadata, 
  arguments = list(formula = y ~ treatment * surgery + offset(efflibsize), 
                family = glmmTMB::nbinom2),
  samplename = "seq_sample_id"
  )

Next, we will run the model using seqwrap. We recommend using subset to specify a smaller number of targets when prototyping the model. Here, we will assess whether the model output matches expectations. By specifying return_models = TRUE, all model objects will be returned in the results object, which can be inspected if needed. Here we are using a single core for fitting the models, specified in the cores argument.

m1.1_temp <- seqwrap(
  m1.1, 
  cores = 1, 
  subset = 1:24, 
  return_models = TRUE, 
  verbose = FALSE
)

Following iterative fitting using seqwrap, we can summarize the models using seqwrap_summarise, which produces a list of two summaries. Below, we are saving the summaries in an object for later access. Running the seqwrap_summarise function also gives us an overview of the model and successfully summarized targets.

In the summaries slot, model parameter values are returned (see Table 3). These are produced using broom.mixed::tidy which is a general function to retrieve model parameters from many regression model packages in R.

In the evaluations slot, each model is evaluated using tests from the DHARMa package (see Table 4).

m1.1_temp_sum <- seqwrap_summarise(m1.1_temp, verbose = FALSE)

Table 3: The first ten rows from summarised preliminary models (model summaries).

target	effect	component	term	estimate	std.error	statistic	p.value
Axin2	fixed	cond	(Intercept)	5.438	0.048	113.314	0
Axin2	fixed	cond	treatmentSenolytic	−0.199	0.069	−2.897	0.004
Axin2	fixed	cond	surgeryOverload	−0.27	0.069	−3.939	8.177 × 10⁻⁵
Axin2	fixed	cond	treatmentSenolytic:surgeryOverload	0.174	0.097	1.788	0.074
Brat1	fixed	cond	(Intercept)	4.99	0.051	98.437	0
Brat1	fixed	cond	treatmentSenolytic	−0.048	0.071	−0.67	0.503
Brat1	fixed	cond	surgeryOverload	0.209	0.068	3.069	0.002
Brat1	fixed	cond	treatmentSenolytic:surgeryOverload	0.003	0.096	0.03	0.976
Cav2	fixed	cond	(Intercept)	6.072	0.047	128.043	0
Cav2	fixed	cond	treatmentSenolytic	0.072	0.067	1.079	0.281

Table 4: The first ten rows from summarised preliminary models (model evaluations). Values are p-values from the DHARMa package (see here for details).

target	uniformity	dispersion	outliers
Axin2	0.9670810	0.672	1
Brat1	0.8040232	0.600	1
Cav2	0.9781748	0.776	1
Ccnd2	0.9163917	0.392	1
Cdc45	0.9087045	0.720	1
Cox5a	0.4403266	0.624	1
Dlat	0.8195092	0.488	1
Drp2	0.7819079	0.352	1
Fer	0.9533689	0.848	1
Fgf6	0.9384286	0.456	1

Specifying summary and evaluation function

We can specify custom functions for evaluating the model object. seqwrap allows for two slots (summaries and evaluations). Each function should take the model object as an argument, and for compatibility with seqwrap_summarise, return a data frame. For our next step, we would like to retrieve the dispersion parameter from each model. We will place this parameter in the evaluation function. We can extract the dispersion parameter on the log scale from the glmmTMB model object. We will construct a function around the extraction and include the observed counts from the data:

dispersion_fun <- function(x) {
  
    out <- data.frame(
      dispersion =  data.frame(summary(x$sdr))["betadisp",1],
      dispersion.se = data.frame(summary(x$sdr))["betadisp",2],
      log_mu = log(mean(x$frame$y))
      )
  
    return(out)
  
}

Using our preliminary models, we can validate the function for one model.

dispersion_fun(m1.1_temp@models[[1]])

  dispersion dispersion.se   log_mu
1   5.043444     0.5786351 5.228967

The output is, as expected, a data frame with three columns. The dispersion parameter is expected to be on the log scale. To validate this we can directly access the estimated dispersion parameter from the model object using sigma(), which return the dispersion parameter on the natural scale.

log(sigma(m1.1_temp@models[[1]]))

[1] 5.043444

Running the full model

When we are happy with preliminary performance on a subset of targets, we can initialize seqwrap to iterate over all targets. We are updating the previous model with the new evaluation function directly in seqwrap. Notice that we also remove the subset and set return_models to FALSE. Saving all model objects will be too memory-consuming.

We should expect some errors from modelling many thousands of targets. As seen in the note from seqwrap, some models are associated with warnings from the modelling algorithm and/or errors from the evaluation function.

m1.2_results <- seqwrap(
  m1.1,
  eval_fun = dispersion_fun,
  cores = n_cores,
  return_models = FALSE,
  verbose = FALSE
)

Errors can be inspected by accessing the errors slot. From the inspection below, warnings are at least to some extent related to model convergence.

warnings <- m1.2_results@errors|> 
  # filter the non-null elements
  filter(purrr::map_lgl(warnings_fit, ~ !is.null(.x[[1]]))) |> 
  # save in object
  pull(warnings_fit)

warnings[1]

$`1010001B22Rik`
<simpleWarning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')>

Using target data for prior information in `glmmTMB`

A major benefit of canonical R packages for RNA-seq data modelling is the sharing of information between targets through shrinkage and empirical Bayes strategies. We suggest using priors in glmmTMB as convenient way of flexibly regularize models using an empirical Bayes strategy. In glmmTMB, the priors argument can be used to set priors on parameters. We will use the estimated parameters from the first model to set up the priors.

First, we need to extract the parameters. From the evaluation function we get the average log count and the dispersion parameter estimate. From the summaries, we will get the other parameters (summarized in Figure 1).

m1.2_sum <- seqwrap_summarise(m1.2_results, verbose = FALSE)

Figure 1: Estimates from modelling the Dungan data. The dispersion trend is visualized using the default `ggplot`2::geom_smooth`.

Except for the dispersion values, we can easily summarize parameter values by their mean and standard deviation. A more elaborate summary of the dispersion values could include an estimate of the mean (log) dispersion, conditional on the average log counts. As the dispersion prior will be slightly different for each target (based on the observed counts), we will combine priors into target-specific data frames and store them as a list. seqwrap will make target-specific list elements available in each iteration, making it possible to build the prior data frame for glmmTMB.

# Summarise parameters from the main output
param_sum <- m1.2_sum$summaries |> 
  summarise(.by = term, 
            m = mean(estimate), 
            s = sd(estimate)) 

# Fit a model to the log dispersion values
trend_model <- loess(dispersion ~ log_mu,
                         data = m1.2_sum$evaluations,
                         span = 0.7)


# Predict dispersion for each gene based on log raw counts
# and combine into a prior.
dispersion_prior <- data.frame(gene =  counts[,1],
                               pred = round(
                                 predict(trend_model,
                                         newdata = data.frame(
                                           log_mu =  log(rowMeans(counts[,-1]))
                                           )
                                   ), 3),
                                 s = round(
                                   trend_model$s, 3)
  ) |>
    mutate(prior = paste0("normal(", pred, ",", s, ")"))

# Combine parameters in a data frame...
Priors_df <- data.frame(prior = paste0("normal(",
                                       round(param_sum$m, 2),
                                       ",",
                                       round(param_sum$s,2 ), 
                                       ")"),
                        class = rep("fixef", 4),
                        coef = param_sum$term)


# ... extract a prediction for the mean dispersion parameter per target
# and combine with other parameters  
Priors_list <- list()
for( j in 1:nrow( counts )) {

    df <- bind_rows(Priors_df,
                    data.frame(
                      prior =
                        dispersion_prior[dispersion_prior$gene ==
                                           counts[j,1],4],
                      class = "fixef_disp",
                      coef = "1"
                    )
    )

    Priors_list[[j]] <- df

}

Again, we will combine data, metadata, and arguments using seqwrap_compose. We are now also adding the priors to the arguments. The variables used in the priors data frame are made available from the target-specific list referenced in targetdata = Prior_list. Using arguments = alist(... avoids evaluation of the list so that the data frame is not evaluated before being used in each iteration. Priors have been built corresponding what is expected by glmmTMB (see Priors in glmmTMB).

m1.3 <- seqwrap_compose(
  modelfun = glmmTMB::glmmTMB,
  data = counts, 
  metadata = metadata, 
  eval_fun = dispersion_fun, 
  arguments =  alist(
    formula = y ~ treatment * surgery + offset(efflibsize), 
    family = glmmTMB::nbinom2,
      priors = data.frame(
        prior = prior,
        class = class,
        coef = coef)), 
  targetdata = Priors_list
)


m1.3_temp <- seqwrap(
  m1.3, 
  subset = 1:24, 
  return_models = TRUE, 
  verbose = FALSE
)


summary(m1.3_temp@models[[1]])

 Family: nbinom2  ( log )
Formula:          y ~ treatment * surgery + offset(efflibsize)
Data: list(y = c(194L, 186L, 217L, 183L, 168L, 180L, 160L, 158L, 183L,  
198L, 183L, 183L, 168L, 200L, 176L, 254L, 232L, 186L, 169L, 154L 
), treatment = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,  
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = c("Vehicle",  
"Senolytic"), class = "factor"), surgery = structure(c(2L, 2L,  
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,  
1L, 1L), levels = c("Sham", "Overload"), class = "factor"), efflibsize = c(0.166155699457707,  
0.196470816938507, 0.0682347925124099, 0.0387654466787985, 0.0392563629230484,  
-0.0785846626255147, -0.225337792488665, -0.000892800687716758,  
0.000892004305606193, -0.0490532038569703, 0.144822834668174,  
-0.0984341203861923, -0.0169305794837407, 0.049199853171865,  
0.100216656584421, 0.0604222716589449, -0.167592705247281, -0.28188788975239,  
-0.182898895921145, -0.191409863872739))

      AIC       BIC    logLik -2*log(L)  df.resid 
    192.3     197.3     -91.2     182.3        15 


Dispersion parameter for nbinom2 family ():  149 

Conditional model:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         5.40834    0.04526  119.50  < 2e-16 ***
treatmentSenolytic                 -0.14000    0.05998   -2.33 0.019601 *  
surgeryOverload                    -0.22861    0.06375   -3.59 0.000335 ***
treatmentSenolytic:surgeryOverload  0.09437    0.08213    1.15 0.250533    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Priors:
fixef((Intercept)) ~ normal(5.81, 1.51) 
fixef(treatmentSenolytic) ~ normal(0.02, 0.14) 
fixef(surgeryOverload) ~ normal(0.12, 0.59) 
fixef(treatmentSenolytic:surgeryOverload) ~ normal(0, 0.16) 
fixef_disp(1) ~ normal(5.37, 1.729)

Running the prototyping seems to work, we will proceed with the full data set.

m1.3_results <- seqwrap(
  m1.3,
  cores = n_cores,
  return_models = FALSE,
  verbose = FALSE
)

m1.3_sum <- seqwrap_summarise(m1.3_results, verbose = FALSE)

Regularization increases the number models without convergence issues, in fact all 12273 targets have evaluations and summaries.

The summaries data frame is easily manipulated in subsequent analysis steps to, e.g., adjust p-values and create basic visualizations of the results (Figure 2).

Figure 2: A visualization of results from the regularized model of the Dungan et al. data.

The above example shows how to build models with seqwrap. Importantly, we make no claims about the validity of the final model but urge the user to assess assumptions in their data and models as part of a principled workflow.

References

1. Dungan CM, Figueiredo VC, Wen Y, VonLehmden GL, Zdunek CJ, Thomas NT, et al. Senolytic treatment rescues blunted muscle hypertrophy in old mice. GeroScience. 2022;44:1925–40.

2. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology. 2010;11:R25.

Specifying summary and evaluation function

Running the full model

Using target data for prior information in glmmTMB

References

Using target data for prior information in `glmmTMB`