Abstract
FLASHMM is a package for single-cell differential expression analysis using linear mixed-effects models (LMMs). Mixed-effects models have become a powerful tool in single-cell studies due to their ability to model intra-subject correlation and inter-subject variability. However, classic LMM estimation methods face limitations in computational speed and memory usage when applied to large-scale single-cell transcriptomics data. The FLASHMM package provides a fast and scalable approach to address scalability and memory efficiency in LMM fitting. This vignette describes the methods for LMM estimation and hypothesis testing, as well as the R functions for implementing these methods, and demonstrates the use of the package through an example.Consider the linear mixed-effects model (LMM) as expressed below
(Searle, Casella, and
McCulloch 2006)
Here
Hartley and Rao (1967) developed the maximum
likelihood (ML) method for estimating the LMM parameters (fixed effects
and variance components). Patterson and Thompson
(1971) proposed a
modified maximum likelihood procedure which partitions the data into two
mutually uncorrelated parts, one being free of the fixed effects used
for estimating variance components, called restricted maximum likelihood
(REML) method. The REML estimator is unbiased. The ML estimator of
variance components is biased in general. With variance components,
Estimating the variance components by either ML or REML is a
numerical optimization problem. Various iterative methods based on the
log likelihood, called gradient methods, have been proposed (Searle, Casella, and
McCulloch 2006). The gradient methods are represented by the
iteration equation
The hypotheses for testing fixed effects and variance components can
be respectively defined as
Allowing the parameters of variance components to take negative value
avoids the zero boundary at the null hypothesis for the variance
components. Consequently, the asymptotic normality of the maximum
likelihood estimation at the null hypothesis holds under regularity
conditions, which enables us to use z-statistics or t-statistics for
hypothesis testing of the fixed effects and variance components. The
t-statistics for fixed effects are given by
We developed a summary statistics based algorithm for implementing
the gradient method
FLASHMM package provides two functions, lmm and
lmmfit, for fitting LMM. The lmm function uses summary
statistics as arguments. The lmmfit function is a wrapper
function of lmm, which directly uses cell-level data and
computes the summary statistics inside the function. The lmmfit
function is simple to be operated but it has a limitation of memory use.
For extremely large scale data, we can precompute the summary-level data
by
In summary, FLASHMM package provides the following main functions.
We use a simulated multi-sample multi-cell-type single-cell RNA-seq (scRNA-seq) dataset to illustrate how to utilize FLASHMM to perform single-cell differential expression analysis. In this example, we are interested in identifying the genes differentially expressed between two treatments (conditions) within a cell type.
We simulate the multi-sample multi-cell-type scRNA-seq data by simuRNAseq function in FLASHMM package. The simuRNAseq function can generate scRNA-seq data, with or without a reference dataset. The reference dataset contains count data and metadata. The count data is a genes-by-cells matrix. The metadata must be a data frame containing three columns: samples (subjects), cell-types (clusters), and treatments (experimental conditions), and the three columns must be named ‘sam’, ‘cls’, and ‘trt’, respectively.
For simplicity and demonstration purposes, we use a simulated reference dataset to generate the scRNA-seq data. First we simulate the reference dataset.
##Generate a reference dataset by simuRNAseq function.
set.seed(2502)
refdata <- simuRNAseq(nGenes = 1000, nCells = 10000)
##counts
counts <- refdata$counts
##metadata
metadata <- refdata$metadata
head(metadata)
#> sam cls trt libsize
#> Cell1 A5 9 A 1941
#> Cell2 A9 9 A 1951
#> Cell3 A3 8 A 2005
#> Cell4 A6 3 A 1906
#> Cell5 B8 5 B 1913
#> Cell6 A11 4 A 1985
rm(refdata)
For the simulated reference dataset, we don’t need to change the metadata column names because the columns of samples, clusters, and treatments, are already named ‘sam’, ‘cls’, and ‘trt’, respectively. But, for a real biological reference dataset, we need to change the metadata column names.
Next we use the reference counts and metadata to simulate scRNA-seq data that contains 100 genes and 100,000 cells from 25 samples (subjects) and 10 clusters (cell-types) with 2 treatments. There are 10 differentially expressed genes.
##Generate the scRNA-seq data by simuRNAseq function.
set.seed(2503)
dat <- simuRNAseq(counts, metadata = metadata, nGenes = 100, nCells = 100000, nsam = 25,
ncls = 10, ntrt = 2, nDEgenes = 10, minbeta = 0.5, maxbeta = 1, var.randomeffects = 0.1)
str(dat)
#> List of 5
#> $ ref.mean.dispersion:'data.frame': 1000 obs. of 2 variables:
#> ..$ mu : num [1:1000] 3.19 3.56 1.02 4.24 2.11 ...
#> ..$ dispersion: num [1:1000] 0.814 1.476 0.791 0.829 2.319 ...
#> $ metadata :'data.frame': 100000 obs. of 4 variables:
#> ..$ sam : chr [1:100000] "B10" "A2" "A10" "A3" ...
#> ..$ cls : chr [1:100000] "8" "6" "3" "1" ...
#> ..$ trt : chr [1:100000] "B" "A" "A" "A" ...
#> ..$ libsize: num [1:100000] 2292 2108 2176 1938 2183 ...
#> $ counts : num [1:100, 1:100000] 0 2 3 0 1 0 1 0 7 4 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:100] "Gene260" "Gene724" "Gene774" "Gene616" ...
#> .. ..$ : chr [1:100000] "Cell9224" "Cell3950" "Cell530" "Cell2039" ...
#> $ DEgenes :'data.frame': 10 obs. of 3 variables:
#> ..$ gene : chr [1:10] "Gene201" "Gene864" "Gene522" "Gene597" ...
#> ..$ beta : num [1:10] 0.977 0.819 -0.532 0.896 0.887 ...
#> ..$ cluster: chr [1:10] "1" "10" "2" "2" ...
#> $ treatment : chr "B"
rm(counts, metadata)
The simulated data contains a genes-by-cells matrix of expression counts and a data frame of metadata consisting of 3 columns: samples (sam), cell-types (cls) and treatments (trt).
We perform differential expression analysis of the simulated dataset using FLASHMM package. We are to identify the significantly differentially expressed genes between two treatments in a cell-type. The analyses involve following steps: LMM design, LMM fitting and hypothesis testing, exploring LMM fit and the differetially expressed (DE) genes.
LMM design: construct design matrices for fixed and
random effects described as
LMM fitting and hypothesis testing: We use either
lmm or lmmfit function to fit LMMs for all genes. With the cell-level
data matrices
In addition, the lmm and lmmfit functions perform hypothesis testing on the fixed effects (coefficients). We can also use the lmmtest function to conduct statistical tests on both the fixed effects and their contrasts for various comparisons between different levels.
LMM fitting returns a list of estimates of LMM parameters (coefficients and variance components), standard errors of the estimates, covariance matrix of the coefficients (fixed effects), and t-values and p-values for the hypothesis testing on the coefficients, for each gene. The LMM fitting also returns the number of iterations, the first partial derivatives of the log likelihood, and the log-likelihood at the termination of the iterations for each gene.
Exploring LMM fit and DE genes: We check if the LMM fitting is convergent, and perform hypothesis testing on the variance components of random effects. Then we identify DE genes based on hypothesis testing p-values for the coefficients (fixed effects). If the absolute first partial derivatives of log likelihood are all less than the convergence tolerance, the LMM fitting converges, otherwise it doesn’t converge. The genes for which LMM fitting doesn’t converge should be excluded in the subsequent analysis because the estimated coefficients for these genes are not reliable. DE genes can be identified by adjusting p-values using the false discovery rate (FDR) or family-wise error rate (Bonferroni correction). Additionally, we may exclude genes with small coefficients (effect size or log-fold change).
##Gene expression matrix, Y = log2(1 + counts)
Y <- log2(1 + dat$counts)
dat$counts <- NULL #Remove the counts.
##Design matrix for fixed effects
X <- model.matrix(~ 0 + log(libsize) + cls + cls:trt, data = dat$meta)
##Design matrix for random effects
Z <- model.matrix(~ 0 + as.factor(sam), data = dat$meta)
##Dimension of random effects
d <- ncol(Z)
rm(dat)
##Option 1: fit LMM by cell-level data.
max.iter <- 100
epsilon <- 1e-5
fit <- lmmfit(Y, X, Z, d = d, max.iter = max.iter, epsilon = epsilon)
##Option 2: fit LMM by summary-level data.
##(1) Compute the summary-level data.
n <- nrow(X)
XX <- t(X)%*%X
XY <- t(Y%*%X)
ZX <- t(Z)%*%X
ZY <- t(Y%*%Z)
ZZ <- t(Z)%*%Z
Ynorm <- rowSums(Y*Y)
rm(X, Y, Z) #release the memory.
##(2) Fit LMM.
fitss <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d,
max.iter = max.iter, epsilon = epsilon)
identical(fit, fitss) #The two LMM fits are identical.
#> [1] TRUE
rm(fitss)
##Testing coefficients (fixed effects)
test <- lmmtest(fit)
#head(test)
##The testing t-value and p-values are also provided in the LMM fit.
range(test - cbind(t(fit$coef), t(fit$t), t(fit$p))) #identical
#> [1] 0 0
#fit$t[, 1:4]
#fit$p[, 1:4]
fit$coef[, 1:4]
#> Gene260 Gene724 Gene774 Gene616
#> log(libsize) 0.823538186 0.9338158 0.84825642 0.47303182
#> cls1 -5.033788454 -5.4098141 -5.22008950 -3.01064722
#> cls10 -5.016216620 -5.4194900 -5.23064461 -2.97351422
#> cls2 -5.023813654 -5.4187129 -5.23977394 -2.96924442
#> cls3 -4.997721940 -5.3974915 -5.23458730 -2.97942173
#> cls4 -5.020303938 -5.4443891 -5.22523214 -2.98406521
#> cls5 -5.034040308 -5.4158073 -5.21912587 -2.96949879
#> cls6 -5.039858094 -5.4040391 -5.22563185 -3.01283214
#> cls7 -5.032770530 -5.4027480 -5.21182988 -2.96310244
#> cls8 -5.027633571 -5.4037420 -5.22567389 -2.99787326
#> cls9 -5.011484228 -5.3934569 -5.21931480 -2.97552877
#> cls1:trtB 0.042932556 -0.2901492 -0.05952280 0.11278682
#> cls10:trtB 0.027706615 -0.2490376 -0.03298227 0.05911209
#> cls2:trtB 0.043924185 -0.2725758 -0.06385726 0.06042628
#> cls3:trtB 0.034137125 -0.2807799 -0.03698257 0.04462700
#> cls4:trtB 0.048231220 -0.2272270 -0.01519083 0.08609152
#> cls5:trtB 0.040408456 -0.2817817 -0.05896284 0.05373212
#> cls6:trtB 0.075970031 -0.2511360 -0.03689410 0.11972056
#> cls7:trtB 0.029742613 -0.2712344 -0.07932668 0.06190330
#> cls8:trtB 0.021522916 -0.2660295 -0.04673630 0.07685644
#> cls9:trtB 0.005106265 -0.3013190 -0.08311758 0.06462308
The coefficients of the interaction term cls
contrast <- cbind("BvsA" = numeric(nrow(fit$coef)))
index <- grep(":", rownames(fit$coef))
contrast[index, ] <- 1/length(index)
##Test the contrast.
test <- lmmtest(fit, contrast = contrast)
head(test)
#> BvsA_coef BvsA_t BvsA_p
#> Gene260 0.03696820 0.4499201 0.65276901
#> Gene724 -0.26912699 -2.0665983 0.03877459
#> Gene774 -0.05135733 -0.4228609 0.67239767
#> Gene616 0.07398792 1.4665370 0.14250518
#> Gene581 0.05802063 0.7702806 0.44113530
#> Gene591 -0.06573230 -0.8597515 0.38992809
The contrast can also be constructed by contrast.matrix function as follows.
BvsA <- paste0(paste0("cls", 1:10, ":trtB"), collapse = "+")
BvsA <- paste0("(", BvsA, ")/10")
BvsA
#> [1] "(cls1:trtB+cls2:trtB+cls3:trtB+cls4:trtB+cls5:trtB+cls6:trtB+cls7:trtB+cls8:trtB+cls9:trtB+cls10:trtB)/10"
contrast <- contrast.matrix(contrast = c(BvsA = BvsA), model.matrix.names = rownames(fit$coef))
test <- lmmtest(fit, contrast = contrast)
head(test)
#> BvsA_coef BvsA_t BvsA_p
#> Gene260 0.03696820 0.4499201 0.65276901
#> Gene724 -0.26912699 -2.0665983 0.03877459
#> Gene774 -0.05135733 -0.4228609 0.67239767
#> Gene616 0.07398792 1.4665370 0.14250518
#> Gene581 0.05802063 0.7702806 0.44113530
#> Gene591 -0.06573230 -0.8597515 0.38992809
##(1) Check the genes for which LMM fitting converges.
gc <- (apply(abs(fit$dlogL), 2, max) < epsilon)
sum(gc)
#> [1] 100
##(2) Hypothesis testing for variance components:
## H0, theta </= 0 vs H1, theta > 0.
z <- fit$theta["var1", ]/fit$se.theta["var1", ]
p <- pnorm(z, lower.tail = FALSE)
sum(p < 0.05)
#> [1] 100
##(3) The DE genes specific to a cell-type
##Coefficients, t-values, and p-values for the genes specific to a cell-type.
index <- grep(":", rownames(fit$coef))
ce <- fit$coef[index, gc]
tv <- fit$t[index, gc]
pv <- fit$p[index, gc]
out <- data.frame(
gene = rep(colnames(ce), nrow(ce)),
cluster = rep(rownames(ce), each = ncol(ce)),
coef = c(t(ce)), t = c(t(tv)), p = c(t(pv)))
##Adjust p-values by FDR.
out$FDR <- p.adjust(out$p, method = "fdr")
##The DE genes with FDR < 0.05 and abs(logFC) > 0.5
out <- out[order(out$p), ]
rownames(out) <- NULL
out[(out$FDR < 0.05) & (abs(out$coef) > 0.5) , ]
#> gene cluster coef t p FDR
#> 1 Gene456 cls4:trtB 0.6159160 10.144524 3.600651e-24 3.600651e-21
#> 2 Gene201 cls1:trtB 0.7149779 8.625628 6.465361e-18 3.232680e-15
#> 3 Gene864 cls10:trtB 0.8716965 7.992823 1.332776e-15 4.442588e-13
#> 4 Gene597 cls2:trtB 0.7638176 7.737909 1.020050e-14 2.550125e-12
#> 5 Gene982 cls8:trtB 0.9979503 7.288876 3.148478e-13 6.296956e-11
#> 6 Gene499 cls8:trtB -0.6079898 -5.757651 8.554122e-09 1.425687e-06
#> 7 Gene699 cls7:trtB -0.5373567 -4.919868 8.674004e-07 1.239143e-04
#> 8 Gene632 cls4:trtB -0.5355741 -4.810256 1.509563e-06 1.886953e-04
#> 9 Gene378 cls9:trtB -0.6419090 -4.427605 9.538617e-06 1.059846e-03
##The true DE genes
#dat$DEgenes
##Fitting LMM by ML method
#fit <- lmmfit(Y, X, Z, d = d, method = "ML", max.iter = max.iter, epsilon = epsilon)
fit <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d, method = "ML",
max.iter = max.iter, epsilon = epsilon)
##The DE genes specific to a cell-type
##Coefficients, t-values, and p-values
index <- NULL
index <- grep(":", rownames(fit$coef))
ce <- fit$coef[index, gc]
tv <- fit$t[index, gc]
pv <- fit$p[index, gc]
out <- NULL
out <- data.frame(
gene = rep(colnames(ce), nrow(ce)),
cluster = rep(rownames(ce), each = ncol(ce)),
coef = c(t(ce)), t = c(t(tv)), p = c(t(pv)))
##Adjusting p-values by FDR
out$FDR <- p.adjust(out$p, method = "fdr")
##The DE genes with FDR < 0.05 and abs(logFC) > 0.5
out <- out[order(out$p), ]
rownames(out) <- NULL
out[(out$FDR < 0.05) & (abs(out$coef) > 0.5) , ]
#> gene cluster coef t p FDR
#> 1 Gene456 cls4:trtB 0.6159049 10.542755 5.661896e-26 5.661896e-23
#> 2 Gene201 cls1:trtB 0.7149782 8.979640 2.762145e-19 1.381073e-16
#> 3 Gene864 cls10:trtB 0.8717021 8.322877 8.691959e-17 2.897320e-14
#> 4 Gene597 cls2:trtB 0.7638143 8.058659 7.797845e-16 1.949461e-13
#> 5 Gene982 cls8:trtB 0.9979550 7.589598 3.236647e-14 6.473295e-12
#> 6 Gene499 cls8:trtB -0.6079984 -5.993079 2.066072e-09 3.443453e-07
#> 7 Gene699 cls7:trtB -0.5373524 -5.121801 3.031907e-07 4.331296e-05
#> 8 Gene632 cls4:trtB -0.5355670 -5.009191 5.475210e-07 6.844013e-05
#> 9 Gene378 cls9:trtB -0.6419070 -4.612034 3.992413e-06 4.436014e-04
##
sessionInfo()
#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Monterey 12.7.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Toronto
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] FLASHMM_1.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0 Matrix_1.7-3
#> [5] xfun_0.51 lattice_0.22-6 cachem_1.1.0 knitr_1.50
#> [9] htmltools_0.5.8.1 rmarkdown_2.29 lifecycle_1.0.4 cli_3.6.4
#> [13] grid_4.4.3 sass_0.4.9 jquerylib_0.1.4 compiler_4.4.3
#> [17] rstudioapi_0.17.1 tools_4.4.3 evaluate_1.0.3 bslib_0.9.0
#> [21] yaml_2.3.10 rlang_1.1.5 jsonlite_1.9.1 MASS_7.3-65
Building design matrices for fixed and random effects is a key step in LMM-based differential expression analysis. Including library size, a normalization factor for scRNA-seq, as a fixed effect can help reduce p-value inflation. If necessary, we can add principal components as fixed effects to further mitigate unknown batch effects.
In differential expression analysis, modeling samples (subjects) as random effects accounts for between-sample variability and within-sample correlation. If samples have different effects on gene expression, we can also model them as fixed effects. However, this can lead to overfitting when the number of samples (subjects) is large. For scRNA-seq data, this may also result in collinearity when the samples are nested within an experimental condition.