This vignette demonstrates how to perform Gene Set Enrichment Analysis (GSEA) on PICRUSt2 predicted functional data using the ggpicrust2 package. GSEA is a powerful method for interpreting gene expression data by focusing on gene sets (pathways) rather than individual genes. In the context of microbiome functional prediction, GSEA can help identify pathways that are enriched in different conditions.
First, make sure you have the necessary packages installed:
# Install ggpicrust2
if (!requireNamespace("ggpicrust2", quietly = TRUE)) {
devtools::install_github("cafferychen777/ggpicrust2")
}
# Install required Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install(c("limma", "fgsea", "clusterProfiler", "enrichplot", "DOSE", "pathview"))
# Load the package
library(ggpicrust2)
library(dplyr)
library(ggplot2)The pathway_gsea() function supports multiple methods
for gene set enrichment analysis:
| Method | Type | Covariate Support | Description |
|---|---|---|---|
camera (default) |
Competitive | ✅ Yes | Recommended. Accounts for inter-gene correlations, providing more reliable p-values |
fry |
Self-contained | ✅ Yes | Fast rotation test, efficient for large gene set collections |
fgsea |
Preranked | ❌ No | Fast preranked GSEA, but p-values may be unreliable |
clusterProfiler |
Preranked | ❌ No | Traditional GSEA implementation |
Why use camera/fry? Wu et al. (2012) demonstrated
that preranked GSEA methods can produce “spectacularly wrong p-values”
due to not accounting for inter-gene correlations. The
camera and fry methods from limma address this
limitation.
Let’s start with a basic GSEA analysis using the recommended
camera method:
# Load example data
data(ko_abundance)
data(metadata)
# Prepare abundance data
abundance_data <- as.data.frame(ko_abundance)
rownames(abundance_data) <- abundance_data[, "#NAME"]
abundance_data <- abundance_data[, -1]
# Run GSEA analysis with camera (recommended)
gsea_results <- pathway_gsea(
abundance = abundance_data,
metadata = metadata,
group = "Environment",
pathway_type = "KEGG",
method = "camera",
min_size = 5,
max_size = 500,
p.adjust = "BH"
)
# View the top results
head(gsea_results)One of the most powerful features of the camera and
fry methods is the ability to adjust for confounding
variables. This is particularly important in microbiome studies where
factors like age, sex, BMI, and batch effects can influence results.
# Example with covariate adjustment
# Assuming metadata has columns: group, age, sex, BMI
gsea_results_adjusted <- pathway_gsea(
abundance = abundance_data,
metadata = metadata,
group = "Disease",
covariates = c("age", "sex", "BMI"), # Adjust for these confounders
pathway_type = "KEGG",
method = "camera"
)
# The results now reflect the group effect after adjusting for confounders
head(gsea_results_adjusted)For large datasets or when computational efficiency is important, use
the fry method:
For compatibility with existing workflows, preranked GSEA methods are still available:
# Note: Preranked methods don't account for inter-gene correlations
# P-values may be less reliable (Wu et al., 2012)
gsea_results_fgsea <- pathway_gsea(
abundance = abundance_data,
metadata = metadata,
group = "Environment",
pathway_type = "KEGG",
method = "fgsea",
rank_method = "signal2noise",
nperm = 1000,
min_size = 10,
max_size = 500,
p.adjust = "BH",
seed = 42
)
# View the top results
head(gsea_results_fgsea)To make the results more interpretable, we can annotate them with pathway names and descriptions:
The ggpicrust2 package provides several visualization options for
GSEA results. The visualize_gsea() function automatically
detects whether pathway names are available (from
gsea_pathway_annotation()) and uses them for better
readability, falling back to pathway IDs if names are not available.
The visualize_gsea() function offers flexible pathway
labeling:
# Option 1: Use raw GSEA results (shows pathway IDs)
plot_with_ids <- visualize_gsea(
gsea_results = gsea_results,
plot_type = "barplot",
n_pathways = 10
)
# Option 2: Use annotated results (automatically shows pathway names)
plot_with_names <- visualize_gsea(
gsea_results = annotated_results,
plot_type = "barplot",
n_pathways = 10
)
# Option 3: Explicitly specify which column to use for labels
plot_custom_labels <- visualize_gsea(
gsea_results = annotated_results,
plot_type = "barplot",
pathway_label_column = "pathway_name",
n_pathways = 10
)
# Compare the plots
plot_with_ids
plot_with_names
plot_custom_labelsRidge plots (also known as joy plots) provide a powerful way to visualize the distribution of gene abundances or fold changes within enriched pathways. This helps understand whether pathways are predominantly up- or down-regulated.
# Create a ridge plot for GSEA results
# Note: Requires ggridges package to be installed
ridge_plot <- pathway_ridgeplot(
gsea_results = gsea_results,
abundance = abundance_data,
metadata = metadata,
group = "Environment",
pathway_type = "KEGG",
n_pathways = 10,
sort_by = "p.adjust",
show_direction = TRUE,
colors = c("Down" = "#3182bd", "Up" = "#de2d26")
)
# Display the plot
ridge_plotThe ridge plot shows: - Each pathway as a density ridge - Color indicates enrichment direction (Up = red, Down = blue) - The distribution of log2 fold changes for genes within each pathway - A vertical dashed line at 0 for reference
It can be informative to compare the results from GSEA with those from Differential Abundance Analysis (DAA):
# Run DAA analysis
daa_results <- pathway_daa(
abundance = abundance_data,
metadata = metadata,
group = "Environment",
daa_method = "ALDEx2"
)
# Annotate DAA results
annotated_daa_results <- pathway_annotation(
pathway = "KO",
daa_results_df = daa_results,
ko_to_kegg = TRUE
)
# Compare GSEA and DAA results
comparison <- compare_gsea_daa(
gsea_results = annotated_results,
daa_results = annotated_daa_results,
plot_type = "venn",
p_threshold = 0.05
)
# Display the comparison plot
comparison$plot
# View the comparison results
comparison$resultsGene Set Enrichment Analysis provides a complementary approach to Differential Abundance Analysis for interpreting PICRUSt2 predicted functional data. By focusing on pathways rather than individual features, GSEA can help identify biologically meaningful patterns that might be missed by traditional methods.
Use camera method (default) for
most analyses - it provides more reliable p-values by accounting for
inter-gene correlations
Adjust for covariates when you have confounding factors like age, sex, or BMI in your study design
Use fry method when you need faster
computation for large gene set collections
Use preranked methods (fgsea) only
when you need compatibility with existing workflows, and interpret
p-values with caution
The ggpicrust2 package now offers a comprehensive suite of tools for both DAA and GSEA analyses, making it easier to gain insights from microbiome functional prediction data.