This vignette demonstrates the methodology behind Constrained Linear Genomic Selection Indices (LGSI) based on Chapter 6 of the book Linear Selection Indices in Modern Plant Breeding. The objective of constrained selection indices is to maximize genetic advance for specific traits while imposing defined restrictions on other traits.
In a genomic selection context, these indices use Genomic Estimated Breeding Values (GEBVs). The core indices covered here include:
We will utilize synthetic phenotypic and genotypic maize data
provided in the selection.index package to illustrate these
methods.
Before applying these indices, we first estimate genomic breeding
values. We use Ridge Regression (lm.ridge) from the
MASS package as a rapid and simple method to simulate this
step linking marker data to phenotype data.
library(selection.index)
library(MASS) # For Ridge regression
# Load the maize datasets
data("maize_pheno", package = "selection.index")
data("maize_geno", package = "selection.index")
# Extract only the traits we care about to prevent missing value logic failures
traits <- c("Yield", "PlantHeight", "DaysToMaturity")
maize_pheno_clean <- na.omit(maize_pheno[, c("Genotype", "Block", traits)])
# Calculate mean performance for phenotypic data
# Providing only the traits to `data` to accurately generate the table
pheno_means_full <- mean_performance(
data = maize_pheno_clean[, traits],
genotypes = maize_pheno_clean$Genotype,
replications = maize_pheno_clean$Block,
method = "Mean"
)
# Remove the trailing 9 rows composed of summary text-stats automatically generated
pheno_means <- pheno_means_full[1:(nrow(pheno_means_full) - 9), ]
# Extract numerical phenotype matrix
Y <- as.matrix(pheno_means[, traits])
mode(Y) <- "numeric"
rownames(Y) <- pheno_means$Genotypes
# Format genotype matrix
X <- as.matrix(maize_geno)
# Ensure matching dimensions and align the matrices
common_genotypes <- intersect(rownames(Y), rownames(X))
# Filter and sort matrices to match genotypes sequence exactly
Y_filtered <- Y[common_genotypes, ]
X_filtered <- X[common_genotypes, ]
# Calculate GEBVs using a simple Ridge Regression model per trait
gebvs_sim <- matrix(0, nrow = nrow(Y_filtered), ncol = ncol(Y_filtered))
colnames(gebvs_sim) <- colnames(Y_filtered)
rownames(gebvs_sim) <- rownames(Y_filtered)
lambda_ridge <- 0.1
for (j in seq_len(ncol(Y_filtered))) {
# Fit ridge regression to simulate marker effects
model_ridge <- lm.ridge(Y_filtered[, j] ~ X_filtered, lambda = lambda_ridge)
beta_ridge <- coef(model_ridge)[-1]
# Predict GEBVs
gebvs_sim[, j] <- X_filtered %*% matrix(beta_ridge, ncol = 1)
}
# Define Covariance Matrices
Gamma <- cov(gebvs_sim) # Genomic Covariance Matrix estimated from simulated GEBVs
P_matrix <- cov(Y_filtered) # Phenotypic Covariance Matrix
C_matrix <- Gamma # Genetic Covariance (Assuming Gamma captures genetic additive cov)
# Define vector of economic weights for Yield, PlantHeight, DaysToMaturity
w <- matrix(c(5, -0.1, -0.1), ncol = 1)The RLGSI is applied in a testing population when only GEBVs are available. The goal is to maximize the response for some traits while restricting the expected generic advance of other traits to exactly zero.
The expected genetic advance for RLGSI is restricted by: \(Cov(I_{RG}, \mathbf{U}'\mathbf{g}) = \mathbf{U}'\boldsymbol{\Gamma}\boldsymbol{\beta}_{RG} = \mathbf{0}\), where \(\mathbf{U}'\) is a matrix of 1s and 0s identifying the restricted traits.
Following the maximization step of the selection response, the index coefficients \(\boldsymbol{\beta}_{RG}\) optimally resolve as: \[ \boldsymbol{\beta}_{RG} = \mathbf{K}_{G}\mathbf{w} \] where \(\mathbf{K}_{G} = [\mathbf{I} - \mathbf{Q}_{G}]\) and \(\mathbf{Q}_{G} = \mathbf{U} (\mathbf{U}' \boldsymbol{\Gamma} \mathbf{U})^{-1} \mathbf{U}' \boldsymbol{\Gamma}\).
Suppose we want to maximize the gain using our three traits, but we want to hold Plant Height and Days To Maturity completely fixed (\(0\) gain).
# Define the restriction matrix U
# We restrict traits 2 (PlantHeight) and 3 (DaysToMaturity).
# Trait 1 (Yield) is left unrestricted.
U_rlgsi <- matrix(c(
0, 0, # Yield unrestricted
1, 0, # PlantHeight restricted
0, 1 # DaysToMaturity restricted
), nrow = 3, byrow = TRUE)
# Calculate RLGSI
rlgsi_res <- rlgsi(Gamma = Gamma, wmat = w, U = U_rlgsi)
# View the resulting index coefficients and Expected Genetic Gain
cat("RLGSI Coefficients (beta_RG):\n")
#> RLGSI Coefficients (beta_RG):
print(rlgsi_res$b)
#> [1] 5.0000 -111.1516 175.9708
cat("\nExpected Genetic Gain per Trait:\n")
#>
#> Expected Genetic Gain per Trait:
print(rlgsi_res$Summary[, "Delta_H", drop = FALSE])
#> NULLNotice in the output that the Expected Genetic Gain
(Delta_H) for traits 2 and 3 evaluates to a precision limit
of \(0\), fulfilling the null-gain
restriction.
Instead of forcing a trait to remain fixed at \(0\), we might want a trait’s mean value to undergo a highly specific, predefined amount of selection shift.
Let \(\mathbf{d}' = [d_1, d_2, \ldots, d_r]\) be the proportional gain values requested by the breeder. Similar to the RLGSI, the coefficient array evaluates identically with a targeted constant proportionality factor restricting expected proportional shifts: \[ \boldsymbol{\beta}_{PG} = \mathbf{K}_{P}\mathbf{w} \]
Suppose we want to target Yield to proportionally increase by \(7.0\) units, and identically target Plant Height to slightly decrease by relatively \(-3.0\) proportion targets.
# Define the restriction matrix U for PPG
# Restrict traits 1 (Yield) and 2 (PlantHeight) to predetermined scalars
U_ppg <- matrix(c(
1, 0, # Yield restricted
0, 1, # PlantHeight restricted
0, 0 # DaysToMaturity unrestricted
), nrow = 3, byrow = TRUE)
# Define the predetermined values vector 'd'
d_vec <- matrix(c(7.0, -3.0), ncol = 1)
# Calculate PPG-LGSI
ppg_res <- ppg_lgsi(Gamma = Gamma, d = d_vec, wmat = w, U = U_ppg)
cat("PPG-LGSI Coefficients (beta_PG):\n")
#> PPG-LGSI Coefficients (beta_PG):
print(ppg_res$b)
#> [1] 0.2441 -11.1971 -0.1000The CRLGSI extends the basic RLGSI by integrating both phenotypic information and genomic breeding values structurally as blocks in a combined index \(I_C\). It is mainly applied to a training population where joint data is uniformly available.
Its mathematical structure models parallel coefficient distributions across phenotypic and genomic parameters uniformly: \[ \boldsymbol{\beta}_{CR} = \mathbf{K}_{C}\boldsymbol{\beta}_C \]
For an index combining phenotypes and GEBVs, restrictions must concurrently be assigned to both empirical dimensions globally.
# 1. Provide Combined Matrices
# The combined Phenotypic-Genomic covariance matrix T_C
T_C <- rbind(
cbind(P_matrix, Gamma),
cbind(Gamma, Gamma)
)
# The combined Genetic-Genomic covariance matrix Psi_C
Psi_C <- rbind(
cbind(C_matrix, Gamma),
cbind(Gamma, Gamma)
)
# 2. Define Restriction Matrix U_C
# Note: For combined indices of `t` traits, U_C spans exactly 2t rows.
# Restricting Trait 1 (Yield) on both empirical and genomic parameters:
U_crlgsi <- matrix(0, nrow = 6, ncol = 2)
U_crlgsi[1, 1] <- 1 # Trait 1 phenotypic component
U_crlgsi[4, 2] <- 1 # Trait 1 genomic component
# 3. Calculate CRLGSI
crlgsi_res <- crlgsi(T_C = T_C, Psi_C = Psi_C, wmat = w, U = U_crlgsi)
cat("CRLGSI Combined Coefficients (beta_CR):\n")
#> CRLGSI Combined Coefficients (beta_CR):
print(crlgsi_res$b)
#> [1] -6.843302e-05 -1.079215e-03 -5.103276e-02 2.017002e-03 -9.892084e-02
#> [6] -4.895374e-02The CPPG-LGSI combines the dynamic proportional gain vectors to the
\(T_C\) and \(\Psi_C\) data blocks identically leveraging
the cppg_lgsi() logic.
Suppose we arbitrarily evaluate Yield to predetermined restriction constants.
# Target Yield using predetermined combined proportions d
d_combined <- matrix(c(7.0, 3.5), ncol = 1)
# Calculate CPPG-LGSI dynamically
cppg_res <- cppg_lgsi(T_C = T_C, Psi_C = Psi_C, d = d_combined, wmat = w, U = U_crlgsi)
cat("CPPG-LGSI Coefficients (beta_CP):\n")
#> CPPG-LGSI Coefficients (beta_CP):
print(cppg_res$b)
#> [1] 4.491881e-06 7.083882e-05 3.349690e-03 4.998047e+00 -7.083498e-05
#> [6] -3.350577e-03Below we extract the standard metrics associated with each index model (i.e Selection Response \(R\)):
comparison_df <- data.frame(
Index = c("RLGSI", "PPG-LGSI", "CRLGSI", "CPPG-LGSI"),
Selection_Response = c(
rlgsi_res$R,
ppg_res$R,
crlgsi_res$R,
cppg_res$R
),
Overall_Genetic_Advance = c(
rlgsi_res$GA,
ppg_res$GA,
crlgsi_res$GA,
cppg_res$GA
)
)
print(comparison_df)
#> Index Selection_Response Overall_Genetic_Advance
#> 1 RLGSI 1490.8413294 1490.8413294
#> 2 PPG-LGSI 102.6128081 102.9592598
#> 3 CRLGSI 0.9390596 0.9390641
#> 4 CPPG-LGSI 2123.2848118 2123.2848118