The linear genomic selection index (LGSI) is a linear combination of genomic estimated breeding values (GEBVs) used to predict the individual net genetic merit and select individual candidates from a non-phenotyped testing population as parents of the next selection cycle. In the LGSI, phenotypic and marker data from the training population are fitted into a statistical model to estimate all individual genome marker effects; these estimates can then be used in subsequent selection cycles to obtain GEBVs that are predictors of breeding values in a testing population for which there is only marker information.
Applying the LGSI requires predicting and ranking the net genetic merit of candidates for selection using only marker information. The combined LGSI (CLGSI), conversely, uses both phenotypic and GEBV information jointly to predict the net genetic merit. The CLGSI is typically used in the training populations to estimate true merit.
This vignette covers calculating LGSI and CLGSI using the
selection.index package.
To construct a valid LGSI, four essential conditions must be met: 1. All marker effects must be estimated simultaneously in the training population. 2. The estimated marker effects must be used in subsequent selection cycles to obtain GEBVs that act as predictors of the individual breeding values in the testing population (where there is only marker information). 3. The GEBV values must be composed entirely of the additive genetic effects. 4. Phenotypes must be used to estimate all marker effects in the training population, and not to make selections in the testing population (Heffner et al., 2009; Lorenz et al., 2011).
The objective of LGSI is to predict the net genetic merit \(H = \mathbf{w}'\mathbf{g}\), where \(\mathbf{g}\) is a vector of unobservable true breeding values, and \(\mathbf{w}\) is a vector of economic weights. Let \(\mathbf{\gamma}\) be the vector of known Genomic Estimated Breeding Values (GEBVs). The basic LGSI equation is:
\[ {I}_{\mathrm{G}}={\boldsymbol{\beta}}^{\prime}\boldsymbol{\gamma} \]
where \(\boldsymbol{\beta}\) is the unknown variable vector of optimal weights. For LGSI, it has been shown that maximizing the accuracy of \(\rho_{HI_G}\) results simply in \(\boldsymbol{\beta} = \mathbf{w}\).
The main advantage of the LGSI over the conventional Linear Phenotypic Selection Index (LPSI) lies in the possibility of reducing the intervals between selection cycles from roughly 4 years to 1.5 years in plant breeding. The selection response equation adjusts for cycle length \(L_G\):
\[ {R}_{I_G}=\frac{k_I}{L_G}\sqrt{{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \mathbf{w}} \] Where \(\boldsymbol{\Gamma}\) is the covariance matrix of the GEBVs.
The LGSI operates as a direct genomic extension of the LPSI theory, retaining the exact same statistical properties: 1. The variance of \(I_G\) (\(\sigma_{I_G}^2\)) and the covariance between \(H\) and \(I_G\) (\(\sigma_{HI_G}\)) are equal (\(\sigma_{I_G}^2 = \sigma_{HI_G}\)). 2. The maximized correlation between \(H\) and \(I_G\) (the LGSI accuracy) is equal to \(\rho_{HI_G} = \frac{\sigma_{I_G}}{\sigma_H}\). 3. The variance of the predicted error, \(Var(H - I_G) = (1 - \rho_{HI_G}^2)\sigma_H^2\), is minimal. 4. The total variance of \(H\) explained by \(I_G\) is \(\sigma_{I_G}^2 = \rho_{HI_G}^2 \sigma_H^2\).
To formalize the efficiency between the Genomic (LGSI) and Phenotypic (LPSI) selection index efficiency, we use the parameter \(\lambda_G\): \[ p_G=100\left(\lambda_G-1\right) \]
If \(p_G > 0\), LGSI efficiency is greater than LPSI efficiency; if \(p_G = 0\), both are equal, and if \(p_G < 0\), LPSI is more efficient. Additionally, accounting for the interval lengths between selection cycles, LGSI will be strictly more efficient than LPSI when: \[ L_G < \frac{\rho_{HI_G}}{h_I}L_P \]
Because the selection.index package does not perform
prediction modeling itself, the GEBVs must be calculated prior. In this
example, we generate GEBVs (\(\boldsymbol{\gamma}\)) using an unpenalized
Ridge Regression, where markers estimate trait values. For a rigorous
methodology it is acceptable to generate GEBVs using Bayesian Alphabets,
rrBLUP, or equivalent prediction models.
library(selection.index)
library(MASS) # For robust ridge regression implementation (lm.ridge)
# 1. Load the built-in synthetic maize datasets
data("maize_pheno")
data("maize_geno")
# Ensure markers are numeric matrices
X <- as.matrix(maize_geno)
# Ensure pheno lines up perfectly with geno
# (Note: maize_pheno contains 4 repetitions per genotype; we take the means)
Y_agg <- mean_performance(
data = maize_pheno[, c("Yield", "PlantHeight", "DaysToMaturity")],
genotypes = maize_pheno$Genotype,
replications = maize_pheno$Block
)
#> Warning in sqrt(EMS/r): NaNs produced
#> Warning in sqrt(2 * EMS/r): NaNs produced
#> Warning in sqrt(2 * EMS/r): NaNs produced
#> Warning in sqrt(EMS): NaNs produced
#> Warning in sqrt(EMS/r): NaNs produced
#> Warning in sqrt(2 * EMS/r): NaNs produced
#> Warning in sqrt(2 * EMS/r): NaNs produced
#> Warning in sqrt(EMS): NaNs produced
#> Warning in sqrt(EMS/r): NaNs produced
#> Warning in sqrt(2 * EMS/r): NaNs produced
#> Warning in sqrt(2 * EMS/r): NaNs produced
#> Warning in sqrt(EMS): NaNs produced
# Sort both datasets to ensure identical ordering
Y_agg <- Y_agg[order(Y_agg$Genotypes), ]
X <- X[order(rownames(X)), ]
# Ensure overlapping genotypes
common_genotypes <- intersect(Y_agg$Genotypes, rownames(X))
Y_agg <- Y_agg[Y_agg$Genotypes %in% common_genotypes, ]
X <- X[rownames(X) %in% common_genotypes, ]
# Select only multi-traits relevant to breeding
# Yield, PlantHeight, DaysToMaturity
Y <- as.matrix(Y_agg[, c("Yield", "PlantHeight", "DaysToMaturity")])We now calculate \(\boldsymbol{\gamma}\) (the
gebv_mat):
# Simulate Genomic Estimated Breeding Values (GEBVs) using Ridge Regression
# In best practice, you'd use cross-validation or a separate training/testing set
# We use lambda = 100 to handle the p >> n dimensionality problem
gebv_mat <- matrix(0, nrow = nrow(Y), ncol = ncol(Y))
colnames(gebv_mat) <- colnames(Y)
for (i in seq_len(ncol(Y))) {
# Fit a ridge regression model for trait `i` using markers `X`
model_ridge <- lm.ridge(Y[, i] ~ X, lambda = 100)
# Predict values using coef() which correctly un-scales the coefficients: Include intercept
betas <- coef(model_ridge)
intercept <- betas[1]
beta <- betas[-1] # Exclude intercept
# Calculate the GEBV for the trait
gebv_mat[, i] <- intercept + (X %*% beta)
}
head(gebv_mat, 3)
#> Yield PlantHeight DaysToMaturity
#> [1,] 8241.151 219.4007 115.8738
#> [2,] 7965.285 208.8754 116.4623
#> [3,] 7832.597 204.5494 116.4592With gebv_mat in hand, it is simple to calculate the
LGSI metrics using lgsi():
# 2. Compute Trait Covariances
pmat <- cov(Y) # Phenotypic Covariance (Approximation)
# In optimal practice, the true Genomic Covariance Matrix (\Gamma) is
# estimated using Restricted Maximum Likelihood (REML) utilizing both general
# phenotypic and genotypic information. Here, we simulate \Gamma as proportional
# to the phenotypic variance (assuming a high heritability correlation).
gmat <- pmat * 0.4 # Approximate Genotypic Covariance (Approximating \Gamma)
# 3. Define Economic Weights
weights <- data.frame(
Trait = c("Yield", "PlantHeight", "DaysToMaturity"),
Weight = c(5, -0.1, -0.1)
)
wmat <- weight_mat(weights)
# 4. Calculate Linear Genomic Selection Index (LGSI)
# For the testing population where we only use genomic values
lgsi_result <- lgsi(
gebv_mat = gebv_mat,
gmat = gmat,
wmat = wmat
)
# Output Summary
lgsi_result$summary
#> b.1 b.2 b.3 GA PRE hI2 rHI
#> 1 5 -0.1 -0.1 1718.838 171883.8 1 1The LGSI function output shows the calculated standard deviation (\(\sigma_I\)), index heritability (\(h^2\)), the overall net genetic advance (\(GA\)), and expected gains per trait (\(\Delta H\)) associated with genomic selection.
The Combined LGSI (CLGSI) developed by Dekkers (2007) is a slightly modified version of the LMSI, which, instead of using the raw marker scores, uses the GEBVs and the phenotypic information jointly to predict the net genetic merit.
The CLGSI model represents: \[ {I}_C={\boldsymbol{\beta}}_{\mathbf{y}}^{\prime}\mathbf{y}+{\boldsymbol{\beta}}_G^{\prime}\boldsymbol{\gamma} \]
Where \(\mathbf{y}\) is the predicted phenotype and \(\boldsymbol{\gamma}\) is the subset of genetic markers. As \(p \rightarrow \infty\), the true variance approaches the genomic predicted variance, and CLGSI becomes identical to LGSI. Because the CLGSI uses GEBVs and phenotypic information jointly, it can ideally complement prediction workflows inside training populations where field trials are simultaneously held.
Similar to LGSI, the CLGSI holds strict conditions ensuring theoretical validation constraints: 1. \(\sigma_{I_C}^2 = \sigma_{HI_C}\), meaning the variance of \(I_C\) and the covariance between \(H\) and \(I_C\) are unified. 2. The maximized correlation between \(H\) and \(I_C\) operates strictly as \(\rho_{HI_C} = \frac{\sigma_{I_C}}{\sigma_H}\). 3. The variance of the predicted error, \(Var(H - I_C) = (1 - \rho_{HI_C}^2)\sigma_H^2\), is mathematically minimal. 4. The total variance of \(H\) explained by \(I_C\) scales linearly via \(\sigma_{I_C}^2 = \rho_{HI_C}^2 \sigma_H^2\).
You can measure the CLGSI utilizing the clgsi() function
and feeding it both the observed phenotypes and estimated genomic
variants.
clgsi_result <- clgsi(
phen_mat = Y, # Observed phenotypic data
gebv_mat = gebv_mat, # Genomic Estimated Breeding Values
pmat = pmat, # Expected Phenotypic traits covariance
gmat = gmat, # Expected Genotypic traits covariance
wmat = wmat
)
clgsi_result$summary
#> b_y.1 b_y.2 b_y.3 b_g.1 b_g.2 b_g.3 GA PRE hI2
#> 1 -241.2056 -69.0278 -11711.7 301.3417 -12.6428 13421.05 9325.477 932547.7 1
#> rHI
#> 1 1Notice that the Combined LGSI has two sets of optimized unitless
\(\beta\) weightings—one for the
phenotype values (b_y), and one for the overall marker
value (b_g) per trait. Selection Response (\(R\)) represents the final expected genetic
aggregate improvement.
lm.ridge) to derive the Genomic Estimated Breeding Values
without singular/non-invertible covariance anomalies.