Compared to the linear model, one advantage of the generalized linear model is its ability to model many different relationships between the response variable and the predictors. One challenge is knowing which link function to use. In this vignette, we will explore how different links affect correlation and the visual appearance of scatter plots.
For the identity link, the underlying model is \[Y = \beta_2X_2 + \beta_1X_1 + \beta_0\] Note there is no need to rearrange for Y because the link is the identity function.
For the inverse link, the underlying model is \[ 1/Y = \beta_2X_2 + \beta_1X_1 + \beta_0 \]
Rearranging for Y, we get \[ Y = 1 / (\beta_2X_2 + \beta_1X_1 + \beta_0) \] The relationship between Y and X differs between the two models. This is the beauty of the glm framework: it can handle many different relationships between Y and X.
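As a quick numeric illustration with hypothetical coefficients (say \(\beta_0 = 1\), \(\beta_1 = 1\), \(\beta_2 = 3\), chosen only for this example), the same linear predictor gives very different responses under the two links:
eta <- 3 * 1 + 1 * 1 + 1 # hypothetical linear predictor at X1 = 1, X2 = 1
eta # identity link: Y = 5
1 / eta # inverse link: Y = 0.2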
First, let's generate data.
library(GlmSimulatoR)
library(ggplot2)
library(stats)
set.seed(1)
simdata <- simulate_gaussian(
  N = 1000, weights = c(1, 3), link = "inverse",
  unrelated = 1, ancillary = .005
)
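To see what was generated, let's sketch a histogram of the response (a minimal sketch; the bin count is an arbitrary choice):
ggplot(simdata, aes(x = Y)) +
  geom_histogram(bins = 30)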
The response is Gaussian.
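Next, a scatter plot of Y against X1:
ggplot(simdata, aes(x = X1, y = Y)) +
  geom_point()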
The connection between Y and X1 is not obvious; there is only a slight downward trend. One might mistake X1 for an unrelated variable.
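The same plot for X2:
ggplot(simdata, aes(x = X2, y = Y)) +
  geom_point()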
There is a connection between Y and X2. No surprise as the true weight is three.
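And for the unrelated column (named Unrelated1, matching the model calls below):
ggplot(simdata, aes(x = Unrelated1, y = Y)) +
  geom_point()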
The scatter plot between the unrelated variable and Y looks like random noise. Interestingly, X1's scatter plot looks more like this one than like X2's, despite X1 being included in the model.
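To quantify what the plots suggest, one option is the correlation matrix:
cor(simdata)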
The correlation between Y and X2 is very strong. This is no surprise considering the above graph. The correlation between Y and X1 is somewhat larger in absolute value than the unrelated variable's.
Pretending the correct model is unknown, let's try to find it. Three models are built: one with just X2, one with X1 and X2, and one with everything. Will the correct model stand out?
glm_inverse_x2 <- glm(Y ~ X2,
  data = simdata,
  family = gaussian(link = "inverse")
)
glm_inverse_x1_x2 <- glm(Y ~ X1 + X2,
  data = simdata,
  family = gaussian(link = "inverse")
)
glm_inverse_x1x2u1 <- glm(Y ~ X1 + X2 + Unrelated1,
  data = simdata,
  family = gaussian(link = "inverse")
)
summary(glm_inverse_x2)$aic
#> [1] -7245.43
summary(glm_inverse_x1_x2)$aic # correct model
#> [1] -7647.507
summary(glm_inverse_x1x2u1)$aic
#> [1] -7645.924
The correct model has the lowest AIC.
Above, we saw that the identity link assumes an additive relationship between Y and X. \[Y = \beta_2X_2 + \beta_1X_1 + \beta_0\]
For the log link, the underlying model is \[\ln(Y) = \beta_2X_2 + \beta_1X_1 + \beta_0\]
Rearranging for Y, we get \[ Y = \exp (\beta_2X_2 + \beta_1X_1 + \beta_0) \] Splitting up the exponential, we get \[ Y = \exp (\beta_2X_2) * \exp (\beta_1X_1) * \exp (\beta_0) \] Thus, for the log link, the relationship between Y and X is multiplicative rather than additive.
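A one-line numeric check of the identity \(\exp(a + b) = \exp(a)\exp(b)\) that drives this:
all.equal(exp(2 + 3), exp(2) * exp(3)) # TRUE: sums on the link scale are products on the response scale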
Again, let's generate data.
library(GlmSimulatoR)
library(ggplot2)
library(stats)
set.seed(1)
simdata <- simulate_gaussian(
  N = 1000, weights = c(.3, .8), link = "log",
  unrelated = 1, ancillary = 1
)
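As before, we can sketch the visual checks, this time in one go (histogram of Y plus the three scatter plots; the plotting choices are ours, not the package's):
# Histogram of the response
ggplot(simdata, aes(x = Y)) +
  geom_histogram(bins = 30)
# Y against each predictor and the unrelated column
ggplot(simdata, aes(x = X1, y = Y)) +
  geom_point()
ggplot(simdata, aes(x = X2, y = Y)) +
  geom_point()
ggplot(simdata, aes(x = Unrelated1, y = Y)) +
  geom_point()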
We see the response is somewhat Gaussian.
As before, the connection between Y and X1 is not obvious.
There is a connection between Y and X2. No surprise as the true weight is .8 on the log scale.
The scatter plot between the unrelated variable and Y looks like random noise.
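And the correlation matrix:
cor(simdata)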
Again, X2's correlation with Y is large. X1's is in a gray area, and the unrelated variable's is near zero.
Pretending the correct model is unknown, three models are built. This time, the link varies while the predictor variables are held constant.
glm_identity <- glm(Y ~ X1 + X2,
  data = simdata,
  family = gaussian(link = "identity")
)
glm_inverse <- glm(Y ~ X1 + X2,
  data = simdata,
  family = gaussian(link = "inverse")
)
glm_log <- glm(Y ~ X1 + X2,
  data = simdata,
  family = gaussian(link = "log")
)
summary(glm_identity)$aic
#> [1] 3032.25
summary(glm_inverse)$aic
#> [1] 3069.088
summary(glm_log)$aic # correct model.
#> [1] 2948.311
Again, the correct model has the lowest AIC.
Different links for the Gaussian distribution were explored here, but the Gaussian distribution is not a special case. Everything done above could be done for any distribution in the glm framework.
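For instance, here is a minimal sketch for the gamma distribution, assuming simulate_gamma() follows the same interface as simulate_gaussian() (check ?simulate_gamma for the exact arguments):
set.seed(1)
simdata_gamma <- simulate_gamma(
  N = 1000, weights = c(.3, .8), link = "log",
  unrelated = 1
)
# Fit the matching GLM with a gamma response and log link
glm_gamma <- glm(Y ~ X1 + X2,
  data = simdata_gamma,
  family = Gamma(link = "log")
)
summary(glm_gamma)$aic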