Visualizing Collinearity

While there are many resources describing the issues that arise from multicollinearity among independent variables, there's a visualization of the problem that I came across in a book once and haven't found replicated online. This post replicates that visualization.

Set-up

Let \(Y\) be a response and \(X_1\) and \(X_2\) be the predictors, such that

\[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i \]

for individual \(i\).

For simplicity, let’s say that

\[ \beta_0 = 0,\\ \beta_1 = 1,\\ \beta_2 = 1. \]

I carry out a simulation by generating 1,000 data sets with a specified correlation between the predictors and recording the estimated coefficients from each.

reps <- 1000   # number of simulated data sets
n <- 100       # observations per data set
save <- matrix(nrow = reps, ncol = 3)   # columns: beta1_hat, beta2_hat, cor(x1, x2)

for (i in 1:reps) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)            # independent of x1; replaced below to induce correlation
  y <- x1 + x2 + rnorm(n)   # beta0 = 0, beta1 = beta2 = 1

  mod <- lm(y ~ x1 + x2)
  save[i, ] <- c(coef(mod)[-1], cor(x1, x2))   # drop the intercept estimate
}

The line x2 <- rnorm(n) gets replaced with x2 <- x1 + rnorm(n, sd = _), where the _ is replaced with different values to induce more or less correlation between x1 and x2.
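
As a rough illustration (not from the original post), here is a sketch of how the sd value controls the correlation; the specific sd values below are arbitrary example choices:

# Sketch: smaller sd on the added noise yields higher correlation between x1 and x2.
# The sd values here are arbitrary illustrative choices.
set.seed(1)
n <- 100
x1 <- rnorm(n)
for (s in c(10, 1, 0.5, 0.1)) {
  x2 <- x1 + rnorm(n, sd = s)
  cat("sd =", s, " cor(x1, x2) =", round(cor(x1, x2), 3), "\n")
}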

Simulation Results

Each point represents the estimated coefficients for a single simulated data set. The red dot represents the data-generating coefficients \((1, 1)\). Note that bias is not a concern; the estimates average out to the true coefficients at each level of collinearity.

In this simulation, the average correlation between \(X_1\) and \(X_2\) is 0.002.

With low correlation, we see no relationship between the estimated coefficients.
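
The plot itself isn't reproduced here, but a minimal sketch of how it could be drawn from the save matrix, using base graphics (the original plotting code may have differed), is:

# Sketch of the plot described above: each point is one simulated data set's
# (beta1_hat, beta2_hat); the red dot marks the data-generating coefficients (1, 1).
plot(save[, 1], save[, 2],
     xlab = expression(hat(beta)[1]), ylab = expression(hat(beta)[2]))
points(1, 1, col = "red", pch = 19)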

Why is this a problem?

So this simulation shows that at correlations around .9 or higher between \(X_1\) and \(X_2\), there is a negative correlation between \(\hat{\beta}_1\) and \(\hat{\beta}_2\). Why is this a problem?

Consider the “extremely high correlation” results. With such high correlation, we have that \(X_1 \approx X_2\). We can use this approximate equality to rewrite the model:

\[ \begin{aligned} Y_i &= \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i\\ &\approx \beta_0 + (\beta_1 + \beta_2)X_{1i} + \epsilon_i\\ &\approx \beta_0 + (\beta_1 + \beta_2)X_{2i} + \epsilon_i \end{aligned} \]

In other words, the model only constrains \(\beta_1 + \beta_2 = 2\) (since we assumed above that both coefficients equal 1). So while all coefficient pairs satisfying this constraint would have essentially the same predictive power for \(Y\), they would have drastically different interpretations. For example, we could obtain \(\hat{\beta}_1 = -1\) and \(\hat{\beta}_2 = 3\), which not only over-emphasizes the relationship between \(X_2\) and \(Y\), but suggests an inverse relationship between \(X_1\) and \(Y\)!
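
As a quick numeric check of this argument (a sketch, not part of the original post; the sd of 0.01 is an arbitrary choice to make x1 and x2 nearly identical), coefficient pairs that sum to 2 produce essentially the same fitted values when the predictors are almost equal:

# Sketch: with x1 nearly equal to x2, the coefficient pairs (1, 1) and (-1, 3)
# give fitted values that differ only by a tiny amount.
set.seed(2)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # extremely high correlation
fit_11  <- 1 * x1 + 1 * x2       # predictions using (1, 1)
fit_alt <- -1 * x1 + 3 * x2      # predictions using (-1, 3)
max(abs(fit_11 - fit_alt))       # tiny difference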

This work is licensed under CC BY-NC 4.0.