Introduction

While there are many resources describing the issues that arise from multicollinearity among independent variables, there’s a visualization of the problem that I came across in a book once and haven’t found replicated online. This post replicates that visualization.

Set-up

Let \(Y\) be a response and \(X_1\) and \(X_2\) be the predictors, such that

\[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i \]

for individual \(i\).

For simplicity, let’s say that

\[ \beta_0 = 0, \quad \beta_1 = 1, \quad \beta_2 = 1. \]

No Collinearity

Here’s a simulation. We generate data with no correlation between \(X_1\) and \(X_2\):

reps <- 1000
n <- 100
save <- matrix(nrow = reps, ncol = 3)

for (i in 1:reps) {
  # Draw the predictors independently, so they are uncorrelated
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  # True model: beta0 = 0, beta1 = 1, beta2 = 1
  y <- x1 + x2 + rnorm(n)
  
  mod <- lm(y ~ x1 + x2)
  # Store b1, b2, and the observed correlation between x1 and x2
  save[i,] <- c(coef(mod)[-1], cor(x1, x2))
}

Now, we have an average correlation between \(X_1\) and \(X_2\) of -0.001 across replications. We then plot \(b_1\) and \(b_2\) (the estimates of \(\beta_1\) and \(\beta_2\)) against each other.
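
As a rough sketch of that summary and plot, assuming the save matrix from the loop above (columns 1 and 2 hold \(b_1\) and \(b_2\), column 3 holds the per-replication correlation); the base-graphics call is just one way to draw it:

# Average correlation between x1 and x2 across the 1000 replications
mean(save[, 3])

# Scatter the slope estimates against each other
plot(save[, 1], save[, 2],
     xlab = expression(b[1]), ylab = expression(b[2]))

With uncorrelated predictors, we’d expect the cloud of \((b_1, b_2)\) points to show no systematic tilt.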