Introduction

While there are many resources describing the issues that arise from multicollinearity among independent variables, there's a visualization of the problem that I came across in a book once and haven't found replicated online. This post replicates that visualization.

Set-up

Let \(Y\) be a response and \(X_1\) and \(X_2\) be the predictors, such that

\[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i \]

for individual \(i\).

For simplicity, let’s say that

\[ \beta_0 = 0, \quad \beta_1 = 1, \quad \beta_2 = 1. \]

I carry out the simulation by generating 1,000 data sets with a given correlation between the predictors and recording the estimated coefficients from each fit.

reps <- 1000  # number of simulated data sets
n <- 100      # observations per data set
save <- matrix(nrow = reps, ncol = 3)  # columns: beta1-hat, beta2-hat, cor(x1, x2)

for (i in 1:reps) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y <- x1 + x2 + rnorm(n)  # true model: beta0 = 0, beta1 = beta2 = 1

  mod <- lm(y ~ x1 + x2)
  save[i, ] <- c(coef(mod)[-1], cor(x1, x2))  # drop the intercept, keep the slopes
}

The line x2 <- rnorm(n) gets replaced with x2 <- x1 + rnorm(n, sd = _), where the _ is replaced with different values to induce more correlation between x1 and x2.
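As an aside, since x1 has unit variance and the added noise is independent, the population correlation under this construction is \(1/\sqrt{1 + \text{sd}^2}\). A small helper (the name sd_for_cor is mine, for illustration) inverts that relationship to pick the sd for a target correlation:

# Under x2 = x1 + rnorm(n, sd = s) with x1 ~ N(0, 1),
# cor(x1, x2) = 1 / sqrt(1 + s^2); solve for s given a target rho
sd_for_cor <- function(rho) sqrt(1 / rho^2 - 1)

sd_for_cor(0.5)  # ~1.73, so sd = 1.73 targets a correlation of 0.5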

Following these simulations, the estimated coefficients are plotted against each other.
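For concreteness, here is a minimal sketch of that plot using base graphics and the save matrix from above (the plots in this post may have been drawn differently):

plot(save[, 1], save[, 2],
     xlab = expression(hat(beta)[1]),
     ylab = expression(hat(beta)[2]))
points(1, 1, col = "red", pch = 19)  # red dot: the true (beta1, beta2) = (1, 1)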

Simulation

No Collinearity

The average correlation between \(X_1\) and \(X_2\) is 0.007.
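This average is, presumably, just the mean of the third column of save:

mean(save[, 3])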

The red dot represents the true coefficients. We see no relationship between the estimated coefficients, and each is well centered around the truth.

Moderate Collinearity of .5

The average correlation between \(X_1\) and \(X_2\) is 0.556.