# Introduction

While there are many resources out there describing the issues that arise from multicollinearity among independent variables, there's a visualization of the problem that I came across in a book once and haven't found replicated online. This post replicates that visualization.

# Set-up

Let $$Y$$ be a response and $$X_1$$ and $$X_2$$ be the predictors, such that

$$Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i$$

for individual $$i$$.

For simplicity, let’s say that

$$\beta_0 = 0, \quad \beta_1 = 1, \quad \beta_2 = 1.$$

# No Collinearity

Here’s a simulation. We generate data with no correlation between $$X_1$$ and $$X_2$$:

```r
reps <- 1000  # number of simulated datasets
n <- 100      # observations per dataset
save <- matrix(nrow = reps, ncol = 3)  # holds b1, b2, and cor(x1, x2)

for (i in 1:reps) {
  x1 <- rnorm(n)           # predictors drawn independently
  x2 <- rnorm(n)
  y <- x1 + x2 + rnorm(n)  # true model: beta0 = 0, beta1 = beta2 = 1

  mod <- lm(y ~ x1 + x2)
  save[i, ] <- c(coef(mod)[-1], cor(x1, x2))  # [-1] drops the intercept
}
```
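The near-zero average correlation can be checked directly. Here's a compact, self-contained rerun of the simulation above (with a seed and smaller `reps` added for reproducibility and speed; the column names are my own labels, not from the original):

```r
set.seed(42)  # seed added for reproducibility; not in the original
reps <- 200
n <- 100
save <- matrix(nrow = reps, ncol = 3,
               dimnames = list(NULL, c("b1", "b2", "cor_x1x2")))

for (i in 1:reps) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y <- x1 + x2 + rnorm(n)
  mod <- lm(y ~ x1 + x2)
  save[i, ] <- c(coef(mod)[-1], cor(x1, x2))
}

# average sampled correlation between X1 and X2: close to zero,
# and the slope estimates average close to their true value of 1
round(mean(save[, "cor_x1x2"]), 3)
round(colMeans(save[, c("b1", "b2")]), 3)
```

Since the predictors are drawn independently, any nonzero sample correlation is pure noise, and averaging over replications washes it out.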

Across the replications, the average correlation between $$X_1$$ and $$X_2$$ is -0.001. When we plot $$b_1$$ and $$b_2$$ (the estimates of $$\beta_1$$ and $$\beta_2$$) against each other,