# Introduction

While there are many resources out there describing the issues that arise from multicollinearity among independent variables, there's a visualization of the problem that I came across in a book once and haven't found replicated online. This post replicates that visualization.

# Set-up

Let \(Y\) be a response and \(X_1\) and \(X_2\) be the predictors, such that

\[
Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i
\]

for individual \(i\).

For simplicity, let's say that

\[
\beta_0 = 0,\\
\beta_1 = 1,\\
\beta_2 = 1.
\]

I carry out a simulation by generating 1,000 data sets with a given correlation between the predictors, fitting the model to each, and saving the estimated coefficients.

```
reps <- 1000  # number of simulated data sets
n <- 100      # observations per data set

# Columns: estimated beta_1, estimated beta_2, cor(x1, x2)
save <- matrix(nrow = reps, ncol = 3)

for (i in 1:reps) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y <- x1 + x2 + rnorm(n)  # true coefficients are both 1
  mod <- lm(y ~ x1 + x2)
  save[i, ] <- c(coef(mod)[-1], cor(x1, x2))
}
```

The line `x2 <- rnorm(n)` gets replaced with `x2 <- x1 + rnorm(n, sd = _)`, where the `_` is replaced with different values to induce more correlation between `x1` and `x2`.
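
Since \(X_1\) has unit variance, this construction gives a closed form for the induced correlation: \(\operatorname{cor}(X_1, X_2) = 1/\sqrt{1 + \text{sd}^2}\). A small helper (my own, purely illustrative) inverts that formula to pick an `sd` for a target correlation:

```
# cor(x1, x2) = 1 / sqrt(1 + sd^2) when x1 ~ N(0, 1) and
# x2 = x1 + rnorm(n, sd = sd); invert to solve for sd.
sd_for_cor <- function(r) sqrt(1 / r^2 - 1)
sd_for_cor(0.5)  # ~1.73
sd_for_cor(0.9)  # ~0.48
```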

Following these simulations, the estimated coefficients are plotted against each other.
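
A minimal sketch of that plotting step in base R graphics (the styling here is my assumption; the original figures may have been drawn differently):

```
# Scatter the estimated coefficients from all simulations,
# marking the true value (1, 1) with a red dot.
plot(save[, 1], save[, 2],
     xlab = expression(hat(beta)[1]),
     ylab = expression(hat(beta)[2]))
points(1, 1, col = "red", pch = 19)
```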

# Simulation

## No Collinearity

The average correlation between \(X_1\) and \(X_2\) is 0.007.
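
That figure comes from averaging the stored correlations, e.g.:

```
mean(save[, 3])  # average cor(x1, x2) across the 1,000 simulations
```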

The red dot represents the true coefficients. We see no relationship between the estimated coefficients, and each is well centered around the truth.

## Moderate Collinearity of .5

The average correlation between \(X_1\) and \(X_2\) is 0.556.
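
Using the formula above, an `sd` of 1.5 gives a theoretical correlation of \(1/\sqrt{1 + 1.5^2} \approx .555\), which is consistent with the average reported here, though that specific value is my assumption rather than one stated in the text:

```
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 1.5)  # illustrative sd; theoretical cor ~ 0.555
cor(x1, x2)
```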