1  Mixed Model Theory

When fitting a regression model, the most important assumption the models make (whether it’s linear regression or generalized linear regression) is that of independence - each row of your data set is independent on all other rows.

Now in general, this is almost never entirely true. If this violation is mild, it can be ignored. For example, if you give an exam to a class full of students, it’s reasonable to assume some students study together and therefore their answers on some questions (right or wrong) will tend to be similar.

Here we are more concerned with a structured violation of independence. The most straightforward situation in which this arises is repeated measures. Say you’re administering an experiment where you are testing stress to different stimuli and are measuring quantities like blood pressure or heart rate. If I were to take the test multiple times, my measurements are correlated with each other - if I tend to have higher blood pressure, my blood pressure will be higher in each round regardless of stimuli.

Another common repeated measures situation is follow-up over time. Say you track individuals who undergo some surgery, and follow-up with them every 2 months, asking them to take a survey on their quality of life. While there are many research questions which could be asked about such data, one might be to examine the patient’s quality of life depending on the various health markers they are experiencing. The surgery is targeting symptom A, is improvement in symptom A associated with better QoL? Is it symptom B which was unaffected by the surgery that has the largest effect on QoL? Obviously a given patient’s QoL and health markers at one time-point are highly correlated with each other.

Another situation where this arises which isn’t explicitly repeated measures is when your data is collected in some sort of clustered fashion. Note that if your data is collected via a complex survey design, this is an entirely different beast that needs to be addressed appropriately (with the svyset and svy set of commands). Instead I’m assuming a simpler set-up here. Say you are conducting a survey that asks participants to come into a lab with their family. If the unit of analysis is a person (instead of a family), it’s very likely that two individuals in the same household will have similar food habits. Therefore we gain less information by getting information on a new individual in an existing household, then adding a new individual from a new household.

The canonical example of this is students in classrooms in schools in districts. All the situations above are 2-level (defined precisely below), but here we have four levels. Again, adding a new student from an existing class/school/district will likely not add as much information as a new student from a new class in a new school in a new district.

To address the lack of independence, we will move from normal regression (linear or otherwise) into a mixed models framework, which models for this dependence structure. It does this (at the most basic level) by allowing each higher level unit to have it’s own intercept which we do not estimate.

1.1 Terminology

There are several different names for mixed models which you might encounter, that all fit essentially the same model:

  • Mixed model
  • Mixed Effects regression/model
  • Multilevel regression/model
  • Hierarchical regression/model (specifically HLM, hierarchical linear model)

The hierarchical/multilevel variations require thinking about the levels of the data and involves “nesting”, where one variable only occurs within another, e.g. family members nested in a household. The most canonical example of this is students in classrooms, we could have

  • Level 1: The lowest level, the students.
  • Level 2: Classroom or teacher (this could also be two separate levels of classrooms inside teacher)
  • Level 3: District
  • Level 4: State
  • Level 5: Country

This is taking it a bit far; it’s rare to see more than 3 levels, but in theory, any number can exist.

For this workshop, we will only briefly discuss this from hierarchical point of view, preferring the mixed models view (with the reminder again that they are the same!).

1.1.1 Econometric terminology

To make the terminology a bit more complicated, in econometrics, some of the terms we will use here are overloaded. When you are discussing mixed models with someone with econometric or economics training, it’s important to differentiate between the statistical terms of “fixed effects” and “random effects” which are the two components of a mixed model that we discuss below, and what econometricians called “fixed effects regression” and “random effects regression”.

Without going into the full details of the econometric world, what econometricians called “random effects regression” is essentially what statisticians called “mixed models”, what we’re talking about here. The Stata command xtreg handles those econometric models. I have a document which demostrates the equality of these, as well as explaing the ecoonometric “fixed effects regression” in terms of the statistical view of regression.

1.2 Wide vs Long data

Before you begin your analysis, you need to ensure that the data is in the proper format. The data can be either in wide-form or long-form. Long-format is sometimes called tall-form.

If the data is longitudinal (follows the same person over time, collecting data at intervals), the wide form would entail each row representing a single person, with the variables representing the questions at each wave. For example, if you were asking about income every 2 years, you’d have variables income14, income16, income18 representing the individuals income in 2014, 2016 and 2018.

The tall form would have each row of data represent a person and a year. So you’d have a column for ID, a column for year, and then, continuing the example above, a single variable income.

If the data is clustered in some sense, e.g. students in classroom, the wide form would have each row be a single classroom, and have a separate set of variables for the first student, second student, etc. In this format, unbalanced data (classrooms having different number of students) would result in a lot of missing values.

The tall form would have a single row per student, and a variable representing their class.

In a lot of situations, it is easier to collect and manage data in wide form. However, to fit a mixed model, we need the data in long format. We can use the reshape command to transform wide data to long. This is covered in my Introduction to Stata set of notes.

1.3 Level-1 versus Level-2 variables

Often you will have variables that measure something about the unit of analysis, and other variables which measure something about the grouping variable. For example, you may have a variable indicating the GPA of a student, another variable indicating the size of their classroom, and yet another variable indicating the percent minority in their school.

When thinking about this from the hierarchical view, it’s important to differentiate between these types of variables. If you take my advice and think of this from the mixed model point of view, the difference is irrelevant - each is just a variable associated with the student.

The only thing that matters is ensuring the data is correct - if you had 10 students in a class, and the variable sizeofclass was 10 for half them and 8 for the other half, Stata wouldn’t know this is an issue and would fit the model without complaining - but now you’ve got a data issue that could be affecting your analysis.

When the data is repeated measures over time, these are sometimes called time-variant (e.g. patient follow-ups, measured at each follow-up) or time-invariant (baseline characteristics or immutable demographics).

1.4 Theory

The equation for ordinal least squares (linear regression) is

\[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \cdots + \beta_pX_{pi} + \epsilon_i \]

where \(Y_i\) represents the response of individual \(i\), the various \(X_{ki}\) represent the predictor variables for the same respondent, the \(\beta_j\) are the coefficients to be estimated (which are constant across individuals). \(\epsilon_i\) is the additional error for this individual.

As mentioned above, there are two ways to think about mixed models - as a mixed model, or as a hierarchical model. Let’s talk about the mixed model first.

The most basic form of a mixed model (which is also the most commonly used form) modifies the regression model above by separating the error term \(\epsilon_i\) into a contribution to the error from each level of the data:

\[ Y_{ij} = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \cdots + \beta_pX_{pi} + \kappa_j + \epsilon_{ij} \]

The subscript notation helps us keep track of things. Here, I’ve set it up such that each observation \(i\) belongs to a group \(j\). The response belonging to individual \(i\) in group \(j\) is predicted based upon some \(X\) variables, plus some additive effect which is unique to group \(j\) (\(\kappa_j\)), and additional error unique to that individual ($_{ij}).1

Both error terms, \(\epsilon\) and \(\kappa\), are assumed to have a mean of 0. This is important here because it means that on average the random intercept has no affect, but varies from individual to individual.

The \(\beta\) terms are the fixed effects, and the \(\kappa\) error terms are the random effects.

The catch is that we do not estimate \(\kappa_j\). If we include a categorical variable for the grouping variable as a fixed effect, we could estimate all those intercepts, however, doing so would in most situations overfit the model (if each individual had two measurements, we’d be including \(n/2\) predictors in a model with \(n\) observations). Instead, we estimate only the variance of \(\kappa_j\) which allows us to determine whether the intercepts differ between groups.

Let’s use a concrete example to make this more precise. Let your data set consist of \(n\) students, labeled \(s = 1, 2, \cdots, n\), each belonging to one of \(m\) classrooms, labeled \(c = 1, 2, \cdots, m\). For further simplicity, let’s assume there is only a single fixed predictor \(X\).

\[ Y_{sc} = \beta_0 + \beta_1X_s + \kappa_c + \epsilon_{sc} \]

You predict \(Y\) based upon the \(X\)’s, then there is some common error amongst all students in classroom \(c\) which captured by \(\kappa_c\), then there is individual error captured by \(\epsilon_{sc}\).

(You sometimes see the \(\kappa_j\) term written as \(\kappa_j Z_j\) where \(Z_j\) would be the variable indicating group membership. I find the above notation clearer, though they are mathematically equivalent.)

In the above example, I’ve assumed that the sole \(X\) in the model is measured as the student level (hence \(X_s\)). There is no need for that. I could instead fit the above model with one variable measured per student (say GPA) and one variable measured per classroom (say average teacher evaluation).

\[ Y_{sc} = \beta_0 + \beta_1X_{1s} + \beta_2X_{2c} + \kappa_c + \epsilon_{sc} \]

The nice part is that when fitting the model, this distinction doesn’t matter! Both \(X_{1s}\) and \(X_{2c}\) are treated the same way.

The error terms can be expanded if desired. For students (\(s\)) nested inside classrooms (\(c\)) nested inside districts (\(d\)):

\[ Y_{scd} = \beta_0 + \beta_1X_{1s} + \beta_2X_{2c} + \gamma_d + \kappa_{cd} + \epsilon_{scd} \]

Here \(\gamma_d\) is the error common to all students in a given district, \(\kappa_{cd}\) is the additional error common to all students in a given classroom, and \(\epsilon_{scd}\) is any left over student error.

1.4.1 The Hierarchical framework

The hierarchical way to write these models is, in my opinion, unnecessarily complicated and does not improve on the understanding. It is more complicated for two reasons: 1) it is simply more complicated to write and understand, and 2) it requires conceptualizing the levels of the model more than needed. For completeness, I will briefly re-write the school and classroom above. In mixed form, the model would be:

\[ Y_{sc} = \beta_0 + \beta_1X_{1s} + \kappa_c + \epsilon_{sc} \]

In the hierarchical form, the model is:

\[ Y_{sc} = \beta_{s0} + \beta_1X_{1s} + \epsilon_{sc} \] \[ \beta_{s0} = \gamma_{0} + \gamma_{1}Z_c + \sigma_c \]

If you were to plug \(\beta_{s0}\) back into the first equation, you can see the equivalence of the two forms.


  1. The choice of \(\epsilon\) for the individual error in a regression is fairly standardized in the literature. My choice of \(\kappa\) is not, as the literature does not have any standard choice for the random effect.↩︎