Chapter 6 Mixed models
Let’s violate another regression assumption, independence. While this is usually thought of in the repeated measurements setting, it is not exclusive to that. For example,
 Repeated measures: You’re conducting a trial on individuals who undergo an intervention. You generate a survey, and have the participants fill out a copy when the intervention begins, 30 days into the intervention, and 1 year after the intervention. If we have a single outcome measure of interest from this survey, we have three measurements for person. These values are not independent; it’s reasonable to think that your measurements are more related to each other than any of mine.
 Nonrepeated measures: You conduct doortodoor sampling of individuals in households, asking about food habits. You collect information on each individual in the house, and want to use individuals as the unit of analysis. It’s likely that two individuals in the same house will have similar food patterns, as opposed to two individuals from different houses.
To address the lack of dependence, we will move from normal regression (linear or otherwise) into a mixed models framework, which accounts for this dependence structure. It does this (at the most basic level) by allowing each [individual from the intervention example, household from the doortodoor example] to have its own intercept which we do not estimate.
For this chapter, we’ll turn away from the “auto” data set and instead use a sample of the “National Longitudinal Survey” contained in Stata:
. webuse nlswork, clear
(National Longitudinal Survey. Young Women 1426 years of age in 1968)
This data is a survey taken from 19681988, and this specific sample of the data is salary information for women. We have repeated measures in the sense that we have yearly data for women, so each woman can have up to 20 data points.
6.1 Terminology
There are several different names for mixed models which you might encounter, that all fit essentially the same model:
 Mixed model
 Mixed Effects regression/model
 Multilevel regression/model
 Hierarchical regression/model (specifically HLM, hierarchical linear model)
The hierarchical/multilevel variations require thinking about the levels of the data and involves “nesting”, where one variable only occurs within another, e.g. family members nested in a household. The most canonical example of this is students in classrooms, we could have
 Level 1: The lowest level, the students.
 Level 2: Classroom or teacher (this could also be two separate levels of classrooms inside teacher)
 Level 3: District
 Level 4: State
 Level 5: Country
This is taking it a bit far; it’s rare to see more than 3 levels, but in theory, any number can exist.
For this workshop, we will only briefly discuss this from hierarchical point of view, preferring the mixed models view (with the reminder again that they are the same!).
6.1.1 Econometric terminology
To make the terminology a bit more complicated, in econometrics, some of the terms we will use here are overloaded. When you are discussing mixed models with someone with econometric or economics training, it’s important to differentiate between the statistical terms of “fixed effects” and “random effects” which are the two components of a mixed model that we discuss below, and what econometricians called “fixed effects regression” and “random effects regression”.
Without going into the full details of the econometric world, what econometricians called “random effects regression” is essentially what statisticians called “mixed models”, what we’re talking about here. The Stata command xtreg
handles those econometric models.
6.2 Wide vs Long data, Timevarying vs Timeinvariant
Before you begin your analysis, you need to ensure that the data is in the proper format. Let’s consider the NLS data, where we have measures of women’s salary over 20 years.
Wide format of the data would have row represent a woman, and she would have 20 columns worth of salary information^{18}(plus additional demographics).
Long format of the data would have each row represent a woman and a year, so that each woman can have up to 20 rows (if a woman wasn’t measured in a given year, that row & year is missing).
To fit a mixed model, we need the data in long format. We can use the reshape
command to transform wide data to long. This is covered in the Stata I set of notes.
Additionally, there is the concept of timevarying vs timeinvariant variables. Timevarying variables are those which can be different for each entry within the same individual. Examples include weight or salary. Timeinvariant are those which are the same across all entries. Examples include race or baseline characteristics.
When data is long, timeinvariant variables need to be constant per person.
6.3 Linear Mixed Model
The most basic mixed model is the linear mixed model, which extends the linear regression model. A model is called “mixed” because it contains a mixture of fixed effects and random effects.
 Fixed effects: These are the predictors that are present in regular linear regression. We will obtain coefficients for these predictors and be able to test and interpret them. Technically, an OLS linear model is a mixed model with only fixed effects.^{19}

Random effects: These are the “grouping” variables, and must be categorical (Stata will force every variable used to produce random effects as if it were prefaced by
i.
). These are essentially just predictors as well, however, we do not obtain coefficients to test or interpret. We do get a measure of the variability across groups, and a test of whether the random effect is benefiting the model.
Let’s fit a model using the mixed
command. It works similar to regress
with a slight tweak. We’ll try and predict log of wages^{20} using work experience, race and age. The variable idcode
identifies individuals.
. mixed ln_w ttl_exp i.race age  idcode:
Performing EM optimization:
Performing gradientbased optimization:
Iteration 0: log likelihood = 10471.727
Iteration 1: log likelihood = 10471.727
Computing standard errors:
Mixedeffects ML regression Number of obs = 28,510
Group variable: idcode Number of groups = 4,710
Obs per group:
min = 1
avg = 6.1
max = 15
Wald chi2(4) = 5246.25
Log likelihood = 10471.727 Prob > chi2 = 0.0000

ln_wage  Coef. Std. Err. z P>z [95% Conf. Interval]
+
ttl_exp  .042996 .0010126 42.46 0.000 .0410113 .0449807

race 
black  .1178248 .0115814 10.17 0.000 .1405238 .0951257
other  .0995432 .0484246 2.06 0.040 .0046328 .1944536

age  .0069699 .0006836 10.20 0.000 .0083097 .0056302
_cons  1.644184 .0161148 102.03 0.000 1.612599 1.675768


Randomeffects Parameters  Estimate Std. Err. [95% Conf. Interval]
+
idcode: Identity 
var(_cons)  .1037959 .0026579 .0987151 .1091381
+
var(Residual)  .089042 .0008183 .0874526 .0906603

LR test vs. linear model: chibar2(01) = 11678.16 Prob >= chibar2 = 0.0000
The fixed part of the equation, ln_w ttl_exp i.race age
is the same as with linear regression, ln_w
is the outcome and the rest are predictors, with race
being categorical. The new part is  idcode:
. The 
separates the fixed on the left from the random effects on the right. idcode
identifies individuals. The :
is to enable the more complicated feature of random slopes which we won’t cover here; for our purposes the :
is just required.
Let’s walk through the output. Note that what we are calling the random effects (e.g. individuals in a repeated measures situation, classrooms in a students nested in classroom situation), Stata refers to as “groups” in much of the output.
 At the very top, you’ll see that the solution is arrived at iteratively, similar to logistic regression (you probably also noticed how slow it is)!
 The log likelihood is how the iteration works; essentially the model “guesses” choices for the coefficients, and finds the set of coefficients that minimize the log likelihood. Of course, the “guess” is much smarter than random. The actual value of the log likelihood is meaningless.
 Since we are dealing with repeated measures of some sort, instead of a single sample size, we record the total number of obs, the number of groups (unique entries in the random effects) and min/mean/max of the groups. As before, just ensure these numbers seem right.

As with logistic regression, the \(\chi^2\) test tests the hypothesis that all coefficients are simultaneously 0.
 We gave a significant pvalue, so we continue with the interpretation.

The coefficients table is interpreted just as in linear regression, with the addendum that each coefficient is also controlling for the structure introduced by the random effects.

Increased values of
ttl_exp
is associated with higher log incomes. 
The
race
baseline is “white”; compared to white, blacks have lower average income and others have higher average income.  Higher age is associated with lower income.

Increased values of

The second table (“Randomeffects parameters”) gives us information about the error structure. The “idcode:” section is examining whether there is variation across individuals above and beyond the differences in characteristics such as age and race. Since the estimate of
var(_cons)
(the estimated variance of the constant per person  the individual level random effect) is nonzero (and not close to zero), that is evidence that the random effect is beneficial. If the estimate was 0 or close to 0, that would be evidence that the random effect is unnecessary and that any difference between individuals is already accounted for by the covariates.  The estimated variance of the residuals is any additional variation between observations. This is akin to the residuals from linear regression.
 The \(\chi^2\) test at the bottom is a formal test of the inclusion of the random effects versus a linear regression model without the random effects. We reject the null that the models are equivalent, so it is appropriate to include the random effects.
6.4 Assumptions
The linear additivity remains necessary. rvfplot
will not work following a mixed
command, but you can generate the residuals vs fitted plot manually.
The homogeneity of residuals assumption is violated by design in a mixed model. However, some forms of heterogeneity, such as increasing variance as fitted values increase, are not supported. Therefore we can still use the residuals vs fitted plot to examine this.
Again, the independence assumption is violated by design, but observations between groups (e.g. between individuals) should be independent.
6.5 Miscellaneous
As we’ve discussed before, collinearity, overfitting, and model selection remain concerns.
Sample size considerations are tricky with mixed models. Typically these are done with simulations. At a rough pass, the rules of thumb from linear regression remain; 1020 observations per predictor. Adding a new person will improve the power more than adding another observation for an existing group.
The margins
and predict
command work similarly to regress
, however note that both (by default) ignore the random effects; that is, the results the produce are averaged across all individuals.
As with linear regression and logistic regression, mixed
supports vce(robust)
to enable robust standard errors.
6.6 Convergence issues
As with logistic regression, the solution is arrived at iteratively, which means it can fail to converge for a number of reasons. Separation isn’t an issue here (though it will be in logistic mixed models), but there can be other causes of a failure to converge.
Generally, failure to converge will be due to an issue with the data. Things to look for include:

Different scales of predictors. For example, salary (in dollars) and number of children. The scales are drastically different which can cause issues. Try rescaling any variables on extreme scales (you can do this with
egen scaledvar = std(origvar)
). This will affect interpretation (the estimated coefficient will be the average predicted change with a one standard deviation increase in the predictor) but not the overall model fit. 
High correlation can cause this. Check correlations (
pwcorr
orcorr
) between your predictors (including any categorical variables) and if you find a highly correlated pair, try removing one. 
If the iteration keeps running (as opposed to ending and complaining about lack of convergence), try passing the option
emiterate(#)
with a few “large” (“large” is relative to sample size) numbers to tell the algorithm to stop after # iterations, regardless of convergence. You’re looking for two things: First, if there are any estimated standard errors that are extremely close to zero, that predictor may be causing the issue. Try removing it.
 Second, if you try a few different max iterations (say 50, 100 and 200), and the estimated coefficients and standard errors are relatively constant, you could consider that model as “good enough”. You wouldn’t have much confidence in the point estimates of the coefficients, but you could at least gain insight into the direction and approximate magnitude of the effect.

You can try use the “reml” optimizer, by passing the
reml
option. This optimizer can be a bit easier to converge.
6.7 Logistic Mixed Model
Similar to logistic regression being an extension to linear regression, logistic mixed models are an extension to linear mixed models when the outcome variable is binary.
The command for logistic mixed models is melogit
. The rest of the command works very similarly to mixed
, and interpretation is the best of logistic regression (for fixed effects) and linear mixed models (for random effects). Unfortunately, neither lroc
nor estat gof
is supported, so goodness of fit must be measured solely on the \(\chi^2\) test and perhaps a manual model fit comparison.
By default the logodds are reported, give the or
option to report the odds ratios.
6.7.1 meqrlogit
There is a different solver that can be used based upon QRdecomposition. This is run with the command meqrlogit
. It functions identically to melogit
. If melogit
has convergence issues, try using meqrlogit
instead.
6.8 Exercise 6
Load up the “chicken” data set from Stata’s website:
webuse chicken, clear
The data contains order information from a number of restaurants and records whether the order resulted in a complaint. We’d like to see what attributes (if any) of the servers may increase the odds of a complaint. Since we have multiple orders per restaurant, it’s reasonable to assume that certain restaurants just recieve more complaints than others, regardless of the server, so we’ll need to include random effects for those.
Fit a mixed effects logistic regression model predicting complain
, based upon server characteristics (grade
, race
, gender
, tenure
, age
, income
) and a few restaurant characteristics (genderm
for gender of manager and nworkers
for number of workers). Include a random effect for restaurant
.
 Does the model fit better than chance?
 Interpret the model. What predicts a higher odds of recieving a complaint?
 Does it appear that adding the random effect was needed?