5 Mixed models
Let’s violate another regression assumption: independence. While this is usually thought of in the repeated measurements setting, it is not exclusive to that. For example,
- Repeated measures: You’re conducting a trial on individuals who undergo an intervention. You generate a survey and have the participants fill out a copy when the intervention begins, 30 days into the intervention, and 1 year after the intervention. If we have a single outcome measure of interest from this survey, we have three measurements per person. These values are not independent; it’s reasonable to think that your measurements are more related to each other than to any of mine.
- Non-repeated measures: You conduct door-to-door sampling of individuals in households, asking about food habits. You collect information on each individual in the house, and want to use individuals as the unit of analysis. It’s likely that two individuals in the same house will have similar food patterns, as opposed to two individuals from different houses.
To address the lack of independence, we will move from normal regression (linear or otherwise) into a mixed models framework, which accounts for this dependence structure. It does this (at the most basic level) by allowing each [individual from the intervention example, household from the door-to-door example] to have its own intercept which we do not directly estimate.
For this chapter, we’ll turn away from the “auto” data set and instead use a sample of the “National Longitudinal Survey” contained in Stata:
webuse nlswork, clear
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)
This data is a survey taken from 1968-1988, and this specific sample of the data is salary information for women. We have repeated measures in the sense that we have yearly data for women, so each woman can have up to 20 data points.
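As a quick check of that structure, a minimal sketch (n_obs is an arbitrary new variable name):

* Count how many rows (years) each woman contributes
bysort idcode: gen n_obs = _N
summarize n_obs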
5.1 Terminology
There are several different names for mixed models that you might encounter; they all fit essentially the same model:
- Mixed model
- Mixed Effects regression/model
- Multilevel regression/model
- Hierarchical regression/model (specifically HLM, hierarchical linear model)
The hierarchical/multilevel variations require thinking about the levels of the data and involve “nesting”, where one variable only occurs within another, e.g. family members nested in a household. The most canonical example of this is students in classrooms; we could have
- Level 1: The lowest level, the students.
- Level 2: Classroom or teacher (this could also be two separate levels of classrooms inside teacher)
- Level 3: District
- Level 4: State
- Level 5: Country
This is taking it a bit far; it’s rare to see more than 3 levels, but in theory, any number can exist.
For this workshop, we will only briefly discuss this from the hierarchical point of view, preferring the mixed models view (with the reminder again that they are the same!).
5.1.1 Econometric terminology
To make the terminology a bit more complicated, in econometrics, some of the terms we will use here are overloaded. When you are discussing mixed models with someone with econometric or economics training, it’s important to differentiate between the statistical terms “fixed effects” and “random effects”, which are the two components of a mixed model that we discuss below, and what econometricians call “fixed effects regression” and “random effects regression”.
Without going into the full details of the econometric world, what econometricians call “random effects regression” is essentially what statisticians call “mixed models”, which is what we’re talking about here. The Stata command xtreg handles those econometric models.
5.2 Wide vs Long data, Time-varying vs Time-invariant
Before you begin your analysis, you need to ensure that the data is in the proper format. Let’s consider the NLS data, where we have measures of women’s salary over 20 years.
Wide format of the data would have each row represent a woman, and she would have 20 columns’ worth of salary information (plus additional demographics).
Long format of the data would have each row represent a woman and a year, so that each woman can have up to 20 rows (if a woman wasn’t measured in a given year, that row & year is missing).
To fit a mixed model, we need the data in long format. We can use the reshape command to transform wide data to long. This is covered in the Intro to Stata set of notes.
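For example, a hypothetical reshape, assuming wide data with one row per woman identified by id and salary columns wage1-wage20 (these variable names are assumptions for illustration):

* One row per woman becomes one row per woman-year
reshape long wage, i(id) j(year)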
Additionally, there is the concept of time-varying vs time-invariant variables. Time-varying variables are those which can differ across entries within the same individual. Examples include weight or salary. Time-invariant variables are those which are the same across all entries. Examples include race or baseline characteristics.
When data is long, time-invariant variables need to be constant per person.
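A one-line sketch to verify this, using race from the NLS data as the time-invariant variable:

* Errors out if any woman has more than one recorded value of race
bysort idcode (race): assert race[1] == race[_N]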
5.3 Linear Mixed Model
The most basic mixed model is the linear mixed model, which extends the linear regression model. A model is called “mixed” because it contains a mixture of fixed effects and random effects.
- Fixed effects: These are the predictors that are present in regular linear regression. We will obtain coefficients for these predictors and be able to test and interpret them. Technically, an OLS linear model is a mixed model with only fixed effects.
- Random effects: These are the “grouping” variables, and must be categorical (Stata will treat every variable used to produce random effects as if it were prefaced by i.). These are essentially just predictors as well; however, we do not obtain coefficients to test or interpret. We do get a measure of the variability across groups, and a test of whether the random effect is benefiting the model.
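To make this concrete, one common way to write the simplest linear mixed model, with a single fixed predictor and a random intercept for each group \(j\), is

\[ y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \epsilon_{ij}, \qquad u_j \sim N(0, \sigma^2_u), \qquad \epsilon_{ij} \sim N(0, \sigma^2_\epsilon). \]

The \(\beta\)’s are the fixed effects, which we estimate and interpret as usual. The \(u_j\)’s are the random intercepts; rather than estimating each one, we estimate only their variance, \(\sigma^2_u\).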
Let’s fit a model using the mixed command. It works similarly to regress with a slight tweak. We’ll try to predict the log of wages using work experience, race and age. The variable idcode identifies individuals.
. mixed ln_w ttl_exp i.race age || idcode:
Performing EM optimization ...

Performing gradient-based optimization:
Iteration 0:  Log likelihood = -10471.727
Iteration 1:  Log likelihood = -10471.727

Computing standard errors ...

Mixed-effects ML regression                     Number of obs     =     28,510
Group variable: idcode                          Number of groups  =      4,710
                                                Obs per group:
                                                              min =          1
                                                              avg =        6.1
                                                              max =         15
                                                Wald chi2(4)      =    5246.25
Log likelihood = -10471.727                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     ln_wage | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     ttl_exp |    .042996   .0010126    42.46   0.000     .0410113    .0449807
             |
        race |
       Black |  -.1178248   .0115814   -10.17   0.000    -.1405238   -.0951257
       Other |   .0995432   .0484246     2.06   0.040     .0046328    .1944536
             |
         age |  -.0069699   .0006836   -10.20   0.000    -.0083097   -.0056302
       _cons |   1.644184   .0161148   102.03   0.000     1.612599    1.675768
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects parameters  |   Estimate   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
idcode: Identity             |
                  var(_cons) |   .1037959   .0026579      .0987151    .1091381
-----------------------------+------------------------------------------------
               var(Residual) |    .089042   .0008183      .0874526    .0906603
------------------------------------------------------------------------------
LR test vs. linear model: chibar2(01) = 11678.16      Prob >= chibar2 = 0.0000
The fixed part of the equation, ln_w ttl_exp i.race age, is the same as with linear regression: ln_w is the outcome and the rest are predictors, with race being categorical. The new part is || idcode:. The || separates the fixed effects on the left from the random effects on the right. idcode identifies individuals. The : enables the more complicated feature of random slopes, which we won’t cover here; for our purposes, the : is just required.
Let’s walk through the output. Note that what we are calling the random effects (e.g. individuals in a repeated measures situation, classrooms in a students nested in classroom situation), Stata refers to as “groups” in much of the output.
- At the very top, you’ll see that the solution is arrived at iteratively, similar to logistic regression (you probably also noticed how slow it is!).
- The log likelihood is how the iteration works; essentially the model “guesses” choices for the coefficients, and finds the set of coefficients that maximize the log likelihood. Of course, the “guess” is much smarter than random. The actual value of the log likelihood is meaningless on its own; it is useful only for comparing models.
- Since we are dealing with repeated measures of some sort, instead of a single sample size, we’re given the total number of obs, the number of groups (unique entries in the random effects), and the min/average/max number of observations per group. As before, just ensure these numbers seem right.
- As with logistic regression, the \(\chi^2\) test tests the hypothesis that all coefficients are simultaneously 0.
- We have a significant p-value, so we continue with the interpretation.
- The coefficients table is interpreted just as in linear regression, with the addendum that each coefficient is also controlling for the structure introduced by the random effects.
  - Increased values of ttl_exp are associated with higher log wages.
  - The race baseline is “White”; compared to White women, Black women have lower average wages and women of other races have higher average wages.
  - Higher age is associated with lower wages.
- The second table (Random-effects parameters) gives us information about the error structure. The “idcode:” section examines whether there is variation across individuals above and beyond the differences in characteristics such as age and race. Since the estimate of var(_cons) (the estimated variance of the constant per person, i.e. the individual-level random effect) is non-zero (and not close to zero), that is evidence that the random effect is beneficial. If the estimate were 0 or close to 0, that would be evidence that the random effect is unnecessary and that any difference between individuals is already accounted for by the covariates.
- The estimated variance of the residuals is any additional variation between observations. This is akin to the residuals from linear regression.
- The \(\chi^2\) test at the bottom is a formal test of the inclusion of the random effects versus a linear regression model without the random effects. We reject the null that the models are equivalent, so it is appropriate to include the random effects.
5.4 Assumptions
The linearity and additivity assumptions remain necessary. rvfplot will not work following a mixed command, but you can generate the residuals vs fitted plot manually.
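For example, a minimal sketch using predict after the model above (rvf_fit and rvf_res are arbitrary new variable names):

* Fitted values from the fixed portion of the model
predict rvf_fit, xb
* Residuals
predict rvf_res, residuals
* Manual residuals vs fitted plot
scatter rvf_res rvf_fit, yline(0)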
The homogeneity of residuals assumption is violated by design in a mixed model. However, some forms of heterogeneity, such as variance that increases as the fitted values increase, remain problematic. Therefore we can still use the residuals vs fitted plot to examine this.
Again, the independence assumption is violated by design, but observations between groups (e.g. between individuals) should be independent.
5.5 Miscellaneous
As we’ve discussed before, collinearity, overfitting, and model selection remain concerns.
Sample size considerations are tricky with mixed models. Typically these are done with simulations. At a rough pass, the rules of thumb from linear regression remain; 10-20 observations per predictor. Adding a new person will improve the power more than adding another observation for an existing group.
The margins and predict commands work similarly to regress; however, note that both (by default) ignore the random effects; that is, the results they produce are averaged across all individuals.
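For example, continuing from the model above (xb_hat is an arbitrary new variable name):

* Linear prediction using the fixed portion only (the default)
predict xb_hat, xb
* Average predicted log wage by race, again ignoring the random effects
margins race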
As with linear regression and logistic regression, mixed supports vce(robust) to enable robust standard errors.
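For example, refitting the model from above:

. mixed ln_w ttl_exp i.race age || idcode:, vce(robust)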
5.6 Convergence issues
As with logistic regression, the solution is arrived at iteratively, which means it can fail to converge for a number of reasons. Separation isn’t an issue here (though it will be in logistic mixed models), but there can be other causes of a failure to converge.
Generally, failure to converge will be due to an issue with the data. Things to look for include:
- Different scales of predictors. For example, salary (in dollars) and number of children have drastically different scales, which can cause issues. Try re-scaling any variables on extreme scales (you can do this with egen scaledvar = std(origvar); see the sketch after this list). This will affect interpretation (the estimated coefficient will be the average predicted change with a one standard deviation increase in the predictor) but not the overall model fit.
- High correlation between predictors can cause this. Check correlations (pwcorr or corr) between your predictors (including any categorical variables) and, if you find a highly correlated pair, try removing one.
- If the iteration keeps running (as opposed to ending and complaining about lack of convergence), try passing the option emiterate(#) with a few “large” (“large” is relative to sample size) numbers to tell the algorithm to stop after # iterations, regardless of convergence. You’re looking for two things:
  - First, if there are any estimated standard errors that are extremely close to zero, that predictor may be causing the issue. Try removing it.
  - Second, if you try a few different maximum iterations (say 50, 100 and 200), and the estimated coefficients and standard errors are relatively constant, you could consider that model as “good enough”. You wouldn’t have much confidence in the point estimates of the coefficients, but you could at least gain insight into the direction and approximate magnitude of the effect.
- You can try using the “reml” optimizer, by passing the reml option. Convergence can be a bit easier with this optimizer.
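Putting a few of these ideas together, a rough sketch based on the model from above (scaled_exp is an arbitrary new variable name):

* Standardize a predictor with an extreme scale
egen scaled_exp = std(ttl_exp)
* Check pairwise correlations among continuous predictors
pwcorr ttl_exp age
* Limit the iterations as discussed above, and request reml
mixed ln_w scaled_exp i.race age || idcode:, emiterate(100) reml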
5.7 Logistic Mixed Model
Similar to logistic regression being an extension to linear regression, logistic mixed models are an extension to linear mixed models when the outcome variable is binary.
The command for logistic mixed models is melogit. The rest of the command works very similarly to mixed, and interpretation is a blend of logistic regression (for the fixed effects) and linear mixed models (for the random effects). Unfortunately, neither lroc nor estat gof is supported, so goodness of fit must be judged solely on the \(\chi^2\) test and perhaps a manual model fit comparison.
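One way to do such a manual comparison is a likelihood ratio test between nested models. A sketch, using union membership from the nlswork data as a hypothetical binary outcome (union, ttl_exp, and race are existing nlswork variables):

* Fit a full model and a reduced model, then compare them
melogit union ttl_exp i.race || idcode:
estimates store full
melogit union ttl_exp || idcode:
estimates store reduced
lrtest full reduced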
By default the log-odds are reported; give the or option to report the odds ratios.
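For example, continuing with the hypothetical union model:

. melogit union ttl_exp i.race || idcode:, or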
5.7.1 meqrlogit
There is a different solver that can be used, based upon the QR decomposition. This is run with the command meqrlogit. It functions identically to melogit. If melogit has convergence issues, try using meqrlogit instead.
5.8 Exercise 6
Load up the “chicken” data set from Stata’s website:
webuse chicken, clear
The data contains order information from a number of restaurants and records whether the order resulted in a complaint. We’d like to see what attributes (if any) of the servers may increase the odds of a complaint. Since we have multiple orders per restaurant, it’s reasonable to assume that certain restaurants just receive more complaints than others, regardless of the server, so we’ll need to include random effects for those.
Fit a mixed effects logistic regression model predicting complain, based upon server characteristics (grade, race, gender, tenure, age, income) and a few restaurant characteristics (genderm for gender of manager and nworkers for number of workers). Include a random effect for restaurant.
- Does the model fit better than chance?
- Interpret the model. What predicts a higher odds of receiving a complaint?
- Does it appear that adding the random effect was needed?