Stata’s xtreg
versus mixed
/regress
In Stata, panel data (repeated measures) can be modeled using
mixed
(and its siblings e.g. melogit
,
mepoisson
) or using the xt
toolkit, including
xtset
and xtreg
.
This document is an attempt to show the equivalency of the models between the two commands. There will be slight differences due to the algorithms used in the backend but the results should generally be equivalent.
Data
We’ll use the "nlswork" data:
. webuse nlswork, clear (National Longitudinal Survey of Young Women, 14-24 years old in 1968) . describe, short Contains data from https://www.stata-press.com/data/r18/nlswork.dta Observations: 28,534 National Longitudinal Survey of Young Women, 14-24 years old in 1968 Variables: 21 27 Nov 2022 08:14 Sorted by: idcode year . describe idcode year ln_wage grade age ttl_exp tenure not_smsa south Variable Storage Display Value name type format label Variable label ---------------------------------------------------------------------------------------------------------------- idcode int %8.0g NLS ID year byte %8.0g Interview year ln_wage float %9.0g ln(wage/GNP deflator) grade byte %8.0g Current grade completed age byte %8.0g Age in current year ttl_exp float %9.0g Total work experience tenure float %9.0g Job tenure, in years not_smsa byte %8.0g 1 if not SMSA south byte %8.0g 1 if south
idcode
represents each individual, data is measured over
the years. Lets set it up using xtset
:
. xtset idcode Panel variable: idcode (unbalanced)
Theory
Within the panel/xt
framework, there are three separate
models:
-
fe
or “fixed effects”. This is modeling the within variation - Ignoring differences between individuals, is there a difference per values of time-varying variables? -
be
or “between effects”. This is modeling the between variation - Averaging across individuals (collapsing over time), is there a difference per values of subject-varying variables? -
re
or “random effects”. This generates a weighted average of the above two models.
Each is fitted via xtreg
:
xtreg, fe xtreg , be xtreg , re
The short version of how to fit each model using a command other than
xtreg
is:
-
fe
: Run a linear model withreg
including the group as a categorical variable (this is called the Least Squares Dummy Variable, LSDV, model). -
be
: Collapse over individual, and run a linear model withreg
. -
re
: A traditional mixed model with a random effect (this is random effect in the sense of a mixed model, not in thext
setting) for individual.
The Math
Picture a typical mixed model setup:
Here i is an index for individuals, t is an index for time. and are some outcome and predictor which are both time and individual varying. is an error associated with each individual and is an additional error per observation.
If this model is true, then the following must be true:
Each bar’d variable is average over each individual. In this model, and are indistinguishable, so this is just a linear model.
Since we have that both models are equivalent, if we difference them, we remain equivalent:
Again, we have just a linear model.
Finally, the random effects model doesn’t add much clarity, but it essentially is a weighted combination of the other two, with the weight being a function of the variance of and If the variance of is 0, then there’s no individual level effect and the first model can be fit lineally (because is constant and folds into the intercept).
Assumptions
There is one key different assumption between the models: The random effects model assumes that unobservable variables are uncorrelated with other covariates. The other models don’t.
The between effects and random effects models assume that and are uncorrelated individual intercepts are independent of predictors).
xtsum
: Estimating between/within/overall variance
The xtsum
command can be used to estimate the variance of a
variable within versus between.
. xtsum ln_wage Variable | Mean Std. dev. Min Max | Observations -----------------+--------------------------------------------+---------------- ln_wage overall | 1.674907 .4780935 0 5.263916 | N = 28534 between | .424569 0 3.912023 | n = 4711 within | .29266 -.4077221 4.78367 | T-bar = 6.05689
We can replicate all these results without xt
. As a
sidenote, "T-bar
" represents the average number of measures
per individual, or N/n
.
Overall variation
Easily done:
. summarize ln_wage Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- ln_wage | 28,534 1.674907 .4780935 0 5.263916
Within variation
Taking our cue from the notes in the theory, to obtain within variation we will center the variable by individual.
. egen meanln_wage = mean(ln_wage), by(idcode) . generate cln_wage = ln_wage - meanln_wage . summarize cln_wage Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- cln_wage | 28,534 -2.83e-10 .29266 -2.082629 3.108763
Note that the mean is 0 (within rounding error), as we’d expect. To get the mean/min/max back into the same scale as the raw data we can re-add the overall mean to
. egen overallmean = mean(ln_wage) . generate cln_wage2 = cln_wage + overallmean . summarize cln_wage2 Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- cln_wage2 | 28,534 1.674907 .29266 -.4077221 4.783669
This works because each individual has mean of 0 or in the second
case, overallmean
, in either case, since the means are
constant, we’ve removed any between variance and isolated the within
variance.
Between variation
We simply collapse by id.
. preserve . collapse (mean) ln_wage, by(idcode) . summarize ln_wage Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- ln_wage | 4,711 1.650605 .424569 0 3.912023 . restore
Doing this works because each subject now has a single observation, hence the within variance is identically 0, so the remaining variance is between-variance.
Fitting the models
Let’s use the following as our model:
ln_wage ~ grade + age + ttl_exp + tenure + not_smsa + south
xtreg, fe
: Fixed Effect model (Within variance)
. xtreg ln_wage grade age ttl_exp tenure not_smsa south, fe note: grade omitted because of collinearity. Fixed-effects (within) regression Number of obs = 28,091 Group variable: idcode Number of groups = 4,697 R-squared: Obs per group: Within = 0.1491 min = 1 Between = 0.3526 avg = 6.0 Overall = 0.2517 max = 15 F(5, 23389) = 819.94 corr(u_i, Xb) = 0.2348 Prob > F = 0.0000 ------------------------------------------------------------------------------ ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- grade | 0 (omitted) age | -.0026787 .000863 -3.10 0.002 -.0043703 -.0009871 ttl_exp | .0287709 .0014474 19.88 0.000 .0259339 .0316079 tenure | .0114355 .0009229 12.39 0.000 .0096265 .0132445 not_smsa | -.0921689 .0096641 -9.54 0.000 -.1111112 -.0732266 south | -.0633396 .0110819 -5.72 0.000 -.0850608 -.0416184 _cons | 1.591678 .0186849 85.19 0.000 1.555054 1.628302 -------------+---------------------------------------------------------------- sigma_u | .36167618 sigma_e | .29477563 rho | .60086475 (fraction of variance due to u_i) ------------------------------------------------------------------------------ F test that all u_i=0: F(4696, 23389) = 6.63 Prob > F = 0.0000
To replicate, we’ll include idcode
as a categorical
variable. One of the benefits of xtreg ..., fe
is
efficiency; since there are over 4000 idcode
, the
regression model will fail to run. Consequently, we’ll demostrate on a
subset of the data.
. preserve . keep if idcode < 100 (27,959 observations deleted) . xtreg ln_wage grade age ttl_exp tenure not_smsa south, fe note: grade omitted because of collinearity. Fixed-effects (within) regression Number of obs = 567 Group variable: idcode Number of groups = 89 R-squared: Obs per group: Within = 0.1895 min = 1 Between = 0.2312 avg = 6.4 Overall = 0.1900 max = 15 F(5, 473) = 22.12 corr(u_i, Xb) = 0.1420 Prob > F = 0.0000 ------------------------------------------------------------------------------ ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- grade | 0 (omitted) age | -.0045718 .0054274 -0.84 0.400 -.0152366 .0060931 ttl_exp | .0313655 .0091803 3.42 0.001 .0133263 .0494046 tenure | .0149681 .0061975 2.42 0.016 .0027901 .0271462 not_smsa | .0003284 .0743637 0.00 0.996 -.1457957 .1464524 south | .1353543 .170972 0.79 0.429 -.2006042 .4713129 _cons | 1.770705 .1125176 15.74 0.000 1.549609 1.991801 -------------+---------------------------------------------------------------- sigma_u | .35036238 sigma_e | .26779328 rho | .6312319 (fraction of variance due to u_i) ------------------------------------------------------------------------------ F test that all u_i=0: F(88, 473) = 9.01 Prob > F = 0.0000 . regress ln_wage grade age ttl_exp tenure not_smsa south i.idcode, noconstant Source | SS df MS Number of obs = 567 -------------+---------------------------------- F(94, 473) = 305.03 Model | 2056.18728 94 21.8743328 Prob > F = 0.0000 Residual | 33.9203619 473 .071713239 R-squared = 0.9838 -------------+---------------------------------- Adj R-squared = 0.9805 Total | 2090.10764 567 3.68625687 Root MSE = .26779 ------------------------------------------------------------------------------ ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- grade | .1664948 .0111166 14.98 0.000 .1446507 .1883389 age | -.0045718 .0054274 -0.84 0.400 -.0152366 .0060931 ttl_exp | .0313655 .0091803 3.42 0.001 .0133263 .0494046 tenure | .0149681 .0061975 2.42 0.016 .0027901 .0271462 not_smsa | .0003284 .0743637 0.00 0.996 -.1457957 .1464524 south | .1353543 .170972 0.79 0.429 -.2006042 .4713129 | idcode | 2 | -.4226575 .1098026 -3.85 0.000 -.6384187 -.2068963 3 | -.5725985 .104336 -5.49 0.000 -.7776178 -.3675791 [Many more `idcode` rows trimmed] 98 | -.0877863 .1183023 -0.74 0.458 -.3202494 .1446767 99 | -.127322 .1556316 -0.82 0.414 -.4331369 .1784928 ------------------------------------------------------------------------------ . restore
grade
is not estimated in the fixed effects model because
it is time-invarying; within each individual it is constant. In the
regress
model, we are able to estimate it.
xtreg
reports 3 R-squared statistics; this is a within
variance model so we can use that value (which agrees with the
regression R-squared). Note that the regress
R-squared
estimate is artificially inflated due to the massive amount of
predictors.
xtreg, be
: Between Effect model (Between variance)
. xtreg ln_wage grade age ttl_exp tenure not_smsa south, be Between regression (regression on group means) Number of obs = 28,091 Group variable: idcode Number of groups = 4,697 R-squared: Obs per group: Within = 0.1427 min = 1 Between = 0.4787 avg = 6.0 Overall = 0.3562 max = 15 F(6,4690) = 717.89 sd(u_i + avg(e_i.)) = .3068161 Prob > F = 0.0000 ------------------------------------------------------------------------------ ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- grade | .064188 .0019539 32.85 0.000 .0603575 .0680185 age | -.0041071 .0010618 -3.87 0.000 -.0061886 -.0020255 ttl_exp | .0287514 .002033 14.14 0.000 .0247658 .0327369 tenure | .0286782 .0022174 12.93 0.000 .024331 .0330254 not_smsa | -.175568 .0111952 -15.68 0.000 -.1975159 -.1536202 south | -.1086271 .0098438 -11.04 0.000 -.1279256 -.0893287 _cons | .8066724 .0329873 24.45 0.000 .7420017 .871343 ------------------------------------------------------------------------------
To replicate, collapse over idcode
and run a regression:
. preserve . collapse (mean) ln_wage grade age ttl_exp tenure not_smsa south, by(idcode) . regress ln_wage grade age ttl_exp tenure not_smsa south Source | SS df MS Number of obs = 4,697 -------------+---------------------------------- F(6, 4690) = 724.96 Model | 406.076398 6 67.6793996 Prob > F = 0.0000 Residual | 437.838178 4,690 .093355688 R-squared = 0.4812 -------------+---------------------------------- Adj R-squared = 0.4805 Total | 843.914575 4,696 .179709237 Root MSE = .30554 ------------------------------------------------------------------------------ ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- grade | .0642015 .0019467 32.98 0.000 .0603849 .068018 age | -.0039009 .0010611 -3.68 0.000 -.0059812 -.0018206 ttl_exp | .0285569 .0020267 14.09 0.000 .0245836 .0325301 tenure | .0288392 .0022077 13.06 0.000 .024511 .0331673 not_smsa | -.1758847 .0111494 -15.78 0.000 -.1977428 -.1540266 south | -.1080377 .0098055 -11.02 0.000 -.1272611 -.0888142 _cons | .8000955 .0328792 24.33 0.000 .7356368 .8645542 ------------------------------------------------------------------------------ . restore
Again, the coefficients agree to three decimals and the between R-square agrees.
All predictors here are estimated; if we had any time-variant by individual-invariant predictors (e.g. time), they would not be estimable here.
xtreg, re
: Random Effect model (Both variances)
. xtreg ln_wage grade age ttl_exp tenure not_smsa south, re Random-effects GLS regression Number of obs = 28,091 Group variable: idcode Number of groups = 4,697 R-squared: Obs per group: Within = 0.1483 min = 1 Between = 0.4701 avg = 6.0 Overall = 0.3569 max = 15 Wald chi2(6) = 8304.62 corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ ln_wage | Coefficient Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- grade | .0691836 .0017689 39.11 0.000 .0657166 .0726506 age | -.0038386 .0006544 -5.87 0.000 -.0051212 -.0025559 ttl_exp | .0301313 .0011215 26.87 0.000 .0279331 .0323294 tenure | .0134656 .0008442 15.95 0.000 .011811 .0151202 not_smsa | -.128591 .0072246 -17.80 0.000 -.142751 -.114431 south | -.0932646 .007231 -12.90 0.000 -.107437 -.0790921 _cons | .7544109 .0273445 27.59 0.000 .7008168 .8080051 -------------+---------------------------------------------------------------- sigma_u | .26027808 sigma_e | .29477563 rho | .43808743 (fraction of variance due to u_i) ------------------------------------------------------------------------------
Using mixed
:
. mixed ln_wage grade age ttl_exp tenure not_smsa south || idcode: Performing EM optimization ... Performing gradient-based optimization: Iteration 0: Log likelihood = -9218.9773 Iteration 1: Log likelihood = -9218.9773 Computing standard errors ... Mixed-effects ML regression Number of obs = 28,091 Group variable: idcode Number of groups = 4,697 Obs per group: min = 1 avg = 6.0 max = 15 Wald chi2(6) = 8496.81 Log likelihood = -9218.9773 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ ln_wage | Coefficient Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- grade | .0691186 .0017231 40.11 0.000 .0657414 .0724959 age | -.003869 .0006491 -5.96 0.000 -.0051411 -.0025969 ttl_exp | .030151 .0011135 27.08 0.000 .0279687 .0323334 tenure | .013591 .0008441 16.10 0.000 .0119365 .0152454 not_smsa | -.1299789 .007154 -18.17 0.000 -.1440004 -.1159575 south | -.0941264 .0071291 -13.20 0.000 -.1080991 -.0801537 _cons | .7566548 .0267655 28.27 0.000 .7041954 .8091142 ------------------------------------------------------------------------------ ------------------------------------------------------------------------------ Random-effects parameters | Estimate Std. err. [95% conf. interval] -----------------------------+------------------------------------------------ idcode: Identity | var(_cons) | .0626522 .0017678 .0592815 .0662147 -----------------------------+------------------------------------------------ var(Residual) | .087569 .000811 .0859938 .0891732 ------------------------------------------------------------------------------ LR test vs. linear model: chibar2(01) = 7277.75 Prob >= chibar2 = 0.0000
The results are very close; we can get even closer by fitting the
xtreg
model with the mle
option, which uses
a different estimation strategy.
. xtreg ln_wage grade age ttl_exp tenure not_smsa south, re mle Fitting constant-only model: Iteration 0: Log likelihood = -12663.954 Iteration 1: Log likelihood = -12649.756 Iteration 2: Log likelihood = -12649.614 Iteration 3: Log likelihood = -12649.614 Fitting full model: Iteration 0: Log likelihood = -9271.8615 Iteration 1: Log likelihood = -9219.1214 Iteration 2: Log likelihood = -9218.9773 Iteration 3: Log likelihood = -9218.9773 Random-effects ML regression Number of obs = 28,091 Group variable: idcode Number of groups = 4,697 Random effects u_i ~ Gaussian Obs per group: min = 1 avg = 6.0 max = 15 LR chi2(6) = 6861.27 Log likelihood = -9218.9773 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ ln_wage | Coefficient Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- grade | .0691186 .0017233 40.11 0.000 .0657411 .0724962 age | -.003869 .0006491 -5.96 0.000 -.0051413 -.0025967 ttl_exp | .030151 .0011135 27.08 0.000 .0279687 .0323334 tenure | .013591 .0008454 16.08 0.000 .0119341 .0152478 not_smsa | -.1299789 .0071711 -18.13 0.000 -.144034 -.1159239 south | -.0941264 .0071356 -13.19 0.000 -.1081119 -.0801409 _cons | .7566548 .0267773 28.26 0.000 .7041722 .8091374 -------------+---------------------------------------------------------------- /sigma_u | .2503043 .0035313 .2434779 .2573221 /sigma_e | .2959207 .0013704 .293247 .2986188 rho | .4170663 .0074745 .4024774 .4317704 ------------------------------------------------------------------------------ LR test of sigma_u=0: chibar2(01) = 7277.75 Prob >= chibar2 = 0.000