Stata’s xtreg versus mixed/regress
In Stata, panel data (repeated measures) can be modeled using
mixed (and its siblings e.g. melogit,
mepoisson) or using the xt toolkit, including
xtset and xtreg.
This document is an attempt to show the equivalency of the models between the two commands. There will be slight differences due to the algorithms used in the backend but the results should generally be equivalent.
Data
We’ll use the "nlswork" data:
. webuse nlswork, clear
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)
. describe, short
Contains data from https://www.stata-press.com/data/r18/nlswork.dta
Observations: 28,534 National Longitudinal Survey of Young Women, 14-24 years old in
1968
Variables: 21 27 Nov 2022 08:14
Sorted by: idcode year
. describe idcode year ln_wage grade age ttl_exp tenure not_smsa south
Variable Storage Display Value
name type format label Variable label
----------------------------------------------------------------------------------------------------------------
idcode int %8.0g NLS ID
year byte %8.0g Interview year
ln_wage float %9.0g ln(wage/GNP deflator)
grade byte %8.0g Current grade completed
age byte %8.0g Age in current year
ttl_exp float %9.0g Total work experience
tenure float %9.0g Job tenure, in years
not_smsa byte %8.0g 1 if not SMSA
south byte %8.0g 1 if south
idcode represents each individual, data is measured over
the years. Lets set it up using xtset:
. xtset idcode Panel variable: idcode (unbalanced)
Theory
Within the panel/xt framework, there are three separate
models:
-
feor “fixed effects”. This is modeling the within variation - Ignoring differences between individuals, is there a difference per values of time-varying variables? -
beor “between effects”. This is modeling the between variation - Averaging across individuals (collapsing over time), is there a difference per values of subject-varying variables? -
reor “random effects”. This generates a weighted average of the above two models.
Each is fitted via xtreg:
xtreg, fe xtreg , be xtreg , re
The short version of how to fit each model using a command other than
xtreg is:
-
fe: Run a linear model withregincluding the group as a categorical variable (this is called the Least Squares Dummy Variable, LSDV, model). -
be: Collapse over individual, and run a linear model withreg. -
re: A traditional mixed model with a random effect (this is random effect in the sense of a mixed model, not in thextsetting) for individual.
The Math
Picture a typical mixed model setup:
Here i is an index for individuals, t is an index for time. and are some outcome and predictor which are both time and individual varying. is an error associated with each individual and is an additional error per observation.
If this model is true, then the following must be true:
Each bar’d variable is average over each individual. In this model, and are indistinguishable, so this is just a linear model.
Since we have that both models are equivalent, if we difference them, we remain equivalent:
Again, we have just a linear model.
Finally, the random effects model doesn’t add much clarity, but it essentially is a weighted combination of the other two, with the weight being a function of the variance of and If the variance of is 0, then there’s no individual level effect and the first model can be fit lineally (because is constant and folds into the intercept).
Assumptions
There is one key different assumption between the models: The random effects model assumes that unobservable variables are uncorrelated with other covariates. The other models don’t.
The between effects and random effects models assume that and are uncorrelated individual intercepts are independent of predictors).
xtsum: Estimating between/within/overall variance
The xtsum command can be used to estimate the variance of a
variable within versus between.
. xtsum ln_wage
Variable | Mean Std. dev. Min Max | Observations
-----------------+--------------------------------------------+----------------
ln_wage overall | 1.674907 .4780935 0 5.263916 | N = 28534
between | .424569 0 3.912023 | n = 4711
within | .29266 -.4077221 4.78367 | T-bar = 6.05689
We can replicate all these results without xt. As a
sidenote, "T-bar" represents the average number of measures
per individual, or N/n.
Overall variation
Easily done:
. summarize ln_wage
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
ln_wage | 28,534 1.674907 .4780935 0 5.263916
Within variation
Taking our cue from the notes in the theory, to obtain within variation we will center the variable by individual.
. egen meanln_wage = mean(ln_wage), by(idcode)
. generate cln_wage = ln_wage - meanln_wage
. summarize cln_wage
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
cln_wage | 28,534 -2.83e-10 .29266 -2.082629 3.108763
Note that the mean is 0 (within rounding error), as we’d expect. To get the mean/min/max back into the same scale as the raw data we can re-add the overall mean to
. egen overallmean = mean(ln_wage)
. generate cln_wage2 = cln_wage + overallmean
. summarize cln_wage2
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
cln_wage2 | 28,534 1.674907 .29266 -.4077221 4.783669
This works because each individual has mean of 0 or in the second
case, overallmean, in either case, since the means are
constant, we’ve removed any between variance and isolated the within
variance.
Between variation
We simply collapse by id.
. preserve
. collapse (mean) ln_wage, by(idcode)
. summarize ln_wage
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
ln_wage | 4,711 1.650605 .424569 0 3.912023
. restore
Doing this works because each subject now has a single observation, hence the within variance is identically 0, so the remaining variance is between-variance.
Fitting the models
Let’s use the following as our model:
ln_wage ~ grade + age + ttl_exp + tenure + not_smsa + south
xtreg, fe: Fixed Effect model (Within variance)
. xtreg ln_wage grade age ttl_exp tenure not_smsa south, fe
note: grade omitted because of collinearity.
Fixed-effects (within) regression Number of obs = 28,091
Group variable: idcode Number of groups = 4,697
R-squared: Obs per group:
Within = 0.1491 min = 1
Between = 0.3526 avg = 6.0
Overall = 0.2517 max = 15
F(5, 23389) = 819.94
corr(u_i, Xb) = 0.2348 Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
grade | 0 (omitted)
age | -.0026787 .000863 -3.10 0.002 -.0043703 -.0009871
ttl_exp | .0287709 .0014474 19.88 0.000 .0259339 .0316079
tenure | .0114355 .0009229 12.39 0.000 .0096265 .0132445
not_smsa | -.0921689 .0096641 -9.54 0.000 -.1111112 -.0732266
south | -.0633396 .0110819 -5.72 0.000 -.0850608 -.0416184
_cons | 1.591678 .0186849 85.19 0.000 1.555054 1.628302
-------------+----------------------------------------------------------------
sigma_u | .36167618
sigma_e | .29477563
rho | .60086475 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(4696, 23389) = 6.63 Prob > F = 0.0000
To replicate, we’ll include idcode as a categorical
variable. One of the benefits of xtreg ..., fe is
efficiency; since there are over 4000 idcode, the
regression model will fail to run. Consequently, we’ll demostrate on a
subset of the data.
. preserve
. keep if idcode < 100
(27,959 observations deleted)
. xtreg ln_wage grade age ttl_exp tenure not_smsa south, fe
note: grade omitted because of collinearity.
Fixed-effects (within) regression Number of obs = 567
Group variable: idcode Number of groups = 89
R-squared: Obs per group:
Within = 0.1895 min = 1
Between = 0.2312 avg = 6.4
Overall = 0.1900 max = 15
F(5, 473) = 22.12
corr(u_i, Xb) = 0.1420 Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
grade | 0 (omitted)
age | -.0045718 .0054274 -0.84 0.400 -.0152366 .0060931
ttl_exp | .0313655 .0091803 3.42 0.001 .0133263 .0494046
tenure | .0149681 .0061975 2.42 0.016 .0027901 .0271462
not_smsa | .0003284 .0743637 0.00 0.996 -.1457957 .1464524
south | .1353543 .170972 0.79 0.429 -.2006042 .4713129
_cons | 1.770705 .1125176 15.74 0.000 1.549609 1.991801
-------------+----------------------------------------------------------------
sigma_u | .35036238
sigma_e | .26779328
rho | .6312319 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(88, 473) = 9.01 Prob > F = 0.0000
. regress ln_wage grade age ttl_exp tenure not_smsa south i.idcode, noconstant
Source | SS df MS Number of obs = 567
-------------+---------------------------------- F(94, 473) = 305.03
Model | 2056.18728 94 21.8743328 Prob > F = 0.0000
Residual | 33.9203619 473 .071713239 R-squared = 0.9838
-------------+---------------------------------- Adj R-squared = 0.9805
Total | 2090.10764 567 3.68625687 Root MSE = .26779
------------------------------------------------------------------------------
ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
grade | .1664948 .0111166 14.98 0.000 .1446507 .1883389
age | -.0045718 .0054274 -0.84 0.400 -.0152366 .0060931
ttl_exp | .0313655 .0091803 3.42 0.001 .0133263 .0494046
tenure | .0149681 .0061975 2.42 0.016 .0027901 .0271462
not_smsa | .0003284 .0743637 0.00 0.996 -.1457957 .1464524
south | .1353543 .170972 0.79 0.429 -.2006042 .4713129
|
idcode |
2 | -.4226575 .1098026 -3.85 0.000 -.6384187 -.2068963
3 | -.5725985 .104336 -5.49 0.000 -.7776178 -.3675791
[Many more `idcode` rows trimmed]
98 | -.0877863 .1183023 -0.74 0.458 -.3202494 .1446767
99 | -.127322 .1556316 -0.82 0.414 -.4331369 .1784928
------------------------------------------------------------------------------
. restore
grade is not estimated in the fixed effects model because
it is time-invarying; within each individual it is constant. In the
regress model, we are able to estimate it.
xtreg reports 3 R-squared statistics; this is a within
variance model so we can use that value (which agrees with the
regression R-squared). Note that the regress R-squared
estimate is artificially inflated due to the massive amount of
predictors.
xtreg, be: Between Effect model (Between variance)
. xtreg ln_wage grade age ttl_exp tenure not_smsa south, be
Between regression (regression on group means) Number of obs = 28,091
Group variable: idcode Number of groups = 4,697
R-squared: Obs per group:
Within = 0.1427 min = 1
Between = 0.4787 avg = 6.0
Overall = 0.3562 max = 15
F(6,4690) = 717.89
sd(u_i + avg(e_i.)) = .3068161 Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
grade | .064188 .0019539 32.85 0.000 .0603575 .0680185
age | -.0041071 .0010618 -3.87 0.000 -.0061886 -.0020255
ttl_exp | .0287514 .002033 14.14 0.000 .0247658 .0327369
tenure | .0286782 .0022174 12.93 0.000 .024331 .0330254
not_smsa | -.175568 .0111952 -15.68 0.000 -.1975159 -.1536202
south | -.1086271 .0098438 -11.04 0.000 -.1279256 -.0893287
_cons | .8066724 .0329873 24.45 0.000 .7420017 .871343
------------------------------------------------------------------------------
To replicate, collapse over idcode and run a regression:
. preserve
. collapse (mean) ln_wage grade age ttl_exp tenure not_smsa south, by(idcode)
. regress ln_wage grade age ttl_exp tenure not_smsa south
Source | SS df MS Number of obs = 4,697
-------------+---------------------------------- F(6, 4690) = 724.96
Model | 406.076398 6 67.6793996 Prob > F = 0.0000
Residual | 437.838178 4,690 .093355688 R-squared = 0.4812
-------------+---------------------------------- Adj R-squared = 0.4805
Total | 843.914575 4,696 .179709237 Root MSE = .30554
------------------------------------------------------------------------------
ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
grade | .0642015 .0019467 32.98 0.000 .0603849 .068018
age | -.0039009 .0010611 -3.68 0.000 -.0059812 -.0018206
ttl_exp | .0285569 .0020267 14.09 0.000 .0245836 .0325301
tenure | .0288392 .0022077 13.06 0.000 .024511 .0331673
not_smsa | -.1758847 .0111494 -15.78 0.000 -.1977428 -.1540266
south | -.1080377 .0098055 -11.02 0.000 -.1272611 -.0888142
_cons | .8000955 .0328792 24.33 0.000 .7356368 .8645542
------------------------------------------------------------------------------
. restore
Again, the coefficients agree to three decimals and the between R-square agrees.
All predictors here are estimated; if we had any time-variant by individual-invariant predictors (e.g. time), they would not be estimable here.
xtreg, re: Random Effect model (Both variances)
. xtreg ln_wage grade age ttl_exp tenure not_smsa south, re
Random-effects GLS regression Number of obs = 28,091
Group variable: idcode Number of groups = 4,697
R-squared: Obs per group:
Within = 0.1483 min = 1
Between = 0.4701 avg = 6.0
Overall = 0.3569 max = 15
Wald chi2(6) = 8304.62
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
------------------------------------------------------------------------------
ln_wage | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
grade | .0691836 .0017689 39.11 0.000 .0657166 .0726506
age | -.0038386 .0006544 -5.87 0.000 -.0051212 -.0025559
ttl_exp | .0301313 .0011215 26.87 0.000 .0279331 .0323294
tenure | .0134656 .0008442 15.95 0.000 .011811 .0151202
not_smsa | -.128591 .0072246 -17.80 0.000 -.142751 -.114431
south | -.0932646 .007231 -12.90 0.000 -.107437 -.0790921
_cons | .7544109 .0273445 27.59 0.000 .7008168 .8080051
-------------+----------------------------------------------------------------
sigma_u | .26027808
sigma_e | .29477563
rho | .43808743 (fraction of variance due to u_i)
------------------------------------------------------------------------------
Using mixed:
. mixed ln_wage grade age ttl_exp tenure not_smsa south || idcode:
Performing EM optimization ...
Performing gradient-based optimization:
Iteration 0: Log likelihood = -9218.9773
Iteration 1: Log likelihood = -9218.9773
Computing standard errors ...
Mixed-effects ML regression Number of obs = 28,091
Group variable: idcode Number of groups = 4,697
Obs per group:
min = 1
avg = 6.0
max = 15
Wald chi2(6) = 8496.81
Log likelihood = -9218.9773 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
ln_wage | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
grade | .0691186 .0017231 40.11 0.000 .0657414 .0724959
age | -.003869 .0006491 -5.96 0.000 -.0051411 -.0025969
ttl_exp | .030151 .0011135 27.08 0.000 .0279687 .0323334
tenure | .013591 .0008441 16.10 0.000 .0119365 .0152454
not_smsa | -.1299789 .007154 -18.17 0.000 -.1440004 -.1159575
south | -.0941264 .0071291 -13.20 0.000 -.1080991 -.0801537
_cons | .7566548 .0267655 28.27 0.000 .7041954 .8091142
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects parameters | Estimate Std. err. [95% conf. interval]
-----------------------------+------------------------------------------------
idcode: Identity |
var(_cons) | .0626522 .0017678 .0592815 .0662147
-----------------------------+------------------------------------------------
var(Residual) | .087569 .000811 .0859938 .0891732
------------------------------------------------------------------------------
LR test vs. linear model: chibar2(01) = 7277.75 Prob >= chibar2 = 0.0000
The results are very close; we can get even closer by fitting the
xtreg model with the mle option, which uses
a different estimation strategy.
. xtreg ln_wage grade age ttl_exp tenure not_smsa south, re mle
Fitting constant-only model:
Iteration 0: Log likelihood = -12663.954
Iteration 1: Log likelihood = -12649.756
Iteration 2: Log likelihood = -12649.614
Iteration 3: Log likelihood = -12649.614
Fitting full model:
Iteration 0: Log likelihood = -9271.8615
Iteration 1: Log likelihood = -9219.1214
Iteration 2: Log likelihood = -9218.9773
Iteration 3: Log likelihood = -9218.9773
Random-effects ML regression Number of obs = 28,091
Group variable: idcode Number of groups = 4,697
Random effects u_i ~ Gaussian Obs per group:
min = 1
avg = 6.0
max = 15
LR chi2(6) = 6861.27
Log likelihood = -9218.9773 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
ln_wage | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
grade | .0691186 .0017233 40.11 0.000 .0657411 .0724962
age | -.003869 .0006491 -5.96 0.000 -.0051413 -.0025967
ttl_exp | .030151 .0011135 27.08 0.000 .0279687 .0323334
tenure | .013591 .0008454 16.08 0.000 .0119341 .0152478
not_smsa | -.1299789 .0071711 -18.13 0.000 -.144034 -.1159239
south | -.0941264 .0071356 -13.19 0.000 -.1081119 -.0801409
_cons | .7566548 .0267773 28.26 0.000 .7041722 .8091374
-------------+----------------------------------------------------------------
/sigma_u | .2503043 .0035313 .2434779 .2573221
/sigma_e | .2959207 .0013704 .293247 .2986188
rho | .4170663 .0074745 .4024774 .4317704
------------------------------------------------------------------------------
LR test of sigma_u=0: chibar2(01) = 7277.75 Prob >= chibar2 = 0.000