8 Exercise solutions
8.1 Exercise 1
webuse nhanes2, clear .
describe, short
.
data from https://www.stata-press.com/data/r18/nhanes2.dta
Contains
Observations: 10,351
Variables: 58 20 Dec 2022 10:07by: Sorted
There are 10,351 observations of 59 variables. The full describe
output is suppressed for space, but you should run it.
tab race, mi
.
Race | Freq. Percent Cum.
------------+-----------------------------------
White | 9,065 87.58 87.58
Black | 1,086 10.49 98.07
Other | 200 1.93 100.00
------------+-----------------------------------
Total | 10,351 100.00
No missing data.
tab diabetes, mi
.
Diabetes |
status | Freq. Percent Cum.
-------------+-----------------------------------
Not diabetic | 9,850 95.16 95.16
Diabetic | 499 4.82 99.98
. | 2 0.02 100.00
-------------+----------------------------------- Total | 10,351 100.00
Two missing values.
lead
is continuous, so a table isn’t the most effective.
codebook lead
.
-------------------------------------------------------------------------------
lead Lead (mcg/dL)
-------------------------------------------------------------------------------
byte)
Type: Numeric (
Range: [2,80] Units: 1values: 53 Missing .: 5,403/10,351
Unique
Mean: 14.3203dev.: 6.16647
Std.
Percentiles: 10% 25% 50% 75% 90% 8 10 13 17 22
There’s a lot of missingness.
pwcorr height weight bp*
.
weight bpsystol bpdiast
| height
-------------+------------------------------------
height | 1.0000 weight | 0.4775 1.0000
bpsystol | -0.0364 0.2861 1.0000 bpdiast | 0.0675 0.3799 0.6831 1.0000
Blood pressure is highly correlated, more-so than height and weight. Weight is also correlated with both forms of BP. Height looks to be completely independent of blood pressure.
8.2 Exercise 2
webuse nhanes2, clear
.
twoway (scatter bpdiast bpsystol if sex == 1, mcolor(blue)) ///
. scatter bpdiast bpsystol if sex == 2, mcolor(pink)) ///
> (lfit bpdiast bpsystol if sex == 1, lcolor(blue)) ///
> (lfit bpdiast bpsystol if sex == 2, lcolor(pink)), ///
> (legend(label(1 "Men") label(2 "Women") order(1 2)) >
We can see the correlation between blood pressure measures, with a bit stronger of a relationship for men.
8.3 Exercise 3
webuse nhanes2, clear .
The sample size is massive, so the central limit theorem suffices.
ttest height == 176 if sex == 1
.
sample t test
One-
------------------------------------------------------------------------------dev. [95% conf. interval]
Variable | Obs Mean Std. err. Std.
---------+--------------------------------------------------------------------
height | 4,915 174.7421 .1026447 7.196115 174.5408 174.9433
------------------------------------------------------------------------------mean = mean(height) t = -12.2553
mean = 176 Degrees of freedom = 4914
H0:
mean < 176 Ha: mean != 176 Ha: mean > 176
Ha: Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000
The test rejects; the average height of men in the sample is lower than the national average.
ttest age, by(sex)
.
sample t test with equal variances
Two-
------------------------------------------------------------------------------dev. [95% conf. interval]
Group | Obs Mean Std. err. Std.
---------+--------------------------------------------------------------------
Male | 4,915 47.4238 .2448869 17.1683 46.94372 47.90389
Female | 5,436 47.72057 .2340613 17.25716 47.26171 48.17942
---------+--------------------------------------------------------------------
Combined | 10,351 47.57965 .1692044 17.21483 47.24798 47.91133
---------+--------------------------------------------------------------------diff | -.2967619 .338842 -.9609578 .3674339
------------------------------------------------------------------------------diff = mean(Male) - mean(Female) t = -0.8758
diff = 0 Degrees of freedom = 10349
H0:
diff < 0 Ha: diff != 0 Ha: diff > 0
Ha: Pr(T < t) = 0.1906 Pr(|T| > |t|) = 0.3812 Pr(T > t) = 0.8094
We fail to reject; there is no difference that the average age differs by gender.
tab race diabetes
.
| Diabetes status
Race | Not diabe Diabetic | Total
-----------+----------------------+----------
White | 8,659 404 | 9,063
Black | 1,000 86 | 1,086
Other | 191 9 | 200
-----------+----------------------+---------- Total | 9,850 499 | 10,349
Given the different scales per race, it’s hard to draw a comparison. We can look at the row-wise percents. (If you ran tab diabetes race
, you’d need the column-wise percents.)
tab race diabetes, row chi2
.
+----------------+
| Key |
|----------------|frequency |
| row percentage |
|
+----------------+
| Diabetes status
Race | Not diabe Diabetic | Total
-----------+----------------------+----------
White | 8,659 404 | 9,063
| 95.54 4.46 | 100.00
-----------+----------------------+----------
Black | 1,000 86 | 1,086
| 92.08 7.92 | 100.00
-----------+----------------------+----------
Other | 191 9 | 200
| 95.50 4.50 | 100.00
-----------+----------------------+----------
Total | 9,850 499 | 10,349
| 95.18 4.82 | 100.00
chi2(2) = 25.3630 Pr = 0.000 Pearson
Nearly double the percent of blacks have diabetes and the \(\chi^2\) test confirms the difference is statistically significant.
8.4 Exercise 4
webuse nhanes2, clear
.
regress lead i.sex i.race c.age c.weight c.height i.region
.
of obs = 4,948
Source | SS df MS Number F(9, 4938) = 129.98
-------------+---------------------------------- F = 0.0000
Model | 36027.945 9 4003.105 Prob >
Residual | 152083.33 4,938 30.7985682 R-squared = 0.1915
-------------+---------------------------------- Adj R-squared = 0.1901
Total | 188111.275 4,947 38.0253234 Root MSE = 5.5496
------------------------------------------------------------------------------
lead | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
sex |
Female | -5.26415 .2259894 -23.29 0.000 -5.70719 -4.82111
|
race |
Black | 2.937124 .2671182 11.00 0.000 2.413454 3.460794
Other | -.5521668 .5617011 -0.98 0.326 -1.653351 .549017
|
age | .0196081 .0048918 4.01 0.000 .0100181 .0291981weight | -.0012435 .0059001 -0.21 0.833 -.0128103 .0103233
height | -.0275532 .0127149 -2.17 0.030 -.05248 -.0026264
|
region |
MW | -.1099025 .233614 -0.47 0.638 -.5678897 .3480848
S | -1.858491 .2346487 -7.92 0.000 -2.318507 -1.398476
W | -.2854048 .2380554 -1.20 0.231 -.7520992 .1812897
|_cons | 21.16891 2.199034 9.63 0.000 16.85783 25.48
------------------------------------------------------------------------------
The F-test rejects so the model is informative. The \(R^2\) is low, so there is a lot of variability we’re not capturing in this model.
rvfplot .
This doesn’t look great. We don’t see any signs of non-normality, but we do see a lot of very large positive residuals. If you look at a histogram for lead
,
hist lead
. bin=36, start=2, width=2.1666667) (
We see right skew. The maintainers of this data noticed the same concern, as they include a loglead
variable in the data to attempt to address this.
desc loglead
.
Variable Storage Display Valuename type format label Variable label
-------------------------------------------------------------------------------float %9.0g log(lead) loglead
Perhaps we should have run the model with loglead
as the output instead.
estat vif
.
Variable | VIF 1/VIF
-------------+----------------------
2.sex | 2.05 0.488262
race |
2 | 1.05 0.950220
3 | 1.06 0.941064
age | 1.13 0.885331weight | 1.33 0.749151
height | 2.44 0.410091
region |
2 | 1.71 0.583352
3 | 1.77 0.566163
4 | 1.73 0.576881
-------------+---------------------- Mean VIF | 1.59
Nothing of concern here. The only moderately high VIF’s are on sex and it’s interaction, which does not concern us (of course a main effect and interaction are collinear.).
The coefficient on “Female” is -5 and is statistically significant, so there is evidence that males have higher average lead levels.
The p-value is very small, so it is statistically significant. However, if we look at lead levels:
. summ lead
dev. Min Max
Variable | Obs Mean Std.
-------------+--------------------------------------------------------- lead | 4,948 14.32033 6.166468 2 80
We see that lead levels range from 2 to 80. The coefficient on age is about .02, so age would need to increase by about 50 years to see a higher value for the lead score. Unlikely to be clinically interesting! This is a side effect of the massive sample size.
. margins region
of obs = 4,948
Predictive margins Number VCE: OLS
Model
predict()
Expression: Linear prediction,
------------------------------------------------------------------------------
| Delta-methodstd. err. t P>|t| [95% conf. interval]
| Margin
-------------+----------------------------------------------------------------
region |
NE | 14.93498 .176734 84.51 0.000 14.5885 15.28145
MW | 14.82507 .1532375 96.75 0.000 14.52466 15.12549
S | 13.07649 .1525888 85.70 0.000 12.77734 13.37563
W | 14.64957 .1588465 92.22 0.000 14.33816 14.96098
------------------------------------------------------------------------------
. margins region, pwcompare(pv)
of predictive margins Number of obs = 4,948
Pairwise comparisons VCE: OLS
Model
predict()
Expression: Linear prediction,
-----------------------------------------------------
| Delta-method Unadjustedstd. err. t P>|t|
| Contrast
-------------+---------------------------------------
region |
MW vs NE | -.1099025 .233614 -0.47 0.638
S vs NE | -1.858491 .2346487 -7.92 0.000
W vs NE | -.2854048 .2380554 -1.20 0.231
S vs MW | -1.748589 .2161764 -8.09 0.000
W vs MW | -.1755023 .2216647 -0.79 0.429
W vs S | 1.573087 .2227974 7.06 0.000 -----------------------------------------------------
It looks like South is significantly lower levels of lead than the other regions, which show no difference between them.
regress lead i.sex##c.age i.race c.weight c.height i.region
.
of obs = 4,948
Source | SS df MS Number F(10, 4937) = 123.65
-------------+---------------------------------- F = 0.0000
Model | 37676.3642 10 3767.63642 Prob >
Residual | 150434.91 4,937 30.4709156 R-squared = 0.2003
-------------+---------------------------------- Adj R-squared = 0.1987
Total | 188111.275 4,947 38.0253234 Root MSE = 5.52
------------------------------------------------------------------------------
lead | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
sex |
Female | -8.485856 .4923314 -17.24 0.000 -9.451044 -7.520667
age | -.0150619 .0067745 -2.22 0.026 -.0283429 -.0017809
|
sex#c.age |
Female | .0676205 .0091936 7.36 0.000 .0495969 .0856441
|
race |
Black | 2.985719 .2657756 11.23 0.000 2.464681 3.506758
Other | -.5307013 .5587129 -0.95 0.342 -1.626027 .5646243
|weight | -.0044452 .0058847 -0.76 0.450 -.0159819 .0070915
height | -.0266069 .0126477 -2.10 0.035 -.0514021 -.0018118
|
region |
MW | -.1417546 .2324084 -0.61 0.542 -.5973783 .3138691
S | -1.897967 .2334589 -8.13 0.000 -2.35565 -1.440284
W | -.2945939 .2367891 -1.24 0.214 -.7588057 .169618
|_cons | 22.89998 2.199931 10.41 0.000 18.58714 27.21282
------------------------------------------------------------------------------
dydx(age)
. margins sex,
effects Number of obs = 4,948
Average marginal VCE: OLS
Model
predict()
Expression: Linear prediction,
dy/dx wrt: age
------------------------------------------------------------------------------
| Delta-methodstd. err. t P>|t| [95% conf. interval]
| dy/dx
-------------+----------------------------------------------------------------
age |
sex |
Male | -.0150619 .0067745 -2.22 0.026 -.0283429 -.0017809
Female | .0525586 .006614 7.95 0.000 .0395923 .0655249
------------------------------------------------------------------------------
quietly margins sex, at(age = (20 45 70))
.
. marginsplot
Variables that uniquely identify margins: age sex
We see significance in the interaction, so we looked at the margins. It looks like men show a slight decline in lead as age increases (again, rescaling, -.015/year becomes -1.5 over 100 years - not very interesting) while women show a much more significant increase as age increases (roughly 1 unit every 20 years). The marginal plot helps us to visualize this. For men, from age 20 to 70, the average lead decreases barely half a point. For women, we see nearly a 3 point average increase.
8.5 Exercise 5
webuse nhanes2, clear
.
logit diabetes i.sex i.race c.age weight height i.region
.
Iteration 0: Log likelihood = -1999.7591
Iteration 1: Log likelihood = -1819.9899
Iteration 2: Log likelihood = -1777.7462
Iteration 3: Log likelihood = -1776.9939
Iteration 4: Log likelihood = -1776.9935
Iteration 5: Log likelihood = -1776.9935
of obs = 10,349
Logistic regression Number chi2(9) = 445.53
LR chi2 = 0.0000
Prob >
Log likelihood = -1776.9935 Pseudo R2 = 0.1114
------------------------------------------------------------------------------
diabetes | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
sex |
Female | .0934803 .137068 0.68 0.495 -.175168 .3621286
|
race |
Black | .5781259 .1326244 4.36 0.000 .3181869 .8380648
Other | .408093 .3623376 1.13 0.260 -.3020757 1.118262
|
age | .058799 .0039646 14.83 0.000 .0510286 .0665695weight | .0268805 .0031111 8.64 0.000 .0207829 .0329782
height | -.0220928 .0076659 -2.88 0.004 -.0371175 -.007068
|
region |
MW | -.0265833 .1418506 -0.19 0.851 -.3046054 .2514388
S | .0998071 .1375642 0.73 0.468 -.1698138 .3694281
W | -.0984779 .1457444 -0.68 0.499 -.3841316 .1871758
|_cons | -4.645124 1.3496 -3.44 0.001 -7.290292 -1.999956
------------------------------------------------------------------------------
estat gof
.
of-fit test after logistic model
Goodness-
Variable: diabetes
of observations = 10,349
Number of covariate patterns = 10,349
Number chi2(10339) = 10025.54
Pearson chi2 = 0.9860
Prob >
estat gof, group(20)
. note: obs collapsed on 20 quantiles of estimated probabilities.
of-fit test after logistic model
Goodness-
Variable: diabetes
of observations = 10,349
Number of groups = 20
Number chi2(18) = 15.92
Hosmer–Lemeshow chi2 = 0.5983
Prob >
lroc
.
model for diabetes
Logistic
of observations = 10349
Number Area under ROC curve = 0.7665
We cannot reject the model fit (even once we switch to the proper Hosmer-Lemeshow test, which used 20 instead of 10 because we have 10 predictors). The ROC and AUC look decent but not great.
. margins race, pwcompare(pv)
of predictive margins Number of obs = 10,349
Pairwise comparisons VCE: OIM
Model
predict()
Expression: Pr(diabetes),
--------------------------------------------------------
| Delta-method Unadjustedstd. err. z P>|z|
| Contrast
----------------+---------------------------------------
race |
Black vs White | .0301076 .0081584 3.69 0.000
Other vs White | .0197925 .0204751 0.97 0.334
Other vs Black | -.0103152 .0220354 -0.47 0.640
--------------------------------------------------------
. margins region, pwcompare(pv)
of predictive margins Number of obs = 10,349
Pairwise comparisons VCE: OIM
Model
predict()
Expression: Pr(diabetes),
-----------------------------------------------------
| Delta-method Unadjustedstd. err. z P>|z|
| Contrast
-------------+---------------------------------------
region |
MW vs NE | -.0011484 .0061371 -0.19 0.852
S vs NE | .0045414 .0062131 0.73 0.465
W vs NE | -.0041309 .0061392 -0.67 0.501
S vs MW | .0056899 .0056911 1.00 0.317
W vs MW | -.0029825 .0056876 -0.52 0.600
W vs S | -.0086724 .0057535 -1.51 0.132 -----------------------------------------------------
Blacks are more likely to have diabetes than whites or others. Age and weight are positive predictors whereas height is a negative predictor for some reason. There is no effect of gender or region.
8.6 Exercise 6
webuse chicken, clear
.
///
. melogit complain grade i.race i.gender tenure age income
> nworkers i.genderm || restaurant:
effects model:
Fitting fixed-
Iteration 0: Log likelihood = -1341.7541
Iteration 1: Log likelihood = -1337.7735
Iteration 2: Log likelihood = -1337.7659
Iteration 3: Log likelihood = -1337.7659
values:
Refining starting
Grid node 0: Log likelihood = -1331.542
model:
Fitting full
Iteration 0: Log likelihood = -1331.542
Iteration 1: Log likelihood = -1323.2469
Iteration 2: Log likelihood = -1321.7555
Iteration 3: Log likelihood = -1321.7325
Iteration 4: Log likelihood = -1321.7325
effects logistic regression Number of obs = 2,763
Mixed-variable: restaurant Number of groups = 500
Group
group:
Obs per min = 3
avg = 5.5max = 8
Integration method: mvaghermite Integration pts. = 7
chi2(9) = 117.56
Wald chi2 = 0.0000
Log likelihood = -1321.7325 Prob >
------------------------------------------------------------------------------
complain | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
grade | .0616416 .0463355 1.33 0.183 -.0291742 .1524574
|
race |
2 | .5512152 .140432 3.93 0.000 .2759736 .8264568
3 | 1.180759 .137298 8.60 0.000 .9116599 1.449858
|
1.gender | .5866638 .1059402 5.54 0.000 .3790247 .7943029
tenure | -.0699963 .1727861 -0.41 0.685 -.4086509 .2686582
age | -.0748883 .0230589 -3.25 0.001 -.1200828 -.0296938
income | -.0034531 .0016933 -2.04 0.041 -.006772 -.0001343
nworkers | -.0303442 .0391781 -0.77 0.439 -.1071318 .0464435
1.genderm | .0795912 .1245047 0.64 0.523 -.1644336 .3236159_cons | .3910687 1.103955 0.35 0.723 -1.772644 2.554781
-------------+----------------------------------------------------------------
restaurant |var(_cons)| .5755458 .1417722 .3551455 .9327247
------------------------------------------------------------------------------test vs. logistic model: chibar2(01) = 32.07 Prob >= chibar2 = 0.0000 LR
We can’t look at fit statistics, but the \(\chi^2\) is significant, so we’re doing better than chance.
. margins race, pwcompare(pv)
of predictive margins Number of obs = 2,763
Pairwise comparisons VCE: OIM
Model
mean, predict()
Expression: Marginal predicted
-----------------------------------------------------
| Delta-method Unadjustedstd. err. z P>|z|
| Contrast
-------------+---------------------------------------
race |
2 vs 1 | .0665377 .0166607 3.99 0.000
3 vs 1 | .1685295 .0185599 9.08 0.000
3 vs 2 | .1019918 .019303 5.28 0.000 -----------------------------------------------------
Unfortunately, this data is poorly labeled so we can’t talk in specifics about things, but generally
- Race 1, 2 and 3 have increasing odds of a complaint.
- Gender 1 has significantly higher odds than gender 2.
- Age and income are negatively related to the odds of a complaint (older, more well paid employees are less likely to have complaints).
- Neither restaurant level characteristic is significant once server characteristics are accounted for.
The estimated random variance is non-zero, so yes, the random effects for restaurants are warranted.