Chapter 9 Exercise solutions
9.1 Exercise 1
. webuse nhanes2, clear
1)
. describe, short
Contains data from https://www.statapress.com/data/r16/nhanes2.dta
obs: 10,351
vars: 58 20 Dec 2018 10:07
Sorted by:
There are 10,351 observations of 59 variables. The full describe
output is suppressed for space, but you should run it.
3)
. tab race, mi
1=white, 
2=black, 
3=other  Freq. Percent Cum.
+
White  9,065 87.58 87.58
Black  1,086 10.49 98.07
Other  200 1.93 100.00
+
Total  10,351 100.00
No missing data.
. tab diabetes, mi
diabetes, 
1=yes, 0=no  Freq. Percent Cum.
+
0  9,850 95.16 95.16
1  499 4.82 99.98
.  2 0.02 100.00
+
Total  10,351 100.00
Two missing values.
lead
is continuous, so a table isn’t the most effective.
. codebook lead

lead lead (mcg/dL)

type: numeric (byte)
range: [2,80] units: 1
unique values: 53 missing .: 5,403/10,351
mean: 14.3203
std. dev: 6.16647
percentiles: 10% 25% 50% 75% 90%
8 10 13 17 22
There’s a lot of missingness.
4)
. pwcorr height weight bp*
 height weight bpsystol bpdiast
+
height  1.0000
weight  0.4775 1.0000
bpsystol  0.0364 0.2861 1.0000
bpdiast  0.0675 0.3799 0.6831 1.0000
Blood pressure is highly correlated, moreso than height and weight. Weight is also correlated with both forms of BP. Height looks to be completely independent of boood pressure.
9.2 Exercise 2
. webuse nhanes2, clear
. twoway (scatter bpdiast bpsystol if sex == 1, mcolor(blue)) ///
> (scatter bpdiast bpsystol if sex == 2, mcolor(pink)) ///
> (lfit bpdiast bpsystol if sex == 1, lcolor(blue)) ///
> (lfit bpdiast bpsystol if sex == 2, lcolor(pink)), ///
> legend(label(1 "Men") label(2 "Women") order(1 2))
We can see the correlation between blood pressure measures, with a bit stronger of a relationship for men.
9.3 Exercise 3
. webuse nhanes2, clear
1)
The sample size is massive, so the central limit theorem suffices.
2)
. ttest height == 176 if sex == 1
Onesample t test

Variable  Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
+
height  4,915 174.7421 .1026447 7.196115 174.5408 174.9433

mean = mean(height) t = 12.2553
Ho: mean = 176 degrees of freedom = 4914
Ha: mean < 176 Ha: mean != 176 Ha: mean > 176
Pr(T < t) = 0.0000 Pr(T > t) = 0.0000 Pr(T > t) = 1.0000
The test rejects; the average height of men in the sample is lower than the national average.
3)
. ttest age, by(sex)
Twosample t test with equal variances

Group  Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
+
Male  4,915 47.4238 .2448869 17.1683 46.94372 47.90389
Female  5,436 47.72057 .2340613 17.25716 47.26171 48.17942
+
combined  10,351 47.57965 .1692044 17.21483 47.24798 47.91133
+
diff  .2967619 .338842 .9609578 .3674339

diff = mean(Male)  mean(Female) t = 0.8758
Ho: diff = 0 degrees of freedom = 10349
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.1906 Pr(T > t) = 0.3812 Pr(T > t) = 0.8094
We fail to reject; there is no difference that the average age differs by gender.
4)
. tab race diabetes
1=white, 
2=black,  diabetes, 1=yes, 0=no
3=other  0 1  Total
++
White  8,659 404  9,063
Black  1,000 86  1,086
Other  191 9  200
++
Total  9,850 499  10,349
Given the different scales per race, it’s hard to draw a comparison. We can look at the rowwise percents. (If you ran tab diabetes race
, you’d need the columnwise percents.)
. tab race diabetes, row chi2
++
 Key 

 frequency 
 row percentage 
++
1=white, 
2=black,  diabetes, 1=yes, 0=no
3=other  0 1  Total
++
White  8,659 404  9,063
 95.54 4.46  100.00
++
Black  1,000 86  1,086
 92.08 7.92  100.00
++
Other  191 9  200
 95.50 4.50  100.00
++
Total  9,850 499  10,349
 95.18 4.82  100.00
Pearson chi2(2) = 25.3630 Pr = 0.000
Nearly double the percent of blacks have diabetes and the \(\chi^2\) test confirms the difference is statistically significant.
9.4 Exercise 4
. webuse nhanes2, clear
. regress lead i.sex i.race c.age c.weight c.height i.region
Source  SS df MS Number of obs = 4,948
+ F(9, 4938) = 129.98
Model  36027.945 9 4003.105 Prob > F = 0.0000
Residual  152083.33 4,938 30.7985682 Rsquared = 0.1915
+ Adj Rsquared = 0.1901
Total  188111.275 4,947 38.0253234 Root MSE = 5.5496

lead  Coef. Std. Err. t P>t [95% Conf. Interval]
+
sex 
Female  5.26415 .2259894 23.29 0.000 5.70719 4.82111

race 
Black  2.937124 .2671182 11.00 0.000 2.413454 3.460794
Other  .5521668 .5617011 0.98 0.326 1.653351 .549017

age  .0196081 .0048918 4.01 0.000 .0100181 .0291981
weight  .0012435 .0059001 0.21 0.833 .0128103 .0103233
height  .0275532 .0127149 2.17 0.030 .05248 .0026264

region 
MW  .1099025 .233614 0.47 0.638 .5678897 .3480848
S  1.858491 .2346487 7.92 0.000 2.318507 1.398476
W  .2854048 .2380554 1.20 0.231 .7520992 .1812897

_cons  21.16891 2.199034 9.63 0.000 16.85783 25.48

1)
The Ftest rejects so the model is informative. The \(R^2\) is low, so there is a lot of variability we’re not capturing in this model.
2)
. rvfplot
This doesn’t look great. We don’t see any signs of nonnormality, but we do see a lot of very large positive residuals. If you look at a histogram for lead
,
. hist lead
(bin=36, start=2, width=2.1666667)
We see right skew. The maintainers of this data noticed the same concern, as they include a loglead
variable in the data to attempt to address this.
. desc loglead
storage display value
variable name type format label variable label

loglead float %9.0g log(lead)
Perhaps we should have run the model with loglead
as the output instead.
3)
. estat vif
Variable  VIF 1/VIF
+
2.sex  2.05 0.488262
race 
2  1.05 0.950220
3  1.06 0.941064
age  1.13 0.885331
weight  1.33 0.749151
height  2.44 0.410091
region 
2  1.71 0.583352
3  1.77 0.566163
4  1.73 0.576881
+
Mean VIF  1.59
Nothing of concern here. The only moderately high VIF’s are on sex and it’s interaction, which does not concern us (of course a main effect and interaction are collinear.).
4)
The coffecient on “Female” is 5 and is statistically significant, so there is evidence that males have higher average lead levels.
5)
The pvalue is very small, so it is statistically significant. However, if we look at lead levels:
. summ lead
Variable  Obs Mean Std. Dev. Min Max
+
lead  4,948 14.32033 6.166468 2 80
We see that lead levels range from 2 to 80. The coefficient on age is about .02, so age would need to increase by about 50 years to see a higher value for the lead score. Unlikely to be clinically interesting! This is a side effect of the massive sample size.
6)
. margins region
Predictive margins Number of obs = 4,948
Model VCE : OLS
Expression : Linear prediction, predict()

 Deltamethod
 Margin Std. Err. t P>t [95% Conf. Interval]
+
region 
NE  14.93498 .176734 84.51 0.000 14.5885 15.28145
MW  14.82507 .1532375 96.75 0.000 14.52466 15.12549
S  13.07649 .1525888 85.70 0.000 12.77734 13.37563
W  14.64957 .1588465 92.22 0.000 14.33816 14.96098

. margins region, pwcompare(pv)
Pairwise comparisons of predictive margins Number of obs = 4,948
Model VCE : OLS
Expression : Linear prediction, predict()

 Deltamethod Unadjusted
 Contrast Std. Err. t P>t
+
region 
MW vs NE  .1099025 .233614 0.47 0.638
S vs NE  1.858491 .2346487 7.92 0.000
W vs NE  .2854048 .2380554 1.20 0.231
S vs MW  1.748589 .2161764 8.09 0.000
W vs MW  .1755023 .2216647 0.79 0.429
W vs S  1.573087 .2227974 7.06 0.000

It looks like South is significantly lower levels of lead than the other regions, which show no difference between them.
7)
. regress lead i.sex##c.age i.race c.weight c.height i.region
Source  SS df MS Number of obs = 4,948
+ F(10, 4937) = 123.65
Model  37676.3642 10 3767.63642 Prob > F = 0.0000
Residual  150434.91 4,937 30.4709156 Rsquared = 0.2003
+ Adj Rsquared = 0.1987
Total  188111.275 4,947 38.0253234 Root MSE = 5.52

lead  Coef. Std. Err. t P>t [95% Conf. Interval]
+
sex 
Female  8.485856 .4923314 17.24 0.000 9.451044 7.520667
age  .0150619 .0067745 2.22 0.026 .0283429 .0017809

sex#c.age 
Female  .0676205 .0091936 7.36 0.000 .0495969 .0856441

race 
Black  2.985719 .2657756 11.23 0.000 2.464681 3.506758
Other  .5307013 .5587129 0.95 0.342 1.626027 .5646243

weight  .0044452 .0058847 0.76 0.450 .0159819 .0070915
height  .0266069 .0126477 2.10 0.035 .0514021 .0018118

region 
MW  .1417546 .2324084 0.61 0.542 .5973783 .3138691
S  1.897967 .2334589 8.13 0.000 2.35565 1.440284
W  .2945939 .2367891 1.24 0.214 .7588057 .169618

_cons  22.89998 2.199931 10.41 0.000 18.58714 27.21282

. margins sex, dydx(age)
Average marginal effects Number of obs = 4,948
Model VCE : OLS
Expression : Linear prediction, predict()
dy/dx w.r.t. : age

 Deltamethod
 dy/dx Std. Err. t P>t [95% Conf. Interval]
+
age 
sex 
Male  .0150619 .0067745 2.22 0.026 .0283429 .0017809
Female  .0525586 .006614 7.95 0.000 .0395923 .0655249

. quietly margins sex, at(age = (20 45 70))
. marginsplot
Variables that uniquely identify margins: age sex
We see significance in the interaction, so we looked at the margins. It looks like men show a slight decline in lead as age increases (again, rescaling, .015/year becomes 1.5 over 100 years  not very interesting) while women show a much more significant increase as age increases (roughly 1 unit every 20 years). The marginal plot helps us to visualize this. For men, from age 20 to 70, the average lead decreases barely half a point. For women, we see nearly a 3 point average increase.
9.5 Exercise 5
. webuse nhanes2, clear
. logit diabetes i.sex i.race c.age weight height i.region
Iteration 0: log likelihood = 1999.7591
Iteration 1: log likelihood = 1819.9899
Iteration 2: log likelihood = 1777.7462
Iteration 3: log likelihood = 1776.9939
Iteration 4: log likelihood = 1776.9935
Iteration 5: log likelihood = 1776.9935
Logistic regression Number of obs = 10,349
LR chi2(9) = 445.53
Prob > chi2 = 0.0000
Log likelihood = 1776.9935 Pseudo R2 = 0.1114

diabetes  Coef. Std. Err. z P>z [95% Conf. Interval]
+
sex 
Female  .0934803 .137068 0.68 0.495 .175168 .3621286

race 
Black  .5781259 .1326244 4.36 0.000 .3181869 .8380648
Other  .408093 .3623376 1.13 0.260 .3020757 1.118262

age  .058799 .0039646 14.83 0.000 .0510286 .0665695
weight  .0268805 .0031111 8.64 0.000 .0207829 .0329782
height  .0220928 .0076659 2.88 0.004 .0371175 .007068

region 
MW  .0265833 .1418506 0.19 0.851 .3046054 .2514388
S  .0998071 .1375642 0.73 0.468 .1698138 .3694281
W  .0984779 .1457444 0.68 0.499 .3841316 .1871758

_cons  4.645124 1.3496 3.44 0.001 7.290292 1.999956

1)
. estat gof
Logistic model for diabetes, goodnessoffit test

number of observations = 10349
number of covariate patterns = 10349
Pearson chi2(10339) = 10025.54
Prob > chi2 = 0.9860
. estat gof, group(20)
Logistic model for diabetes, goodnessoffit test

(Table collapsed on quantiles of estimated probabilities)
number of observations = 10349
number of groups = 20
HosmerLemeshow chi2(18) = 15.92
Prob > chi2 = 0.5983
. lroc
Logistic model for diabetes
number of observations = 10349
area under ROC curve = 0.7665
We cannot reject the model fit (even once we switch to the proper HosmerLemeshow test, which used 20 instead of 10 because we have 10 predictors). The ROC and AUC look decent but not great.
2)
. margins race, pwcompare(pv)
Pairwise comparisons of predictive margins Number of obs = 10,349
Model VCE : OIM
Expression : Pr(diabetes), predict()

 Deltamethod Unadjusted
 Contrast Std. Err. z P>z
+
race 
Black vs White  .0301076 .0081584 3.69 0.000
Other vs White  .0197925 .0204751 0.97 0.334
Other vs Black  .0103152 .0220354 0.47 0.640

. margins region, pwcompare(pv)
Pairwise comparisons of predictive margins Number of obs = 10,349
Model VCE : OIM
Expression : Pr(diabetes), predict()

 Deltamethod Unadjusted
 Contrast Std. Err. z P>z
+
region 
MW vs NE  .0011484 .0061371 0.19 0.852
S vs NE  .0045414 .0062131 0.73 0.465
W vs NE  .0041309 .0061392 0.67 0.501
S vs MW  .0056899 .0056911 1.00 0.317
W vs MW  .0029825 .0056876 0.52 0.600
W vs S  .0086724 .0057535 1.51 0.132

Blacks are more likely to have diabetes than whites or others. Age and weight are positive predictors whereas height is a negative predictor for some reason. There is no effect of gender or region.
9.6 Exercise 6
. webuse chicken, clear
. melogit complain grade i.race i.gender tenure age income ///
> nworkers i.genderm  restaurant:
Fitting fixedeffects model:
Iteration 0: log likelihood = 1341.7541
Iteration 1: log likelihood = 1337.7735
Iteration 2: log likelihood = 1337.7659
Iteration 3: log likelihood = 1337.7659
Refining starting values:
Grid node 0: log likelihood = 1331.542
Fitting full model:
Iteration 0: log likelihood = 1331.542
Iteration 1: log likelihood = 1323.2469
Iteration 2: log likelihood = 1321.7555
Iteration 3: log likelihood = 1321.7325
Iteration 4: log likelihood = 1321.7325
Mixedeffects logistic regression Number of obs = 2,763
Group variable: restaurant Number of groups = 500
Obs per group:
min = 3
avg = 5.5
max = 8
Integration method: mvaghermite Integration pts. = 7
Wald chi2(9) = 117.56
Log likelihood = 1321.7325 Prob > chi2 = 0.0000

complain  Coef. Std. Err. z P>z [95% Conf. Interval]
+
grade  .0616416 .0463355 1.33 0.183 .0291742 .1524574

race 
2  .5512152 .140432 3.93 0.000 .2759736 .8264568
3  1.180759 .137298 8.60 0.000 .9116599 1.449858

1.gender  .5866638 .1059402 5.54 0.000 .3790247 .7943029
tenure  .0699963 .1727861 0.41 0.685 .4086509 .2686582
age  .0748883 .0230589 3.25 0.001 .1200828 .0296938
income  .0034531 .0016933 2.04 0.041 .006772 .0001343
nworkers  .0303442 .0391781 0.77 0.439 .1071318 .0464435
1.genderm  .0795912 .1245047 0.64 0.523 .1644336 .3236159
_cons  .3910687 1.103955 0.35 0.723 1.772644 2.554781
+
restaurant 
var(_cons) .5755458 .1417722 .3551455 .9327247

LR test vs. logistic model: chibar2(01) = 32.07 Prob >= chibar2 = 0.0000
1)
We can’t look at fit statistics, but the \(\chi^2\) is significant, so we’re doing better than chance.
2)
. margins race, pwcompare(pv)
Pairwise comparisons of predictive margins Number of obs = 2,763
Model VCE : OIM
Expression : Marginal predicted mean, predict()

 Deltamethod Unadjusted
 Contrast Std. Err. z P>z
+
race 
2 vs 1  .0665377 .0166607 3.99 0.000
3 vs 1  .1685295 .0185599 9.08 0.000
3 vs 2  .1019918 .019303 5.28 0.000

Unfortunately, this data is poorly labeled so we can’t talk in specifics about things, but generally
 Race 1, 2 and 3 have increasing odds of a complaint.
 Gender 1 has significantly higher odds than gender 2.
 Age and income are negatively related to the odds of a complaint (older, more well paid employees are less likely to have complaints).
 Neither restaurant level characteristic is significant once server characteristics are accounted for.
3)
The estimated random variance is nonzero, so yes, the random effects for restaurants are warranted.
We won’t cover in this class, but there are multipleequation estimating commands which have syntax
command (varlist) (varlist) … (varlist) [if] [in] [weight] [,options]
. ↩︎We could just run
correlate
, but the postestimation commands followingmean
are fairly limited, so bare with me here. Postestimation commands following models are much more interesting!↩︎This is true of most pvalues  it’s often the case that large sample sizes can provide small pvalues for scientifically insignificant effects. However, correlation is especially susceptible to this issue.↩︎
This is by the central limit theorem.↩︎
Named not for students in a class, but the pseudonym for statistician William Sealy Gosset. Gosset worked at the Guinness brewery when he was performing some of his seminal work and wasn’t allowed to publish under his own name, so he used the pseudonym Student.↩︎
This requires the assumption that the ordinal variable can be well approximated by a continuous variable. This assumption is fine if each level of the ordinal variable is “equally” spaced (e.g. the amount of “effort” to go from level 1 to level 2 is the same as from level 4 to level 5).↩︎
The statistical reasoning is that when we’re dealing with proportions, the variance is determined directly by the mean. A ttest assumes we need to estimate the variance as well. A Ztest assumes we know the variance, which will be more efficient.↩︎
Unless you’re in a pathological setting, i.e. you’re testing whether the average height of a population is less than 6 feet, and somehow you get an average height of 20 feet….↩︎
There are variations of regression with multiple outcomes, but they are for very specialized circumstances and can generally be fit as several basic regression models instead.↩︎
The 2 and 71 are degrees of freedom. They don’t typically add any interpretation.↩︎
The only exception is if the predictor being added is either constant or identical to another variable.↩︎
This is why it’s important to familiarize yourself with the units in your data!↩︎
Once we know the value of all but one of the dummies, the last is automatically determined so adds no new information.↩︎
It’s a bit confusing because if you look at the
codebook
forrep78
, Stata correctly surmises that it is categorical. However, when it comes to modeling, Stata is not willing to make that assumption.↩︎Note that if you provide data with perfect correlation, Stata will drop one of them for you. This in only a thought exercise. If it helps, imagine their correlation is 99% instead of perfect, and add “almost” as a qualifier to most claims.↩︎
Technically there could be a scaling factors such that \(X_1 = aX_2 + b\), but let’s assume without loss of generality that \(a=1\) and \(b=0\).↩︎
Technically that maximizes likelihood, but that distinction is not important for understanding.↩︎
She may have salary information for only a subset of those years, but would have missing values in the other years.↩︎
Though why called it mixed at that point?↩︎
Typically, salary information is very rightskewed, and a log transformation makes the data closer to normal.↩︎
There is technically Little’s MCAR test to compare MCAR vs MAR, but the majority of imputation methods require only MAR, not MCAR, so it’s of limited use. Additionally, it is not yet supported in Stata.↩︎
Technically this happens as soon as you run
mi set
, but they’re not interesting until aftermi impute
.↩︎