After running a regression model, the most common way of interpreting the relationships between the predictors and the outcome is to interpret the regression coefficients. In many situations these interpretations are straightforward; in other settings, however, interpretation can be tricky, or the coefficients may not be directly interpretable at all.
These complications arise most commonly when a model involves an interaction or moderation. In these sorts of models, interpreting the coefficients requires special care due to the relationships among the interacted predictors, and the information obtainable from the coefficients alone is often limited without resorting to significant algebra.
Marginal effects offer an improvement over simple coefficient interpretation. Marginal means allow you to estimate the average response under a set of conditions; e.g. the average response for each racial group, or the average response when age and weight are at certain values.
Marginal slopes estimate the slope between a predictor and the outcome when other variables are at some specific value; e.g. the average slope on age within each gender, or the average slope on age when weight is at certain values.
In either case, in addition to estimating these marginal effects, we can easily test hypotheses of equality between them via pairwise comparisons. For example, is the average response different by ethnicity, or is the slope on age different between genders?
If you are familiar with interpreting regression coefficients, and specifically interactions, you may have some idea of how to start addressing all the above examples. However, completely answering these research questions would require more than simply a table of regression coefficients. With marginal effects, one additional set of results can address all these questions.
Finally, marginal effect estimation is a step towards creating informative visualizations of relationships such as interaction plots.
The webuse command can load the data directly.
. webuse margex
(Artificial data for margins)
The marginaleffects package in R doesn’t like a variable being named “group”, so we rename it.
. rename group class
. summarize, sep(0)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
y | 3,000 69.73357 21.53986 0 146.3
outcome | 3,000 .1696667 .3754023 0 1
sex | 3,000 .5006667 .5000829 0 1
class | 3,000 1.828 .7732714 1 3
age | 3,000 39.799 11.54174 20 60
distance | 3,000 58.58566 181.3987 .25 748.8927
ycn | 3,000 74.47263 22.01264 0 153.8
yc | 3,000 .2413333 .4279633 0 1
treatment | 3,000 .5006667 .5000829 0 1
agegroup | 3,000 2.263667 .8316913 1 3
arm | 3,000 1.970667 .7899457 1 3
The haven package is used to read the data from Stata into R.
library(haven)
m <- read_dta("http://www.stata-press.com/data/r18/margex.dta")
m <- data.frame(m)
m$sex <- as_factor(m$sex)
names(m)[names(m) == "group"] <- "class"
m$class <- as_factor(m$class)
head(m)
y outcome sex class age distance ycn yc treatment agegroup arm
1 102.6 1 Female 1 55 11.75 97.3 1 1 3 1
2 77.2 0 Female 1 60 30.75 78.6 0 1 3 1
3 80.5 0 Female 1 27 14.25 54.5 0 1 1 1
4 82.5 1 Female 1 60 29.25 98.4 1 1 3 1
5 71.6 1 Female 1 52 19.00 105.3 1 1 3 1
6 83.2 1 Female 1 49 2.50 89.7 0 1 3 1
The marginaleffects package doesn’t like a variable being named “group”, so we rename it.
There are several packages which can be used to estimate marginal means. So far, this document implements emmeans and ggeffects, and I’ve begun writing up marginaleffects.
emmeans is based on an older package, lsmeans, and is more flexible and powerful. However, its syntax is slightly strange, and there are often several ways to obtain similar results, which may or may not agree. Additionally, while its immediate output is clear, accessing the raw data underlying that output takes some doing.
ggeffects is a package designed to tie into the dplyr/ggplot universe. Its results are always a clean data frame which can be manipulated or passed straight into ggplot. However, ggeffects can only estimate marginal means, not marginal slopes or pairwise comparisons. (If I’m wrong about that last statement, please let me know.)
marginaleffects is a newer package which offers a much more robust set of estimates. It focuses on observation-level predictions which can then be averaged, rather than estimating marginal effects directly. There are often multiple ways to obtain the same result, which can be very frustrating. It is still at an early version (0.18.0 at the time of last updating this document), so it may see breaking changes.
The marginaleffects website has a good write-up comparing these packages (as well as Stata). It is, naturally, written with a pro-marginaleffects bias, so keep that in mind while reading it.
In addition, the interactions package is used to plot interaction effects.
The ggeffects package requires several other packages to work fully. However, it does not declare them as dependencies and install them for you; rather, it requires you to ensure those packages are installed yourself.
If you do not have those packages loaded, ggeffects will load them on demand (rather than at the time ggeffects itself is loaded). If a required package is not installed, ggeffects will instead produce a warning, so keep an eye out for those and install packages as needed.
This document was last compiled using Stata/SE 18.0 and R 4.3.3.
R package versions at last compilation:
| Package | Version |
|---|---|
| emmeans | 1.10.0 |
| ggeffects | 1.5.1 |
| effects | 4.2.2 |
| datawizard | 0.10.0 |
| marginaleffects | 0.18.0 |
| ggplot2 | 3.5.0 |
| interactions | 1.1.5 |
. regress y i.class
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(2, 2997) = 15.48
Model | 14229.0349 2 7114.51743 Prob > F = 0.0000
Residual | 1377203.97 2,997 459.527518 R-squared = 0.0102
-------------+---------------------------------- Adj R-squared = 0.0096
Total | 1391433.01 2,999 463.965657 Root MSE = 21.437
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
class |
2 | .4084991 .8912269 0.46 0.647 -1.338979 2.155977
3 | 5.373138 1.027651 5.23 0.000 3.358165 7.38811
|
_cons | 68.35805 .6190791 110.42 0.000 67.14419 69.57191
------------------------------------------------------------------------------
. margins class
Adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
class |
1 | 68.35805 .6190791 110.42 0.000 67.14419 69.57191
2 | 68.76655 .6411134 107.26 0.000 67.50948 70.02361
3 | 73.73119 .8202484 89.89 0.000 72.12288 75.33949
------------------------------------------------------------------------------
Call:
lm(formula = y ~ class, data = m)
Residuals:
Min 1Q Median 3Q Max
-73.531 -14.758 0.101 14.536 72.569
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.3580 0.6191 110.419 < 2e-16 ***
class2 0.4085 0.8912 0.458 0.647
class3 5.3731 1.0277 5.229 1.83e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.44 on 2997 degrees of freedom
Multiple R-squared: 0.01023, Adjusted R-squared: 0.009566
F-statistic: 15.48 on 2 and 2997 DF, p-value: 2.045e-07
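The model-fitting and emmeans calls are not shown in this rendering; a minimal sketch that reproduces the summary above and the marginal means below (the name mod1 matches later references in this document):

library(emmeans)

mod1 <- lm(y ~ class, data = m)
summary(mod1)           # the model summary shown above
emmeans(mod1, "class")  # estimated marginal means by class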
class emmean SE df lower.CL upper.CL
1 68.4 0.619 2997 67.1 69.6
2 68.8 0.641 2997 67.5 70.0
3 73.7 0.820 2997 72.1 75.3
Confidence level used: 0.95
Note that by default, emmeans returns confidence intervals. If you prefer p-values, you can wrap the emmeans call in test:
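A sketch of that wrapped call:

test(emmeans(mod1, "class"))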
class emmean SE df t.ratio p.value
1 68.4 0.619 2997 110.419 <.0001
2 68.8 0.641 2997 107.261 <.0001
3 73.7 0.820 2997 89.889 <.0001
Call:
lm(formula = y ~ class, data = m)
Residuals:
Min 1Q Median 3Q Max
-73.531 -14.758 0.101 14.536 72.569
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.3580 0.6191 110.419 < 2e-16 ***
class2 0.4085 0.8912 0.458 0.647
class3 5.3731 1.0277 5.229 1.83e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.44 on 2997 degrees of freedom
Multiple R-squared: 0.01023, Adjusted R-squared: 0.009566
F-statistic: 15.48 on 2 and 2997 DF, p-value: 2.045e-07
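The ggeffects call is likewise not shown in this rendering; per the note below, the character form of the terms argument was used. A sketch:

library(ggeffects)

ggeffect(mod1, "class")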
# Predicted values of y
class | Predicted | 95% CI
--------------------------------
1 | 68.36 | 67.14, 69.57
2 | 68.77 | 67.51, 70.02
3 | 73.73 | 72.12, 75.34
Note that we could have run that as ggeffect(mod1, ~ class), with a formula. However, when we later want to specify certain values of the predictors, we need to use the character version, so I stick with it here.
Call:
lm(formula = y ~ class, data = m)
Residuals:
Min 1Q Median 3Q Max
-73.531 -14.758 0.101 14.536 72.569
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.3580 0.6191 110.419 < 2e-16 ***
class2 0.4085 0.8912 0.458 0.647
class3 5.3731 1.0277 5.229 1.83e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.44 on 2997 degrees of freedom
Multiple R-squared: 0.01023, Adjusted R-squared: 0.009566
F-statistic: 15.48 on 2 and 2997 DF, p-value: 2.045e-07
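The predictions() call for the output below is not shown in this rendering; a minimal sketch, using the class = unique form discussed just after:

library(marginaleffects)

predictions(mod1, newdata = datagrid(class = unique))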
class Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
1 68.4 0.619 110.4 <0.001 Inf 67.1 69.6
2 68.8 0.641 107.3 <0.001 Inf 67.5 70.0
3 73.7 0.820 89.9 <0.001 Inf 72.1 75.3
Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, y, class
Type: response
The datagrid() function can also take values explicitly.
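For example, a sketch of such a call (the original is not shown in this rendering):

predictions(mod1, newdata = datagrid(class = c(1, 2, 3)))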
class Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
1 68.4 0.619 110.4 <0.001 Inf 67.1 69.6
2 68.8 0.641 107.3 <0.001 Inf 67.5 70.0
3 73.7 0.820 89.9 <0.001 Inf 72.1 75.3
Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, y, class
Type: response
Passing the values explicitly like this is an alternative to class = unique for specifying the levels. (NOTE: These are the “levels”, not the values - it just happens that class has the same. See below for what happens when they differ.) We can specify a subset of levels as well.
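A sketch restricting to the first two classes:

predictions(mod1, newdata = datagrid(class = c(1, 2)))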
class Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
1 68.4 0.619 110 <0.001 Inf 67.1 69.6
2 68.8 0.641 107 <0.001 Inf 67.5 70.0
Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, y, class
Type: response
If your factor has labels which are not just the values, you need to refer to those.
Factor w/ 2 levels "Male","Female": 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, "label")= chr "Sex"
Error: The "sex" element of the `at` list corresponds to a factor variable. The values entered in the `at` list must be one of the factor levels: "Male", "Female".
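A sketch of the failing and corrected calls (the model here is an assumption, not shown in this rendering):

mod_sex <- lm(y ~ sex, data = m)
predictions(mod_sex, newdata = datagrid(sex = c(0, 1)))             # presumably triggers the error above
predictions(mod_sex, newdata = datagrid(sex = c("Male", "Female"))) # refer to the factor levels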
sex Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
Male 64.6 0.54 119 <0.001 Inf 63.5 65.6
Female 74.9 0.54 139 <0.001 Inf 73.8 75.9
Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, y, sex
Type: response
Assume a linear regression set up with a single categorical variable, \(G\), with three groups. We fit
\[ E(Y|G) = \beta_0 + \beta_1g_2 + \beta_2g_3, \]
where \(g_2\) and \(g_3\) are dummy variables representing membership in groups 2 and 3 respectively (\(g_{2i} = 1\) if observation \(i\) has \(G = 2\), and equal to 0 otherwise.) Since \(g_1\) is not found in the model, it is the reference category.
Therefore, \(\beta_0\) represents the average response among the reference category \(G = 1\). \(\beta_1\) represents the difference in the average response between groups \(G = 1\) and \(G = 2\). Therefore, \(\beta_0 + \beta_1\) is the average response in group \(G = 2\). A similar argument can be made about \(\beta_2\) and group 3.
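As a quick check, the marginal means can be recovered by hand from the coefficients of the class-only model above (assumed stored as mod1):

b <- coef(mod1)
b[["(Intercept)"]]                  # class 1 mean: 68.36
b[["(Intercept)"]] + b[["class2"]]  # class 2 mean: 68.77
b[["(Intercept)"]] + b[["class3"]]  # class 3 mean: 73.73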
. regress y i.sex##i.class
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(5, 2994) = 92.91
Model | 186898.199 5 37379.6398 Prob > F = 0.0000
Residual | 1204534.81 2,994 402.316235 R-squared = 0.1343
-------------+---------------------------------- Adj R-squared = 0.1329
Total | 1391433.01 2,999 463.965657 Root MSE = 20.058
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
sex |
Female | 21.62507 1.509999 14.32 0.000 18.66433 24.58581
|
class |
2 | 11.37444 1.573314 7.23 0.000 8.289552 14.45932
3 | 21.6188 1.588487 13.61 0.000 18.50416 24.73344
|
sex#class |
Female#2 | -4.851582 1.942744 -2.50 0.013 -8.66083 -1.042333
Female#3 | -6.084875 3.004638 -2.03 0.043 -11.97624 -.1935115
|
_cons | 50.6107 1.367932 37.00 0.000 47.92852 53.29288
------------------------------------------------------------------------------
. margins sex#class
Adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
sex#class |
Male#1 | 50.6107 1.367932 37.00 0.000 47.92852 53.29288
Male#2 | 61.98514 .7772248 79.75 0.000 60.46119 63.50908
Male#3 | 72.2295 .8074975 89.45 0.000 70.64619 73.8128
Female#1 | 72.23577 .63942 112.97 0.000 70.98203 73.48952
Female#2 | 78.75863 .9434406 83.48 0.000 76.90877 80.60849
Female#3 | 87.7697 2.468947 35.55 0.000 82.92869 92.6107
------------------------------------------------------------------------------
Call:
lm(formula = y ~ sex * class, data = m)
Residuals:
Min 1Q Median 3Q Max
-72.029 -13.285 0.464 13.741 67.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.611 1.368 36.998 < 2e-16 ***
sexFemale 21.625 1.510 14.321 < 2e-16 ***
class2 11.374 1.573 7.230 6.12e-13 ***
class3 21.619 1.588 13.610 < 2e-16 ***
sexFemale:class2 -4.852 1.943 -2.497 0.0126 *
sexFemale:class3 -6.085 3.005 -2.025 0.0429 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.06 on 2994 degrees of freedom
Multiple R-squared: 0.1343, Adjusted R-squared: 0.1329
F-statistic: 92.91 on 5 and 2994 DF, p-value: < 2.2e-16
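A sketch of the emmeans calls for this interaction model (again assuming the fitted object is named mod1):

mod1 <- lm(y ~ sex * class, data = m)
emmeans(mod1, c("class", "sex"))        # marginal means with CIs
test(emmeans(mod1, c("class", "sex")))  # the same, with p-values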
class sex emmean SE df lower.CL upper.CL
1 Male 50.6 1.368 2994 47.9 53.3
2 Male 62.0 0.777 2994 60.5 63.5
3 Male 72.2 0.807 2994 70.6 73.8
1 Female 72.2 0.639 2994 71.0 73.5
2 Female 78.8 0.943 2994 76.9 80.6
3 Female 87.8 2.469 2994 82.9 92.6
Confidence level used: 0.95
class sex emmean SE df t.ratio p.value
1 Male 50.6 1.368 2994 36.998 <.0001
2 Male 62.0 0.777 2994 79.752 <.0001
3 Male 72.2 0.807 2994 89.449 <.0001
1 Female 72.2 0.639 2994 112.971 <.0001
2 Female 78.8 0.943 2994 83.480 <.0001
3 Female 87.8 2.469 2994 35.549 <.0001
Call:
lm(formula = y ~ sex * class, data = m)
Residuals:
Min 1Q Median 3Q Max
-72.029 -13.285 0.464 13.741 67.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.611 1.368 36.998 < 2e-16 ***
sexFemale 21.625 1.510 14.321 < 2e-16 ***
class2 11.374 1.573 7.230 6.12e-13 ***
class3 21.619 1.588 13.610 < 2e-16 ***
sexFemale:class2 -4.852 1.943 -2.497 0.0126 *
sexFemale:class3 -6.085 3.005 -2.025 0.0429 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.06 on 2994 degrees of freedom
Multiple R-squared: 0.1343, Adjusted R-squared: 0.1329
F-statistic: 92.91 on 5 and 2994 DF, p-value: < 2.2e-16
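The corresponding ggeffects sketch (the call is not shown in this rendering):

ggeffect(mod1, c("class", "sex"))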
# Predicted values of y
sex: Male
class | Predicted | 95% CI
--------------------------------
1 | 50.61 | 47.93, 53.29
2 | 61.99 | 60.46, 63.51
3 | 72.23 | 70.65, 73.81
sex: Female
class | Predicted | 95% CI
--------------------------------
1 | 72.24 | 70.98, 73.49
2 | 78.76 | 76.91, 80.61
3 | 87.77 | 82.93, 92.61
Call:
lm(formula = y ~ sex * class, data = m)
Residuals:
Min 1Q Median 3Q Max
-72.029 -13.285 0.464 13.741 67.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.611 1.368 36.998 < 2e-16 ***
sexFemale 21.625 1.510 14.321 < 2e-16 ***
class2 11.374 1.573 7.230 6.12e-13 ***
class3 21.619 1.588 13.610 < 2e-16 ***
sexFemale:class2 -4.852 1.943 -2.497 0.0126 *
sexFemale:class3 -6.085 3.005 -2.025 0.0429 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.06 on 2994 degrees of freedom
Multiple R-squared: 0.1343, Adjusted R-squared: 0.1329
F-statistic: 92.91 on 5 and 2994 DF, p-value: < 2.2e-16
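The corresponding marginaleffects sketch:

predictions(mod1, newdata = datagrid(sex = unique, class = unique))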
sex class Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
Male 1 50.6 1.368 37.0 <0.001 993.0 47.9 53.3
Male 2 62.0 0.777 79.8 <0.001 Inf 60.5 63.5
Male 3 72.2 0.807 89.4 <0.001 Inf 70.6 73.8
Female 1 72.2 0.639 113.0 <0.001 Inf 71.0 73.5
Female 2 78.8 0.943 83.5 <0.001 Inf 76.9 80.6
Female 3 87.8 2.469 35.5 <0.001 917.1 82.9 92.6
Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, y, sex, class
Type: response
Again, we can specify specific levels in the datagrid:
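A sketch of such a call:

predictions(mod1, newdata = datagrid(sex = "Male", class = c(1, 3)))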
sex class Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
Male 1 50.6 1.368 37.0 <0.001 993.0 47.9 53.3
Male 3 72.2 0.807 89.4 <0.001 Inf 70.6 73.8
Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, y, sex, class
Type: response
Consider the simplest case, with two binary variables \(Z\) and \(K\). We fit the model
\[ E(Y|Z,K) = \beta_0 + \beta_1Z + \beta_2K + \beta_3ZK, \]
where \(ZK = 1\) only if both \(Z = 1\) and \(K = 1\).
This is functionally equivalent to defining a new variable \(L\),
\[ L = \begin{cases} 0, & K = 0\ \&\ Z = 0 \\ 1, & K = 0\ \&\ Z = 1 \\ 2, & K = 1\ \&\ Z = 0 \\ 3, & K = 1\ \&\ Z = 1, \end{cases} \]
and fitting the model
\[ E(Y|L) = \gamma_0 + \gamma_1l_1 + \gamma_2l_2 + \gamma_3l_3, \]
where the \(\gamma\)'s are combinations of the \(\beta\)'s from the first model (here \(\gamma_1 = \beta_1\), \(\gamma_2 = \beta_2\), and \(\gamma_3 = \beta_1 + \beta_2 + \beta_3\)).
. regress y i.sex c.age c.distance
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(3, 2996) = 143.27
Model | 174576.269 3 58192.0896 Prob > F = 0.0000
Residual | 1216856.74 2,996 406.16046 R-squared = 0.1255
-------------+---------------------------------- Adj R-squared = 0.1246
Total | 1391433.01 2,999 463.965657 Root MSE = 20.153
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
sex |
Female | 13.98753 .7768541 18.01 0.000 12.46431 15.51075
age | -.5038264 .0336538 -14.97 0.000 -.5698133 -.4378394
distance | -.0060725 .0020291 -2.99 0.003 -.0100511 -.0020939
_cons | 83.13802 1.327228 62.64 0.000 80.53565 85.74039
------------------------------------------------------------------------------
. margins, at(age = (30 40))
Predictive margins Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
1._at: age = 30
2._at: age = 40
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
_at |
1 | 74.67056 .4941029 151.12 0.000 73.70175 75.63938
2 | 69.6323 .3680117 189.21 0.000 68.91072 70.35388
------------------------------------------------------------------------------
. margins, at(age = (30 40) distance = (10 100))
Predictive margins Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
1._at: age = 30
distance = 10
2._at: age = 30
distance = 100
3._at: age = 40
distance = 10
4._at: age = 40
distance = 100
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
_at |
1 | 74.9656 .5034959 148.89 0.000 73.97837 75.95283
2 | 74.41907 .5014946 148.39 0.000 73.43576 75.40238
3 | 69.92733 .3809974 183.54 0.000 69.18029 70.67438
4 | 69.38081 .3774763 183.80 0.000 68.64067 70.12095
------------------------------------------------------------------------------
. margins sex, at(age = (30 40))
Predictive margins Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
1._at: age = 30
2._at: age = 40
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
_at#sex |
1#Male | 67.66747 .5597763 120.88 0.000 66.56989 68.76506
1#Female | 81.655 .6902601 118.30 0.000 80.30157 83.00843
2#Male | 62.62921 .5370234 116.62 0.000 61.57624 63.68218
2#Female | 76.61674 .5331296 143.71 0.000 75.5714 77.66208
------------------------------------------------------------------------------
Note that any variable specified in the at option must also appear in the formula. Any variables you place in the formula but not in at will be examined at each unique level (so don’t put a continuous variable there!).
Call:
lm(formula = y ~ sex + age + distance, data = m)
Residuals:
Min 1Q Median 3Q Max
-72.356 -13.792 -0.275 13.624 74.950
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 83.138023 1.327228 62.640 < 2e-16 ***
sexFemale 13.987531 0.776854 18.005 < 2e-16 ***
age -0.503826 0.033654 -14.971 < 2e-16 ***
distance -0.006073 0.002029 -2.993 0.00279 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.15 on 2996 degrees of freedom
Multiple R-squared: 0.1255, Adjusted R-squared: 0.1246
F-statistic: 143.3 on 3 and 2996 DF, p-value: < 2.2e-16
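Sketches of the emmeans calls producing the next several outputs (model assumed stored as mod1):

mod1 <- lm(y ~ sex + age + distance, data = m)
emmeans(mod1, "age", at = list(age = c(30, 40)))  # CIs; wrap in test() for p-values
emmeans(mod1, c("age", "distance"),
        at = list(age = c(30, 40), distance = c(10, 100)))
emmeans(mod1, c("age", "sex"), at = list(age = c(30, 40)))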
age emmean SE df lower.CL upper.CL
30 74.7 0.494 2996 73.7 75.6
40 69.6 0.368 2996 68.9 70.3
Results are averaged over the levels of: sex
Confidence level used: 0.95
age emmean SE df t.ratio p.value
30 74.7 0.494 2996 151.138 <.0001
40 69.6 0.368 2996 189.185 <.0001
Results are averaged over the levels of: sex
age distance emmean SE df lower.CL upper.CL
30 10 75.0 0.503 2996 74.0 75.9
40 10 69.9 0.381 2996 69.2 70.7
30 100 74.4 0.501 2996 73.4 75.4
40 100 69.4 0.377 2996 68.6 70.1
Results are averaged over the levels of: sex
Confidence level used: 0.95
age sex emmean SE df lower.CL upper.CL
30 Male 67.7 0.560 2996 66.6 68.8
40 Male 62.6 0.537 2996 61.6 63.7
30 Female 81.7 0.690 2996 80.3 83.0
40 Female 76.6 0.533 2996 75.6 77.7
Confidence level used: 0.95
Note the particular syntax for specifying particular values - a quoted string containing the name of the variable, followed by a list in square brackets of comma-separated values (a single value would be just [4]). Whitespace is ignored.
Call:
lm(formula = y ~ sex + age + distance, data = m)
Residuals:
Min 1Q Median 3Q Max
-72.356 -13.792 -0.275 13.624 74.950
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 83.138023 1.327228 62.640 < 2e-16 ***
sexFemale 13.987531 0.776854 18.005 < 2e-16 ***
age -0.503826 0.033654 -14.971 < 2e-16 ***
distance -0.006073 0.002029 -2.993 0.00279 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.15 on 2996 degrees of freedom
Multiple R-squared: 0.1255, Adjusted R-squared: 0.1246
F-statistic: 143.3 on 3 and 2996 DF, p-value: < 2.2e-16
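Sketches of the ggeffects calls, using that bracketed-value syntax:

ggeffect(mod1, "age [30,40]")
ggeffect(mod1, c("age [30,40]", "distance [10,100]"))
ggeffect(mod1, c("sex", "age [30,40]"))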
# Predicted values of y
age | Predicted | 95% CI
------------------------------
30 | 74.67 | 73.70, 75.64
40 | 69.63 | 68.91, 70.35
# Predicted values of y
distance: 10
age | Predicted | 95% CI
------------------------------
30 | 74.97 | 73.98, 75.95
40 | 69.93 | 69.18, 70.67
distance: 100
age | Predicted | 95% CI
------------------------------
30 | 74.42 | 73.44, 75.40
40 | 69.38 | 68.64, 70.12
# Predicted values of y
age: 30
sex | Predicted | 95% CI
---------------------------------
Male | 67.67 | 66.57, 68.77
Female | 81.66 | 80.30, 83.01
age: 40
sex | Predicted | 95% CI
---------------------------------
Male | 62.63 | 61.58, 63.68
Female | 76.62 | 75.57, 77.66
Call:
lm(formula = y ~ sex + age + distance, data = m)
Residuals:
Min 1Q Median 3Q Max
-72.356 -13.792 -0.275 13.624 74.950
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 83.138023 1.327228 62.640 < 2e-16 ***
sexFemale 13.987531 0.776854 18.005 < 2e-16 ***
age -0.503826 0.033654 -14.971 < 2e-16 ***
distance -0.006073 0.002029 -2.993 0.00279 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.15 on 2996 degrees of freedom
Multiple R-squared: 0.1255, Adjusted R-squared: 0.1246
F-statistic: 143.3 on 3 and 2996 DF, p-value: < 2.2e-16
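The call producing the output below is not shown in this rendering; per the note that follows it, it is along these lines:

predictions(mod1, by = "age", newdata = datagridcf(age = c(30, 40)))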
age Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
30 74.7 0.494 151 <0.001 Inf 73.7 75.6
40 69.6 0.368 189 <0.001 Inf 68.9 70.4
Columns: age, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Type: response
Note the inclusion of age twice, once in the by= argument and once inside the newdata= argument. Excluding either of those will produce unexpected results. Also, the datagridcf() function produces a counterfactual set of data similar to how Stata works. Using datagrid() here would not work.
predictions(mod1, by = c("age", "distance"),
newdata = datagridcf(age = c(30, 40),
distance = c(10, 100)))
age distance Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
30 10 75.0 0.504 149 <0.001 Inf 74.0 76.0
30 100 74.4 0.501 148 <0.001 Inf 73.4 75.4
40 10 69.9 0.381 184 <0.001 Inf 69.2 70.7
40 100 69.4 0.377 184 <0.001 Inf 68.6 70.1
Columns: age, distance, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Type: response
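The final output in this set instead holds distance at its mean rather than using counterfactual data; a sketch of such a call:

predictions(mod1, newdata = datagrid(age = c(30, 40), sex = unique))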
age sex Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 % distance
30 Male 67.7 0.560 121 <0.001 Inf 66.6 68.8 58.6
30 Female 81.7 0.690 118 <0.001 Inf 80.3 83.0 58.6
40 Male 62.6 0.537 117 <0.001 Inf 61.6 63.7 58.6
40 Female 76.6 0.533 144 <0.001 Inf 75.6 77.7 58.6
Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, y, distance, age, sex
Type: response
Consider a model with a single continuous predictor, plus a control variable (which may be continuous or categorical).
\[ E(Y|X, Z) = \beta_0 + \beta_1X + \beta_2Z. \]
For any given value of \(X\), it is easy to compute the marginal mean for a given individual. For example,
\[ E(Y|X = 2, Z) = \beta_0 + 2\beta_1 + \beta_2Z. \]
Therefore we can easily compute \(E(y_i|x = 2, z = z_i)\) for each individual (the predicted outcome if each individual had a value of \(X = 2\), with their other values at the observed value) and average them.
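A brute-force sketch of this averaging, using the y ~ sex + age + distance model from above (assumed stored as mod1):

m30 <- m
m30$age <- 30                       # give every observation age 30
mean(predict(mod1, newdata = m30))  # ~74.67, matching the margins above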
. regress y i.sex##c.age
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(3, 2996) = 139.91
Model | 170983.675 3 56994.5583 Prob > F = 0.0000
Residual | 1220449.33 2,996 407.35959 R-squared = 0.1229
-------------+---------------------------------- Adj R-squared = 0.1220
Total | 1391433.01 2,999 463.965657 Root MSE = 20.183
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
sex |
Female | 14.92308 2.789012 5.35 0.000 9.454508 20.39165
age | -.4929608 .0480944 -10.25 0.000 -.5872622 -.3986595
|
sex#c.age |
Female | -.0224116 .0674167 -0.33 0.740 -.1545994 .1097762
|
_cons | 82.36936 1.812958 45.43 0.000 78.8146 85.92413
------------------------------------------------------------------------------
. margins, at(age = (30 40))
Predictive margins Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
1._at: age = 30
2._at: age = 40
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
_at |
1 | 74.71541 .5089298 146.81 0.000 73.71752 75.71329
2 | 69.67359 .3890273 179.10 0.000 68.9108 70.43638
------------------------------------------------------------------------------
. margins sex, at(age = (30 40))
Adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
1._at: age = 30
2._at: age = 40
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
_at#sex |
1#Male | 67.58054 .5984013 112.94 0.000 66.40722 68.75386
1#Female | 81.83127 .8228617 99.45 0.000 80.21784 83.4447
2#Male | 62.65093 .5541361 113.06 0.000 61.56441 63.73746
2#Female | 76.67755 .5461908 140.39 0.000 75.6066 77.74849
------------------------------------------------------------------------------
Call:
lm(formula = y ~ sex * age, data = m)
Residuals:
Min 1Q Median 3Q Max
-71.817 -13.848 -0.116 13.758 75.263
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.36936 1.81296 45.434 < 2e-16 ***
sexFemale 14.92308 2.78901 5.351 9.42e-08 ***
age -0.49296 0.04809 -10.250 < 2e-16 ***
sexFemale:age -0.02241 0.06742 -0.332 0.74
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.18 on 2996 degrees of freedom
Multiple R-squared: 0.1229, Adjusted R-squared: 0.122
F-statistic: 139.9 on 3 and 2996 DF, p-value: < 2.2e-16
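Sketches of the emmeans calls for this interaction model (assumed stored as mod1):

mod1 <- lm(y ~ sex * age, data = m)
emmeans(mod1, "age", at = list(age = c(30, 40)))        # averaged over sex
test(emmeans(mod1, "age", at = list(age = c(30, 40))))  # with p-values
test(emmeans(mod1, c("age", "sex"), at = list(age = c(30, 40))))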
age emmean SE df lower.CL upper.CL
30 74.7 0.509 2996 73.7 75.7
40 69.7 0.389 2996 68.9 70.4
Results are averaged over the levels of: sex
Confidence level used: 0.95
age emmean SE df t.ratio p.value
30 74.7 0.509 2996 146.851 <.0001
40 69.7 0.389 2996 179.070 <.0001
Results are averaged over the levels of: sex
age sex emmean SE df t.ratio p.value
30 Male 67.6 0.598 2996 112.935 <.0001
40 Male 62.7 0.554 2996 113.061 <.0001
30 Female 81.8 0.823 2996 99.447 <.0001
40 Female 76.7 0.546 2996 140.386 <.0001
Call:
lm(formula = y ~ sex * age, data = m)
Residuals:
Min 1Q Median 3Q Max
-71.817 -13.848 -0.116 13.758 75.263
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.36936 1.81296 45.434 < 2e-16 ***
sexFemale 14.92308 2.78901 5.351 9.42e-08 ***
age -0.49296 0.04809 -10.250 < 2e-16 ***
sexFemale:age -0.02241 0.06742 -0.332 0.74
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.18 on 2996 degrees of freedom
Multiple R-squared: 0.1229, Adjusted R-squared: 0.122
F-statistic: 139.9 on 3 and 2996 DF, p-value: < 2.2e-16
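The corresponding ggeffects sketches:

ggeffect(mod1, "age [30,40]")
ggeffect(mod1, c("sex", "age [30,40]"))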
# Predicted values of y
age | Predicted | 95% CI
------------------------------
30 | 74.72 | 73.72, 75.71
40 | 69.67 | 68.91, 70.44
# Predicted values of y
age: 30
sex | Predicted | 95% CI
---------------------------------
Male | 67.58 | 66.41, 68.75
Female | 81.83 | 80.22, 83.44
age: 40
sex | Predicted | 95% CI
---------------------------------
Male | 62.65 | 61.56, 63.74
Female | 76.68 | 75.61, 77.75
. regress y i.sex c.age
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(2, 2997) = 209.88
Model | 170938.657 2 85469.3283 Prob > F = 0.0000
Residual | 1220494.35 2,997 407.238688 R-squared = 0.1229
-------------+---------------------------------- Adj R-squared = 0.1223
Total | 1391433.01 2,999 463.965657 Root MSE = 20.18
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
sex |
Female | 14.03271 .7777377 18.04 0.000 12.50775 15.55766
age | -.5043666 .033698 -14.97 0.000 -.5704402 -.4382931
_cons | 82.78115 1.323613 62.54 0.000 80.18586 85.37643
------------------------------------------------------------------------------
. margins, dydx(age)
Average marginal effects Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
dy/dx wrt: age
------------------------------------------------------------------------------
| Delta-method
| dy/dx std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | -.5043666 .033698 -14.97 0.000 -.5704402 -.4382931
------------------------------------------------------------------------------
Call:
lm(formula = y ~ sex + age, data = m)
Residuals:
Min 1Q Median 3Q Max
-71.99 -13.88 -0.15 13.76 75.28
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.7811 1.3236 62.54 <2e-16 ***
sexFemale 14.0327 0.7777 18.04 <2e-16 ***
age -0.5044 0.0337 -14.97 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.18 on 2997 degrees of freedom
Multiple R-squared: 0.1229, Adjusted R-squared: 0.1223
F-statistic: 209.9 on 2 and 2997 DF, p-value: < 2.2e-16
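The marginal slope below comes from emtrends(); a sketch (model assumed stored as mod1):

mod1 <- lm(y ~ sex + age, data = m)
emtrends(mod1, ~1, var = "age")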
1 age.trend SE df lower.CL upper.CL
overall -0.504 0.0337 2997 -0.57 -0.438
Results are averaged over the levels of: sex
Confidence level used: 0.95
To attach a p-value, wrap the call in the test function:
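That is, a sketch:

test(emtrends(mod1, ~1, var = "age"))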
1 age.trend SE df t.ratio p.value
overall -0.504 0.0337 2997 -14.967 <.0001
Results are averaged over the levels of: sex
The Stata margins command to obtain marginal means includes the “dydx” option, whose name points to a derivative - and indeed, this is exactly what is computed. If we have a basic regression model,
\[ E(Y|X) = \beta_0 + \beta_1X \]
taking the derivative with respect to \(X\) will obtain \(\beta_1\), which is the estimated coefficient.
If \(X\) enters the model in a more complex way, say a polynomial term:
\[ E(Y|X) = \beta_0 + \beta_1X + \beta_2X^2 \]
now the derivative is \(\beta_1 + 2\beta_2X\). Similar to marginal means, where the predicted outcome was estimated for each individual then those outcomes averaged, here the derivative is estimated plugging in each observation’s value of \(X\), then averaged.
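A sketch of that plug-in-and-average computation, using a hypothetical quadratic model (no such model is fit elsewhere in this document):

mq <- lm(y ~ age + I(age^2), data = m)
b  <- coef(mq)
# the derivative beta1 + 2*beta2*x at each observation's age, then averaged:
mean(b[["age"]] + 2 * b[["I(age^2)"]] * m$age)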
. regress y i.sex##c.age
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(3, 2996) = 139.91
Model | 170983.675 3 56994.5583 Prob > F = 0.0000
Residual | 1220449.33 2,996 407.35959 R-squared = 0.1229
-------------+---------------------------------- Adj R-squared = 0.1220
Total | 1391433.01 2,999 463.965657 Root MSE = 20.183
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
sex |
Female | 14.92308 2.789012 5.35 0.000 9.454508 20.39165
age | -.4929608 .0480944 -10.25 0.000 -.5872622 -.3986595
|
sex#c.age |
Female | -.0224116 .0674167 -0.33 0.740 -.1545994 .1097762
|
_cons | 82.36936 1.812958 45.43 0.000 78.8146 85.92413
------------------------------------------------------------------------------
. margins sex, dydx(age)
Average marginal effects Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
dy/dx wrt: age
------------------------------------------------------------------------------
| Delta-method
| dy/dx std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age |
sex |
Male | -.4929608 .0480944 -10.25 0.000 -.5872622 -.3986595
Female | -.5153724 .0472435 -10.91 0.000 -.6080054 -.4227395
------------------------------------------------------------------------------
Call:
lm(formula = y ~ sex * age, data = m)
Residuals:
Min 1Q Median 3Q Max
-71.817 -13.848 -0.116 13.758 75.263
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.36936 1.81296 45.434 < 2e-16 ***
sexFemale 14.92308 2.78901 5.351 9.42e-08 ***
age -0.49296 0.04809 -10.250 < 2e-16 ***
sexFemale:age -0.02241 0.06742 -0.332 0.74
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.18 on 2996 degrees of freedom
Multiple R-squared: 0.1229, Adjusted R-squared: 0.122
F-statistic: 139.9 on 3 and 2996 DF, p-value: < 2.2e-16
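A sketch of the emtrends() calls giving the within-sex slopes below (model assumed stored as mod1):

mod1 <- lm(y ~ sex * age, data = m)
emtrends(mod1, "sex", var = "age")        # slope on age within each sex, with CIs
test(emtrends(mod1, "sex", var = "age"))  # the same, with p-values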
sex age.trend SE df lower.CL upper.CL
Male -0.493 0.0481 2996 -0.587 -0.399
Female -0.515 0.0472 2996 -0.608 -0.423
Confidence level used: 0.95
sex age.trend SE df t.ratio p.value
Male -0.493 0.0481 2996 -10.250 <.0001
Female -0.515 0.0472 2996 -10.909 <.0001
Let’s assume we have a binary variable, \(Z\), interacted with a continuous variable, \(X\).
\[ E(Y|X,Z) = \beta_0 + \beta_1X + \beta_2Z + \beta_3XZ \]
Here we’re combining the math from marginal effects with marginal slopes. First, we generate two equations, one for \(Z = 0\) and one for \(Z = 1\):
\[\begin{align*} E(Y|X, Z = 0) & = \beta_0 + \beta_1X \\ E(Y|X, Z = 1) & = \beta_0 + \beta_1X + \beta_2 + \beta_3X = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)X. \end{align*}\]
Then when we differentiate each with respect to \(X\), we obtain \(\beta_1\) and \((\beta_1 + \beta_3)\) respectively.
. regress y i.class
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(2, 2997) = 15.48
Model | 14229.0349 2 7114.51743 Prob > F = 0.0000
Residual | 1377203.97 2,997 459.527518 R-squared = 0.0102
-------------+---------------------------------- Adj R-squared = 0.0096
Total | 1391433.01 2,999 463.965657 Root MSE = 21.437
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
class |
2 | .4084991 .8912269 0.46 0.647 -1.338979 2.155977
3 | 5.373138 1.027651 5.23 0.000 3.358165 7.38811
|
_cons | 68.35805 .6190791 110.42 0.000 67.14419 69.57191
------------------------------------------------------------------------------
. margins class, pwcompare(pv)
Pairwise comparisons of adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
-----------------------------------------------------
| Delta-method Unadjusted
| Contrast std. err. t P>|t|
-------------+---------------------------------------
class |
2 vs 1 | .4084991 .8912269 0.46 0.647
3 vs 1 | 5.373138 1.027651 5.23 0.000
3 vs 2 | 4.964638 1.041073 4.77 0.000
-----------------------------------------------------
Call:
lm(formula = y ~ class, data = m)
Residuals:
Min 1Q Median 3Q Max
-73.531 -14.758 0.101 14.536 72.569
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.3580 0.6191 110.419 < 2e-16 ***
class2 0.4085 0.8912 0.458 0.647
class3 5.3731 1.0277 5.229 1.83e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.44 on 2997 degrees of freedom
Multiple R-squared: 0.01023, Adjusted R-squared: 0.009566
F-statistic: 15.48 on 2 and 2997 DF, p-value: 2.045e-07
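A sketch of the pairwise-comparison call (the contrast labels in the output suggest something along these lines):

pairs(emmeans(mod1, "class"), adjust = "none")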
contrast estimate SE df t.ratio p.value
class1 - class2 -0.408 0.891 2997 -0.458 0.6467
class1 - class3 -5.373 1.028 2997 -5.229 <.0001
class2 - class3 -4.965 1.041 2997 -4.769 <.0001
The adjust = "none"
argument skips any multiple
comparisons correction. This is done to obtain the same results as the
regression coefficients. If you’d prefer a more ANOVA post-hoc approach,
there are other options for adjust
, see
?emmeans::contrast
for details.
. regress y i.class##i.sex
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(5, 2994) = 92.91
Model | 186898.199 5 37379.6398 Prob > F = 0.0000
Residual | 1204534.81 2,994 402.316235 R-squared = 0.1343
-------------+---------------------------------- Adj R-squared = 0.1329
Total | 1391433.01 2,999 463.965657 Root MSE = 20.058
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
class |
2 | 11.37444 1.573314 7.23 0.000 8.289552 14.45932
3 | 21.6188 1.588487 13.61 0.000 18.50416 24.73344
|
sex |
Female | 21.62507 1.509999 14.32 0.000 18.66433 24.58581
|
class#sex |
2#Female | -4.851582 1.942744 -2.50 0.013 -8.66083 -1.042333
3#Female | -6.084875 3.004638 -2.03 0.043 -11.97624 -.1935115
|
_cons | 50.6107 1.367932 37.00 0.000 47.92852 53.29288
------------------------------------------------------------------------------
. margins class#sex, pwcompare(pv)
Pairwise comparisons of adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
------------------------------------------------------------------
| Delta-method Unadjusted
| Contrast std. err. t P>|t|
--------------------------+---------------------------------------
class#sex |
(1#Female) vs (1#Male) | 21.62507 1.509999 14.32 0.000
(2#Male) vs (1#Male) | 11.37444 1.573314 7.23 0.000
(2#Female) vs (1#Male) | 28.14793 1.661722 16.94 0.000
(3#Male) vs (1#Male) | 21.6188 1.588487 13.61 0.000
(3#Female) vs (1#Male) | 37.159 2.822577 13.16 0.000
(2#Male) vs (1#Female) | -10.25064 1.006447 -10.18 0.000
(2#Female) vs (1#Female) | 6.522856 1.13971 5.72 0.000
(3#Male) vs (1#Female) | -.0062748 1.030005 -0.01 0.995
(3#Female) vs (1#Female) | 15.53392 2.550404 6.09 0.000
(2#Female) vs (2#Male) | 16.77349 1.222358 13.72 0.000
(3#Male) vs (2#Male) | 10.24436 1.120772 9.14 0.000
(3#Female) vs (2#Male) | 25.78456 2.588393 9.96 0.000
(3#Male) vs (2#Female) | -6.529131 1.241826 -5.26 0.000
(3#Female) vs (2#Female) | 9.011069 2.643063 3.41 0.001
(3#Female) vs (3#Male) | 15.5402 2.597644 5.98 0.000
------------------------------------------------------------------
This produces a lot of comparisons; if we only cared about the comparisons one way (e.g. gender differences within each class), we can restrict our output to these results.
. margins sex@class, contrast(pv nowald)
Contrasts of adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
------------------------------------------------------------
| Delta-method
| Contrast std. err. t P>|t|
--------------------+---------------------------------------
sex@class |
(Female vs base) 1 | 21.62507 1.509999 14.32 0.000
(Female vs base) 2 | 16.77349 1.222358 13.72 0.000
(Female vs base) 3 | 15.5402 2.597644 5.98 0.000
------------------------------------------------------------
Note that the results are identical; this is just nicer output.
The pv option requests p-values instead of confidence intervals; the nowald option suppresses an omnibus test of any difference between classes (e.g. ANOVA-style output).
Call:
lm(formula = y ~ class * sex, data = m)
Residuals:
Min 1Q Median 3Q Max
-72.029 -13.285 0.464 13.741 67.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.611 1.368 36.998 < 2e-16 ***
class2 11.374 1.573 7.230 6.12e-13 ***
class3 21.619 1.588 13.610 < 2e-16 ***
sexFemale 21.625 1.510 14.321 < 2e-16 ***
class2:sexFemale -4.852 1.943 -2.497 0.0126 *
class3:sexFemale -6.085 3.005 -2.025 0.0429 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.06 on 2994 degrees of freedom
Multiple R-squared: 0.1343, Adjusted R-squared: 0.1329
F-statistic: 92.91 on 5 and 2994 DF, p-value: < 2.2e-16
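A sketch of the all-pairs comparison for the interaction model (assumed stored as mod1):

mod1 <- lm(y ~ class * sex, data = m)
pairs(emmeans(mod1, c("sex", "class")), adjust = "none")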
contrast estimate SE df t.ratio p.value
Male class1 - Female class1 -21.62507 1.51 2994 -14.321 <.0001
Male class1 - Male class2 -11.37444 1.57 2994 -7.230 <.0001
Male class1 - Female class2 -28.14793 1.66 2994 -16.939 <.0001
Male class1 - Male class3 -21.61880 1.59 2994 -13.610 <.0001
Male class1 - Female class3 -37.15900 2.82 2994 -13.165 <.0001
Female class1 - Male class2 10.25064 1.01 2994 10.185 <.0001
Female class1 - Female class2 -6.52286 1.14 2994 -5.723 <.0001
Female class1 - Male class3 0.00627 1.03 2994 0.006 0.9951
Female class1 - Female class3 -15.53392 2.55 2994 -6.091 <.0001
Male class2 - Female class2 -16.77349 1.22 2994 -13.722 <.0001
Male class2 - Male class3 -10.24436 1.12 2994 -9.140 <.0001
Male class2 - Female class3 -25.78456 2.59 2994 -9.962 <.0001
Female class2 - Male class3 6.52913 1.24 2994 5.258 <.0001
Female class2 - Female class3 -9.01107 2.64 2994 -3.409 0.0007
Male class3 - Female class3 -15.54020 2.60 2994 -5.982 <.0001
The adjust = "none"
argument skips any multiple
comparisons correction. This is done to obtain the same results as the
regression coefficients. If you’d prefer a more ANOVA post-hoc approach,
there are other options for adjust
, see
?emmeans::contrast
for details.
This produces a lot of comparisons; if we only cared about the comparisons one way (e.g. gender differences within each class), we can restrict our output to these results.
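A sketch of restricting to within-class sex comparisons:

pairs(emmeans(mod1, ~ sex | class), adjust = "none")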
class = 1:
contrast estimate SE df t.ratio p.value
Male - Female -21.6 1.51 2994 -14.321 <.0001
class = 2:
contrast estimate SE df t.ratio p.value
Male - Female -16.8 1.22 2994 -13.722 <.0001
class = 3:
contrast estimate SE df t.ratio p.value
Male - Female -15.5 2.60 2994 -5.982 <.0001
Note that the results are identical; this is just nicer output.
. regress y i.class##c.age
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(5, 2994) = 21.15
Model | 47475.1592 5 9495.03183 Prob > F = 0.0000
Residual | 1343957.85 2,994 448.883716 R-squared = 0.0341
-------------+---------------------------------- Adj R-squared = 0.0325
Total | 1391433.01 2,999 463.965657 Root MSE = 21.187
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
class |
2 | -13.19784 3.880448 -3.40 0.001 -20.80646 -5.589225
3 | -11.14087 4.189539 -2.66 0.008 -19.35553 -2.926199
|
age | -.4751707 .0630779 -7.53 0.000 -.5988511 -.3514903
|
class#c.age |
2 | .2484541 .0887425 2.80 0.005 .0744516 .4224565
3 | .2903105 .1107395 2.62 0.009 .0731774 .5074437
|
_cons | 90.55752 3.009782 30.09 0.000 84.65606 96.45897
------------------------------------------------------------------------------
. margins class, dydx(age) pwcompare(pv)
Pairwise comparisons of average marginal effects
Model VCE: OLS Number of obs = 3,000
Expression: Linear prediction, predict()
dy/dx wrt: age
-----------------------------------------------------
| Contrast Delta-method Unadjusted
| dy/dx std. err. t P>|t|
-------------+---------------------------------------
age |
class |
2 vs 1 | .2484541 .0887425 2.80 0.005
3 vs 1 | .2903105 .1107395 2.62 0.009
3 vs 2 | .0418565 .1103668 0.38 0.705
-----------------------------------------------------
Call:
lm(formula = y ~ class * age, data = m)
Residuals:
Min 1Q Median 3Q Max
-75.335 -14.841 -0.036 14.469 73.169
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 90.55752 3.00978 30.088 < 2e-16 ***
class2 -13.19784 3.88045 -3.401 0.00068 ***
class3 -11.14087 4.18954 -2.659 0.00787 **
age -0.47517 0.06308 -7.533 6.52e-14 ***
class2:age 0.24845 0.08874 2.800 0.00515 **
class3:age 0.29031 0.11074 2.622 0.00880 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.19 on 2994 degrees of freedom
Multiple R-squared: 0.03412, Adjusted R-squared: 0.03251
F-statistic: 21.15 on 5 and 2994 DF, p-value: < 2.2e-16
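A sketch comparing the within-class slopes on age (model assumed stored as mod1):

mod1 <- lm(y ~ class * age, data = m)
pairs(emtrends(mod1, "class", var = "age"), adjust = "none")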
contrast estimate SE df t.ratio p.value
class1 - class2 -0.2485 0.0887 2994 -2.800 0.0051
class1 - class3 -0.2903 0.1107 2994 -2.622 0.0088
class2 - class3 -0.0419 0.1104 2994 -0.379 0.7045
. regress y i.class##i.sex
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(5, 2994) = 92.91
Model | 186898.199 5 37379.6398 Prob > F = 0.0000
Residual | 1204534.81 2,994 402.316235 R-squared = 0.1343
-------------+---------------------------------- Adj R-squared = 0.1329
Total | 1391433.01 2,999 463.965657 Root MSE = 20.058
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
class |
2 | 11.37444 1.573314 7.23 0.000 8.289552 14.45932
3 | 21.6188 1.588487 13.61 0.000 18.50416 24.73344
|
sex |
Female | 21.62507 1.509999 14.32 0.000 18.66433 24.58581
|
class#sex |
2#Female | -4.851582 1.942744 -2.50 0.013 -8.66083 -1.042333
3#Female | -6.084875 3.004638 -2.03 0.043 -11.97624 -.1935115
|
_cons | 50.6107 1.367932 37.00 0.000 47.92852 53.29288
------------------------------------------------------------------------------
. margins sex@class, contrast(pv nowald)
Contrasts of adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
------------------------------------------------------------
| Delta-method
| Contrast std. err. t P>|t|
--------------------+---------------------------------------
sex@class |
(Female vs base) 1 | 21.62507 1.509999 14.32 0.000
(Female vs base) 2 | 16.77349 1.222358 13.72 0.000
(Female vs base) 3 | 15.5402 2.597644 5.98 0.000
------------------------------------------------------------
. margins class, dydx(sex) pwcompare(pv)
Pairwise comparisons of conditional marginal effects
Model VCE: OLS Number of obs = 3,000
Expression: Linear prediction, predict()
dy/dx wrt: 1.sex
-----------------------------------------------------
| Contrast Delta-method Unadjusted
| dy/dx std. err. t P>|t|
-------------+---------------------------------------
0.sex | (base outcome)
-------------+---------------------------------------
1.sex |
class |
2 vs 1 | -4.851582 1.942744 -2.50 0.013
3 vs 1 | -6.084875 3.004638 -2.03 0.043
3 vs 2 | -1.233294 2.870873 -0.43 0.668
-----------------------------------------------------
Note: dy/dx for factor levels is the discrete change
from the base level.
Note that this works because sex is binary, thus its coefficient (obtained via dydx) is for a 1-unit change, from 0 to 1. If you had a more complicated situation (e.g. there was a third gender category but you only wanted to compare male versus female), you could instead manually define your contrast.
. margins {sex +1 -1}#{class +1 -1 0}, contrast(pv nowald)
Contrasts of adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
-----------------------------------------------------
| Delta-method
| Contrast std. err. t P>|t|
-------------+---------------------------------------
sex#class |
(1) (1) | -4.851582 1.942744 -2.50 0.013
-----------------------------------------------------
. margins {sex +1 -1}#{class +1 -1 0} {sex +1 -1}#{class +1 0 -1} {sex +1 -1}#{
> class 0 +1 -1}, contrast(pv nowald)
Contrasts of adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
-----------------------------------------------------
| Delta-method
| Contrast std. err. t P>|t|
-------------+---------------------------------------
sex#class |
(1) (1) | -4.851582 1.942744 -2.50 0.013
(1) (2) | -6.084875 3.004638 -2.03 0.043
(1) (3) | -1.233294 2.870873 -0.43 0.668
-----------------------------------------------------
Here the {variable # # #} notation defines a contrast - the values correspond to a linear equation over the levels. E.g. if we wanted to test male = female, or male - female = 0, we could write this as (+1)male + (-1)female = 0, therefore the contrast is +1 -1 (in sex, male is the lower level, female the higher level). For class, we want to compare each pair of classes, so class1 = class2 is equivalent to (+1)class1 + (-1)class2 + (0)class3 = 0, whose contrast is +1 -1 0.
If you wanted a diff-in-diff-in-diff estimate (e.g. say you’ve got males and females, and some undergo each of treatment 0 and treatment 1, and you want to test whether the difference in pre-post changes amongst men is significantly different than the difference in pre-post changes amongst women), you can run something like margins sex#treatment, dydx(prepost) contrast(pv nowald).
Call:
lm(formula = y ~ class * sex, data = m)
Residuals:
Min 1Q Median 3Q Max
-72.029 -13.285 0.464 13.741 67.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.611 1.368 36.998 < 2e-16 ***
class2 11.374 1.573 7.230 6.12e-13 ***
class3 21.619 1.588 13.610 < 2e-16 ***
sexFemale 21.625 1.510 14.321 < 2e-16 ***
class2:sexFemale -4.852 1.943 -2.497 0.0126 *
class3:sexFemale -6.085 3.005 -2.025 0.0429 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.06 on 2994 degrees of freedom
Multiple R-squared: 0.1343, Adjusted R-squared: 0.1329
F-statistic: 92.91 on 5 and 2994 DF, p-value: < 2.2e-16
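Sketches of the emmeans calls producing the next two outputs: within-class sex differences, then comparisons of those differences across classes (model assumed stored as mod1):

mod1 <- lm(y ~ class * sex, data = m)
emm <- emmeans(mod1, ~ sex | class)
pairs(emm, adjust = "none")                    # sex difference within each class
pairs(pairs(emm), by = NULL, adjust = "none")  # differences of those differences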
class = 1:
contrast estimate SE df t.ratio p.value
Male - Female -21.6 1.51 2994 -14.321 <.0001
class = 2:
contrast estimate SE df t.ratio p.value
Male - Female -16.8 1.22 2994 -13.722 <.0001
class = 3:
contrast estimate SE df t.ratio p.value
Male - Female -15.5 2.60 2994 -5.982 <.0001
contrast                                        estimate   SE   df t.ratio p.value
(Male - Female class1) - (Male - Female class2)    -4.85 1.94 2994  -2.497  0.0126
(Male - Female class1) - (Male - Female class3)    -6.08 3.00 2994  -2.025  0.0429
(Male - Female class2) - (Male - Female class3)    -1.23 2.87 2994  -0.430  0.6675
. regress y i.class##c.age
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(5, 2994) = 21.15
Model | 47475.1592 5 9495.03183 Prob > F = 0.0000
Residual | 1343957.85 2,994 448.883716 R-squared = 0.0341
-------------+---------------------------------- Adj R-squared = 0.0325
Total | 1391433.01 2,999 463.965657 Root MSE = 21.187
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
class |
2 | -13.19784 3.880448 -3.40 0.001 -20.80646 -5.589225
3 | -11.14087 4.189539 -2.66 0.008 -19.35553 -2.926199
|
age | -.4751707 .0630779 -7.53 0.000 -.5988511 -.3514903
|
class#c.age |
2 | .2484541 .0887425 2.80 0.005 .0744516 .4224565
3 | .2903105 .1107395 2.62 0.009 .0731774 .5074437
|
_cons | 90.55752 3.009782 30.09 0.000 84.65606 96.45897
------------------------------------------------------------------------------
. margins class, at(age = (20 60))
Adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
1._at: age = 20
2._at: age = 60
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
_at#class |
1 1 | 81.0541 1.793005 45.21 0.000 77.53845 84.56975
1 2 | 72.82534 1.284642 56.69 0.000 70.30647 75.34421
1 3 | 75.71945 1.27105 59.57 0.000 73.22723 78.21167
2 1 | 62.04727 1.037397 59.81 0.000 60.01319 64.08136
2 2 | 63.75668 1.517933 42.00 0.000 60.78038 66.73298
2 3 | 68.32504 2.782515 24.56 0.000 62.86921 73.78088
------------------------------------------------------------------------------
. marginsplot, recastci(rarea) ciopts(fcolor(%25) lcolor(%25))
Variables that uniquely identify margins: age class
The recastci(rarea) option plots the confidence region, rather than just confidence intervals at each estimated point (to more closely mimic R’s output). The ciopts() option allows you to pass options for the confidence intervals; these particular options just make the confidence region (and its outline) transparent.
Due to this being a linear model, as well as age entering linearly (as opposed to, say, quadratically), two points are sufficient to fit a line. If you fit a non-linear model (such as logistic regression) or allowed age to enter non-linearly, you may want to add more points to the plot. There is a trade-off between speed (more points take longer to estimate) and the smoothness of the resulting curve.
. regress y i.class##c.age##c.age
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(8, 2991) = 14.96
Model | 53538.5729 8 6692.32162 Prob > F = 0.0000
Residual | 1337894.43 2,991 447.306731 R-squared = 0.0385
-------------+---------------------------------- Adj R-squared = 0.0359
Total | 1391433.01 2,999 463.965657 Root MSE = 21.15
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
class |
2 | -22.72613 14.60206 -1.56 0.120 -51.35722 5.904968
3 | -5.253225 15.40844 -0.34 0.733 -35.46544 24.959
|
age | .384345 .536903 0.72 0.474 -.6683915 1.437081
|
class#c.age |
2 | .8892081 .726491 1.22 0.221 -.5352645 2.313681
3 | .1727145 .8296354 0.21 0.835 -1.453999 1.799428
|
c.age#c.age | -.0098622 .006118 -1.61 0.107 -.0218581 .0021337
|
class#c.age#|
c.age |
2 | -.009385 .00873 -1.08 0.282 -.0265025 .0077325
3 | -.0008709 .010928 -0.08 0.936 -.0222981 .0205564
|
_cons | 72.8557 11.38487 6.40 0.000 50.53273 95.17866
------------------------------------------------------------------------------
. margins class, at(age = (20(5)60))
Adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
1._at: age = 20
2._at: age = 25
3._at: age = 30
4._at: age = 35
5._at: age = 40
6._at: age = 45
7._at: age = 50
8._at: age = 55
9._at: age = 60
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
_at#class |
1 1 | 76.59771 3.293341 23.26 0.000 70.14026 83.05515
1 2 | 67.90175 2.045099 33.20 0.000 63.8918 71.91169
1 3 | 74.45042 1.660151 44.85 0.000 71.19527 77.70557
2 1 | 76.30043 2.102181 36.30 0.000 72.17856 80.4223
2 2 | 69.93889 1.169676 59.79 0.000 67.64544 72.23234
2 3 | 74.82077 .9637658 77.63 0.000 72.93106 76.71048
3 1 | 75.51005 1.312605 57.53 0.000 72.93634 78.08375
3 2 | 71.01367 .8150404 87.13 0.000 69.41557 72.61177
3 3 | 74.65447 1.047226 71.29 0.000 72.60111 76.70782
4 1 | 74.22655 .9758115 76.07 0.000 72.31322 76.13988
4 2 | 71.12609 .8578454 82.91 0.000 69.44406 72.80812
4 3 | 73.95151 1.233883 59.93 0.000 71.53216 76.37085
5 1 | 72.44994 .9291129 77.98 0.000 70.62818 74.27171
5 2 | 70.27615 .910904 77.15 0.000 68.49009 72.06221
5 3 | 72.7119 1.303433 55.78 0.000 70.15618 75.26761
6 1 | 70.18022 .8796416 79.78 0.000 68.45546 71.90499
6 2 | 68.46385 .8799818 77.80 0.000 66.73842 70.18928
6 3 | 70.93563 1.532553 46.29 0.000 67.93066 73.94059
7 1 | 67.41739 .7502806 89.86 0.000 65.94627 68.88851
7 2 | 65.68919 .989977 66.35 0.000 63.74809 67.6303
7 3 | 68.6227 2.329015 29.46 0.000 64.05607 73.18934
8 1 | 64.16145 .8193357 78.31 0.000 62.55493 65.76797
8 2 | 61.95217 1.561682 39.67 0.000 58.89009 65.01425
8 3 | 65.77313 3.756076 17.51 0.000 58.40837 73.13788
9 1 | 60.4124 1.44948 41.68 0.000 57.57032 63.25448
9 2 | 57.25279 2.593175 22.08 0.000 52.1682 62.33738
9 3 | 62.38689 5.728209 10.89 0.000 51.15527 73.61852
------------------------------------------------------------------------------
. marginsplot, recastci(rarea) ciopts(fcolor(%25) lcolor(%25)) recast(line)
Variables that uniquely identify margins: age class
The recast(line) option removes the points to make the plot a bit cleaner.
Call:
lm(formula = y ~ class * age, data = m)
Residuals:
Min 1Q Median 3Q Max
-75.335 -14.841 -0.036 14.469 73.169
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 90.55752 3.00978 30.088 < 2e-16 ***
class2 -13.19784 3.88045 -3.401 0.00068 ***
class3 -11.14087 4.18954 -2.659 0.00787 **
age -0.47517 0.06308 -7.533 6.52e-14 ***
class2:age 0.24845 0.08874 2.800 0.00515 **
class3:age 0.29031 0.11074 2.622 0.00880 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.19 on 2994 degrees of freedom
Multiple R-squared: 0.03412, Adjusted R-squared: 0.03251
F-statistic: 21.15 on 5 and 2994 DF, p-value: < 2.2e-16
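The lm output above is the linear class-by-age model fit in R. Assuming it is stored in an object I'm calling modc, the earlier margins table at ages 20 and 60 can be reproduced with emmeans:
library(emmeans)
# adjusted predictions for each class at ages 20 and 60,
# mirroring Stata's margins class, at(age = (20 60))
emmeans(modc, ~ class | age, at = list(age = c(20, 60)))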
Generate a variable indicating whether age is above 40 before fitting the model.
. generate above40 = age > 40
. regress y i.above40##c.age
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(3, 2996) = 33.97
Model | 45772.6752 3 15257.5584 Prob > F = 0.0000
Residual | 1345660.33 2,996 449.152313 R-squared = 0.0329
-------------+---------------------------------- Adj R-squared = 0.0319
Total | 1391433.01 2,999 463.965657 Root MSE = 21.193
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
1.above40 | 22.28839 5.698579 3.91 0.000 11.11487 33.46191
age | -.0686643 .0887806 -0.77 0.439 -.2427415 .1054129
|
above40#|
c.age |
1 | -.5400972 .1329699 -4.06 0.000 -.8008187 -.2793757
|
_cons | 74.76941 2.739267 27.30 0.000 69.39837 80.14044
------------------------------------------------------------------------------
. margins, at(age = (20 40) above40 = 0) ///
> at(age = (40 60) above40 = 1)
Adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
1._at: above40 = 0
age = 20
2._at: above40 = 0
age = 40
3._at: above40 = 1
age = 40
4._at: above40 = 1
age = 60
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
_at |
1 | 73.39612 1.056808 69.45 0.000 71.32398 75.46826
2 | 72.02283 1.017694 70.77 0.000 70.02738 74.01828
3 | 72.70734 1.150884 63.18 0.000 70.45073 74.96394
4 | 60.53211 1.12271 53.92 0.000 58.33074 62.73347
------------------------------------------------------------------------------
. marginsplot, recastci(rarea) ciopts(fcolor(%25) lcolor(%25)) ///
> legend(off)
Variables that uniquely identify margins: age _atopt
Multiple at() options specified:
_atoption=1: age = (20 40) above40 = 0
_atoption=2: age = (40 60) above40 = 1
Note the use of two separate at() options to ensure that each trend is plotted only over the appropriate age range. The legend is suppressed since the grouping is obvious in this situation. For a full diff-in-diff plot, simply add the additional term as usual.
. regress y i.class##i.above40##c.age
Source | SS df MS Number of obs = 3,000
-------------+---------------------------------- F(11, 2988) = 10.96
Model | 53973.7541 11 4906.70491 Prob > F = 0.0000
Residual | 1337459.25 2,988 447.610191 R-squared = 0.0388
-------------+---------------------------------- Adj R-squared = 0.0353
Total | 1391433.01 2,999 463.965657 Root MSE = 21.157
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
class |
2 | -10.88013 8.740776 -1.24 0.213 -28.01868 6.258421
3 | -3.526033 8.562967 -0.41 0.681 -20.31594 13.26387
|
1.above40 | 17.168 10.00685 1.72 0.086 -2.453016 36.78902
|
class#|
above40 |
2 1 | 16.39086 14.46734 1.13 0.257 -11.9761 44.75782
3 1 | -7.68618 22.34638 -0.34 0.731 -51.50204 36.12968
|
age | -.1626578 .220437 -0.74 0.461 -.5948815 .2695659
|
class#c.age |
2 | .1987596 .2644373 0.75 0.452 -.319738 .7172571
3 | .0879317 .2661492 0.33 0.741 -.4339225 .609786
|
above40#|
c.age |
1 | -.4411289 .2555264 -1.73 0.084 -.9421543 .0598966
|
class#|
above40#|
c.age |
2 1 | -.3434769 .3518112 -0.98 0.329 -1.033294 .3463398
3 1 | .1697788 .5123409 0.33 0.740 -.8347979 1.174356
|
_cons | 80.04815 7.439131 10.76 0.000 65.46182 94.63449
------------------------------------------------------------------------------
. margins class, at(age = (20 40) above40 = 0) ///
> at(age = (40 60) above40 = 1)
Adjusted predictions Number of obs = 3,000
Model VCE: OLS
Expression: Linear prediction, predict()
1._at: above40 = 0
age = 20
2._at: above40 = 0
age = 40
3._at: above40 = 1
age = 40
4._at: above40 = 1
age = 60
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
_at#class |
1 1 | 76.795 3.167156 24.25 0.000 70.58497 83.00502
1 2 | 69.89006 1.791452 39.01 0.000 66.37746 73.40267
1 3 | 75.0276 1.461485 51.34 0.000 72.16198 77.89322
2 1 | 73.54184 1.896068 38.79 0.000 69.82411 77.25957
2 2 | 70.6121 1.557486 45.34 0.000 67.55824 73.66595
2 3 | 73.53308 2.020516 36.39 0.000 69.57133 77.49482
3 1 | 73.06469 1.647357 44.35 0.000 69.83462 76.29476
3 2 | 72.78672 1.908052 38.15 0.000 69.04549 76.52795
3 3 | 72.1609 3.371785 21.40 0.000 65.54964 78.77215
4 1 | 60.98895 1.309189 46.59 0.000 58.42195 63.55596
4 2 | 57.81664 2.445232 23.64 0.000 53.02213 62.61115
4 3 | 65.23937 6.031687 10.82 0.000 53.41269 77.06605
------------------------------------------------------------------------------
. marginsplot, recast(line) noci ///
> plot1opts(lcolor("red")) ///
> plot2opts(lcolor("blue")) ///
> plot3opts(lcolor("red") lpattern("dash")) ///
> plot4opts(lcolor("blue") lpattern("dash")) ///
> plot5opts(lcolor("red") lpattern("dash_dot")) ///
> plot6opts(lcolor("blue") lpattern("dash_dot")) ///
> legend(order(1 "Class 1" 3 "Class 2" 5 "Class 3"))
Variables that uniquely identify margins: age class _atopt
Multiple at() options specified:
_atoption=1: age = (20 40) above40 = 0
_atoption=2: age = (40 60) above40 = 1
The plot requires a lot more configuration to look nice; I've even suppressed the confidence bounds to make it more readable.
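The R output below implies the following setup, sketched here to mirror Stata's generate above40 = age > 40 (mod1 is the name the later code uses):
# construct the indicator, then fit the interaction model
m$above40 <- m$age > 40
mod1 <- lm(y ~ above40 * age, data = m)
summary(mod1)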
Call:
lm(formula = y ~ above40 * age, data = m)
Residuals:
Min 1Q Median 3Q Max
-73.127 -14.764 0.256 14.611 73.865
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.76941 2.73927 27.295 < 2e-16 ***
above40TRUE 22.28839 5.69858 3.911 9.39e-05 ***
age -0.06866 0.08878 -0.773 0.439
above40TRUE:age -0.54010 0.13297 -4.062 4.99e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.19 on 2996 degrees of freedom
Multiple R-squared: 0.0329, Adjusted R-squared: 0.03193
F-statistic: 33.97 on 3 and 2996 DF, p-value: < 2.2e-16
I do not currently know of a way to do this entirely within emmeans without resorting to manual plotting. If you do know a way to do it, please submit an issue or otherwise reach out to me!
# estimate each segment only over its own age range
em_under <- emmeans(mod1, ~ age, at = list(age = c(20, 40), above40 = FALSE))
em_above <- emmeans(mod1, ~ age, at = list(age = c(40, 60), above40 = TRUE))
# combine the two grids; adjust = "none" keeps unadjusted CIs
# (rbind() on emmeans objects defaults to a Bonferroni adjustment)
em_combined <- data.frame(rbind(em_under, em_above, adjust = "none"))
em_combined$above40 <- c(FALSE, FALSE, TRUE, TRUE)
ggplot(em_combined, aes(x = age, y = emmean, color = above40, group = above40)) +
  geom_ribbon(aes(ymin = lower.CL, ymax = upper.CL, fill = above40), alpha = .2) +
  geom_line(lwd = 2) +
  scale_fill_manual(values = c("FALSE" = "blue", "TRUE" = "red")) +
  scale_color_manual(values = c("FALSE" = "blue", "TRUE" = "red")) +
  xlab("Age") + ylab("Prediction") + theme(legend.position = "none")
plot_predictions(mod1,
by = "age",
newdata = data.frame(age = c(20, 40, 40, 60),
above40 = c(FALSE, FALSE, TRUE, TRUE)))
The newdata argument is doing the heavy lifting here; the by argument specifies the x-axis. Note that you can also simply call predictions(mod1, newdata = ...) and plot the results manually, for maximum flexibility.
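A sketch of that manual route (the ggplot styling is mine; the estimate/conf.low/conf.high column names are those used by current marginaleffects versions):
library(ggplot2)
# compute predictions on the restricted grid, then plot by hand
p <- predictions(mod1,
                 newdata = data.frame(age = c(20, 40, 40, 60),
                                      above40 = c(FALSE, FALSE, TRUE, TRUE)))
ggplot(p, aes(x = age, y = estimate, group = above40)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .2) +
  geom_line()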
. logit yc i.class, nolog or
Logistic regression Number of obs = 3,000
LR chi2(2) = 12.39
Prob > chi2 = 0.0020
Log likelihood = -1651.6397 Pseudo R2 = 0.0037
------------------------------------------------------------------------------
yc | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
class |
2 | .8231057 .0789757 -2.03 0.042 .6819996 .9934067
3 | .6747704 .0779699 -3.40 0.001 .5380213 .846277
|
_cons | .3718535 .0241593 -15.23 0.000 .327393 .4223519
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
Below, the references to each coefficient (e.g. _b[2.class]) were obtained by calling logit, coeflegend after the original logit call.
. lincom _b[_cons], or
( 1) [yc]_cons = 0
------------------------------------------------------------------------------
yc | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
(1) | .3718535 .0241593 -15.23 0.000 .327393 .4223519
------------------------------------------------------------------------------
. lincom _b[_cons] + _b[2.class], or
( 1) [yc]2.class + [yc]_cons = 0
------------------------------------------------------------------------------
yc | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
(1) | .3060748 .0216103 -16.77 0.000 .2665193 .3515008
------------------------------------------------------------------------------
. lincom _b[_cons] + _b[3.class], or
( 1) [yc]3.class + [yc]_cons = 0
------------------------------------------------------------------------------
yc | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
(1) | .2509158 .0239763 -14.47 0.000 .2080613 .302597
------------------------------------------------------------------------------
. lincom _b[2.class], or
( 1) [yc]2.class = 0
------------------------------------------------------------------------------
yc | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
(1) | .8231057 .0789757 -2.03 0.042 .6819996 .9934067
------------------------------------------------------------------------------
. lincom _b[3.class], or
( 1) [yc]3.class = 0
------------------------------------------------------------------------------
yc | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
(1) | .6747704 .0779699 -3.40 0.001 .5380213 .846277
------------------------------------------------------------------------------
. lincom _b[3.class] - _b[2.class], or
( 1) - [yc]2.class + [yc]3.class = 0
------------------------------------------------------------------------------
yc | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
(1) | .8197858 .0973987 -1.67 0.094 .6494852 1.034741
------------------------------------------------------------------------------
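The same quantities can be recovered in R. A sketch, assuming the model is refit with glm (note that confint() on a glm gives profile-likelihood intervals, so they will differ slightly from Stata's Wald intervals):
fitl <- glm(yc ~ class, family = binomial, data = m)
# baseline odds (intercept row) and odds ratios relative to class 1
exp(cbind(OR = coef(fitl), confint(fitl)))
# pairwise odds ratios among all classes; back-transforming contrasts
# of log-odds with emmeans yields odds ratios
library(emmeans)
summary(pairs(emmeans(fitl, ~ class), reverse = TRUE), type = "response", infer = TRUE)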