When can Poisson Regression approximate Logistic?
An older idea in Epidemiology is to use a Poisson regression model in place of a logistic regression model. This idea has some validity because with a low mean, the Poisson distribution approximates the binary distribution.
Simulation
Let’s examine this. First, we define a program which generates some data according to a logistic model, then fits both logistic and Poisson regression models against it.
This program, defined below, takes in three arguments:
n
- Sample sizep
- Baseline probability of successb1
- Coefficient of interest.
The model is simply
program def binsim, rclass drop _all args n p b1 set obs `n' gen x = rnormal() gen y = rbinomial(1, invlogit(logit(`p') + `b1'*x)) * Return P(success) to ensure everything is working mean y mat b = e(b) scalar pp = b[1,1] return scalar pp=pp * Poisson model poisson y x mat b = e(b) scalar b_pois = b[1,1] return scalar b_pois=b_pois * Logistic model logistic y x mat b = e(b) scalar b_logit = b[1,1] return scalar b_logit=b_logit end
Results
We'll let b1
= .4 with n
= 10,000.
(n
needs to be large enough such that we have a
nontrivial amount of successes.) Choose from the tabs above the
p
to see how the results vary as the prevalence
approaches 0.
Run the simulation. We save the estimated proportion of successes
(pp
)to ensure the simulation worked as intended, as
well as the estimated coefficients from the Poisson model
(b_pois
) and logistic model (b_logit
).
. simulate pp=r(pp) b_pois=r(b_pois) b_logit=r(b_logit), /// > reps(1000) nodots: binsim 10000 .1.05.03.01.005 .4 Command: binsim 10000 .1.05.03.01.005 .4 pp: r(pp) b_pois: r(b_pois) b_logit: r(b_logit)
First we’ll ensure the code is working and that proportion of positive outcomes is approximately .1. .05. .03. .01. .005.
. mean pp Mean estimation Number of obs = 1,000 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ pp | .1057365 .0000978 .1055445 .1059285 --------------------------------------------------------------
. mean pp Mean estimation Number of obs = 1,000 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ pp | .0535497 .0000704 .0534115 .0536879 --------------------------------------------------------------
. mean pp Mean estimation Number of obs = 1,000 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ pp | .0322263 .0000573 .0321139 .0323387 --------------------------------------------------------------
. mean pp Mean estimation Number of obs = 1,000 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ pp | .0108073 .0000342 .0107402 .0108744 --------------------------------------------------------------
. mean pp Mean estimation Number of obs = 1,000 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ pp | .0053893 .0000223 .0053454 .0054332 --------------------------------------------------------------
Now we can examine the distributions of the two estimated coefficients. If Poisson is truly a good approximation, then the two distributions should be nearly identical.
. twoway kdensity b_logit || kdensity b_pois, /// > xline(.4) legend(label(1 "Logistic") label(2 "Poisson"))
We can estimate the proportion of bias.
. gen error = abs(b_logit - b_pois)/b_logit . mean error Mean estimation Number of obs = 1,000 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ error | .1195278 .0001314 .1192699 .1197857 --------------------------------------------------------------
. gen error = abs(b_logit - b_pois)/b_logit . mean error Mean estimation Number of obs = 1,000 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ error | .0616367 .0001023 .0614361 .0618374 --------------------------------------------------------------
. gen error = abs(b_logit - b_pois)/b_logit . mean error Mean estimation Number of obs = 1,000 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ error | .0375324 .0000839 .0373677 .0376971 --------------------------------------------------------------
. gen error = abs(b_logit - b_pois)/b_logit . mean error Mean estimation Number of obs = 1,000 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ error | .0127062 .0000505 .0126072 .0128053 --------------------------------------------------------------
. gen error = abs(b_logit - b_pois)/b_logit . mean error Mean estimation Number of obs = 1,000 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ error | .0064606 .0000391 .0063839 .0065374 --------------------------------------------------------------
Conclusion
When the prevalence is very low, Poisson is not a bad approximation, but I wouldn’t recommend using it over Logistic unless prevalence was 1% or less.