When can Poisson Regression approximate Logistic?

An older idea in epidemiology is to use a Poisson regression model in place of a logistic regression model. This idea has some validity because, when its mean is low, a Poisson variable almost never exceeds 1, so its distribution approximates that of a binary (Bernoulli) outcome.
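To make the "low mean" argument concrete, here is a short sketch of the two approximations involved. The symbols $\mu$ (Poisson mean) and $p$ (success probability) are notation introduced here for illustration and are not part of the simulation code.

```latex
% A Poisson(\mu) variable rarely exceeds 1 when \mu is small:
\begin{align*}
P(Y = 0) &= e^{-\mu}, \qquad P(Y = 1) = \mu e^{-\mu}, \\
P(Y \ge 2) &= 1 - e^{-\mu}(1 + \mu) \approx \frac{\mu^2}{2}.
\end{align*}
% Example: \mu = 0.1 gives P(Y \ge 2) \approx 0.005.

% The log link (Poisson) and the logit link (logistic) also nearly agree:
\begin{align*}
\operatorname{logit}(p) = \log p - \log(1 - p) \approx \log p + p .
\end{align*}
% The gap between the two links is roughly p, which vanishes as p \to 0.
```

Both gaps shrink as the outcome becomes rarer, which is why the approximation is expected to improve at low prevalence.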

Simulation

Let’s examine this. First, we define a program which generates some data according to a logistic model, then fits both logistic and Poisson regression models against it.

This program, defined below, takes in three arguments:

• n - Sample size
• p - Baseline probability of success
• b1 - Coefficient of interest

The model is simply

$\operatorname{logit}(P(Y = 1 \mid X = x)) = \operatorname{logit}(p) + b_1 x$
program define binsim, rclass
    drop _all
    args n p b1
    set obs `n'
    * Simulate a standard normal predictor and a binary outcome
    gen x = rnormal()
    gen y = rbinomial(1, invlogit(logit(`p') + `b1'*x))

    * Return P(success) to ensure everything is working
    mean y
    mat b = e(b)
    scalar pp = b[1,1]
    return scalar pp = pp

    * Poisson model: b[1,1] is the coefficient on x
    poisson y x
    mat b = e(b)
    scalar b_pois = b[1,1]
    return scalar b_pois = b_pois

    * Logistic model: e(b) holds the log-odds coefficients,
    * even though -logistic- displays odds ratios
    logistic y x
    mat b = e(b)
    scalar b_logit = b[1,1]
    return scalar b_logit = b_logit
end
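Before running the full simulation, it can help to call the program once and look at what it returns. The following is a minimal sketch, assuming binsim has already been defined in the current session; the seed value is arbitrary and only there for reproducibility.

```stata
* Assumes the binsim program above is already defined in this session.
set seed 20240101          // arbitrary seed, chosen only for reproducibility

* One simulated data set: n = 10000, baseline probability .1, b1 = .4
binsim 10000 .1 .4

* The three returned scalars: r(pp), r(b_pois), r(b_logit)
return list

* Difference between the two slope estimates for this single draw
display "Poisson minus logistic coefficient: " r(b_pois) - r(b_logit)
```

A single draw says little about the typical difference, which is why the coefficients are collected over many replications.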

Results

We'll let b1 = .4 with n = 10,000. (n needs to be large enough that we have a nontrivial number of successes.) The simulation can be run at baseline probabilities p of 10%, 5%, 3%, 1%, and .5% to see how the results vary as the prevalence approaches 0; the output shown here is for p = 10%.

Run the simulation. We save the estimated proportion of successes (pp) to ensure the simulation worked as intended, as well as the estimated coefficients from the Poisson model (b_pois) and the logistic model (b_logit).

. simulate pp=r(pp) b_pois=r(b_pois) b_logit=r(b_logit), ///
>     reps(1000) nodots: binsim 10000 .1 .4

Command: binsim 10000 .1 .4
pp: r(pp)
b_pois: r(b_pois)
b_logit: r(b_logit)


First we’ll ensure the code is working and that the proportion of positive outcomes is approximately .1, the baseline probability passed to binsim.

. mean pp

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
|       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
pp |   .1057365   .0000978      .1055445    .1059285
--------------------------------------------------------------
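The mean sits slightly above .1 rather than exactly on it. One plausible explanation is that p is the success probability at x = 0, while pp averages over the whole distribution of x. The second-order approximation below is a sketch of that argument, not part of the original analysis; $f$ denotes the inverse logit and $a = \operatorname{logit}(p)$, both introduced here for illustration.

```latex
% Marginal prevalence when x ~ N(0, 1), expanded around x = 0:
\begin{align*}
E\left[f(a + b_1 x)\right] &\approx f(a) + \frac{b_1^2}{2} f''(a),
\qquad f''(a) = p(1 - p)(1 - 2p), \\
\text{with } p = .1,\ b_1 = .4: \quad
& .1 + \frac{.16}{2} \times (.1)(.9)(.8) = .1 + .08 \times .072 \approx .1058 .
\end{align*}
```

This is in line with the simulated mean of about .1057, so the small excess over .1 does not indicate a problem with the code.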

Now we can examine the distributions of the two estimated coefficients. If Poisson is truly a good approximation, then the two distributions should be nearly identical.

. twoway kdensity b_logit || kdensity b_pois, ///
>     xline(.4) legend(label(1 "Logistic") label(2 "Poisson"))

We can estimate the relative difference between the two coefficients, taking the logistic estimate as the reference.

. gen error = abs(b_logit - b_pois)/b_logit

. mean error

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
|       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
error |   .1195278   .0001314      .1192699    .1197857
--------------------------------------------------------------
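The same comparison can be repeated across several baseline probabilities in one pass. The loop below is a hedged sketch, assuming binsim is defined in the current session. The variable name relerr and the seed are illustrative choices, and the number of replications can be reduced if the run takes too long.

```stata
* Assumes the binsim program above is already defined in this session.
set seed 20240101          // arbitrary seed, chosen only for reproducibility

foreach p in .1 .05 .03 .01 .005 {
    * Each call to simulate replaces the data in memory with the results
    quietly simulate pp=r(pp) b_pois=r(b_pois) b_logit=r(b_logit), ///
        reps(1000) nodots: binsim 10000 `p' .4

    * Relative difference, with the logistic estimate as the reference
    generate relerr = abs(b_logit - b_pois)/b_logit
    quietly summarize relerr
    display "p = `p'" _col(12) "mean relative difference = " %6.4f r(mean)
}
```

At the smallest probabilities only a few dozen successes are expected per sample of 10,000, so individual estimates become noisy and a larger n may be worth considering.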

Conclusion

When the prevalence is very low, Poisson is not a bad approximation, but at 10% prevalence the two coefficients differed by about 12% on average. I wouldn’t recommend using it in place of logistic regression unless the prevalence is 1% or less.