When can Poisson Regression approximate Logistic?

An older idea in Epidemiology is to use a Poisson regression model in place of a logistic regression model. This idea has some validity because with a low mean, the Poisson distribution approximates the binary distribution.

Simulation

Let’s examine this. First, we define a program which generates some data according to a logistic model, then fits both logistic and Poisson regression models against it.

This program, defined below, takes in three arguments:

n - Sample size
p - Baseline probability of success
b1 - Coefficient of interest.

The model is simply

logit (P (Y = 1 | X)) = logit (p) + b_{1} x

program def binsim, rclass
    drop _all
    args n p b1
    set obs `n'
    gen x = rnormal()
    gen y = rbinomial(1, invlogit(logit(`p') + `b1'*x))
    * Return P(success) to ensure everything is working
    mean y
    mat b = e(b)
    scalar pp = b[1,1]
    return scalar pp=pp

    * Poisson model
    poisson y x
    mat b = e(b)
    scalar b_pois = b[1,1]
    return scalar b_pois=b_pois

    * Logistic model
    logistic y x
    mat b = e(b)
    scalar b_logit = b[1,1]
    return scalar b_logit=b_logit
end

Results

Prevalence: 10% 5% 3% 1% .5%

We'll let b1 = .4 with n = 10,000. (n needs to be large enough such that we have a nontrivial amount of successes.) Choose from the tabs above the p to see how the results vary as the prevalence approaches 0.

Run the simulation. We save the estimated proportion of successes (pp)to ensure the simulation worked as intended, as well as the estimated coefficients from the Poisson model (b_pois) and logistic model (b_logit).

. simulate pp=r(pp) b_pois=r(b_pois) b_logit=r(b_logit), ///
>     reps(1000) nodots: binsim 10000 .1.05.03.01.005 .4

      Command: binsim 10000 .1.05.03.01.005 .4
           pp: r(pp)
       b_pois: r(b_pois)
      b_logit: r(b_logit)

First we’ll ensure the code is working and that proportion of positive outcomes is approximately .1. .05. .03. .01. .005.

. mean pp

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
          pp |   .1057365   .0000978      .1055445    .1059285
--------------------------------------------------------------

. mean pp

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
          pp |   .0535497   .0000704      .0534115    .0536879
--------------------------------------------------------------

. mean pp

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
          pp |   .0322263   .0000573      .0321139    .0323387
--------------------------------------------------------------

. mean pp

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
          pp |   .0108073   .0000342      .0107402    .0108744
--------------------------------------------------------------

. mean pp

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
          pp |   .0053893   .0000223      .0053454    .0054332
--------------------------------------------------------------

Now we can examine the distributions of the two estimated coefficients. If Poisson is truly a good approximation, then the two distributions should be nearly identical.

. twoway kdensity b_logit || kdensity b_pois, ///
>     xline(.4) legend(label(1 "Logistic") label(2 "Poisson"))

We can estimate the proportion of bias.

. gen error = abs(b_logit - b_pois)/b_logit

. mean error

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       error |   .1195278   .0001314      .1192699    .1197857
--------------------------------------------------------------

. gen error = abs(b_logit - b_pois)/b_logit

. mean error

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       error |   .0616367   .0001023      .0614361    .0618374
--------------------------------------------------------------

. gen error = abs(b_logit - b_pois)/b_logit

. mean error

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       error |   .0375324   .0000839      .0373677    .0376971
--------------------------------------------------------------

. gen error = abs(b_logit - b_pois)/b_logit

. mean error

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       error |   .0127062   .0000505      .0126072    .0128053
--------------------------------------------------------------

. gen error = abs(b_logit - b_pois)/b_logit

. mean error

Mean estimation                          Number of obs = 1,000

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       error |   .0064606   .0000391      .0063839    .0065374
--------------------------------------------------------------

Conclusion

When the prevalence is very low, Poisson is not a bad approximation, but I wouldn’t recommend using it over Logistic unless prevalence was 1% or less.