- Understand the application of Poisson regression to count data
- Understand how to fit Poisson regression models in R
- Understand how to evaluate model fits and diagnostics for Poisson regression
15 May 2020
Counts form the basis for much of our data in environmental sciences
Number of adult salmon returning to spawn in a river
Number of days of rain in a year
Number of bees visiting a flower
We have seen how to model proportional data with GLMs
\(k\) survivors out of \(n\) tagged individuals
\(k\) infected individuals out of \(n\) susceptible individuals
\(k\) counts of allele A in \(n\) total chromosomes
With count data, we only know the frequency of occurrence
That is, how often something occurred without knowing how often it did not occur
Standard regression models are inappropriate for count data for 4 reasons:
linear model might lead to predictions of negative counts
variance of the response variable may increase with the mean
errors are not normally distributed
zeros are difficult to transform
The Poisson distribution is perhaps the best known
It gives the probability of a given number of events occurring in a fixed interval of time or space
the number of Prussian soldiers killed by horse kicks per year from 1875 to 1894
the number of new COVID-19 infections per day in the US
the number of email messages I receive per week from students in QERM 514
It’s unique in that it has one parameter \(\lambda\) to describe both the mean and variance
\[ y_i \sim \text{Poisson}(\lambda) \] \[ \text{Mean}(y) = \text{Var}(y) = \lambda \]
As \(\lambda \rightarrow \infty\) the Poisson \(\rightarrow\) Normal
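The mean-variance property is easy to check by simulation; a minimal sketch (the seed and \(\lambda\) here are arbitrary choices, not from the lecture):

```r
## simulate a large Poisson sample and compare its mean and variance
set.seed(514)
lambda <- 5
y <- rpois(1e5, lambda)
## both should be close to lambda = 5
round(mean(y), 2)
round(var(y), 2)
```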
\[ f(y; \theta, \phi) = \exp \left( \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi)\right) \\ \Downarrow \\ f(y; \mu) = \frac{\exp (- \mu) \mu^y}{y!} \\ \]
with \(\theta = \log (\mu)\) and \(\phi = 1\)
\(a(\phi) = 1 ~~~~~~ b(\theta) = \exp (\theta) ~~~~~~ c(y, \phi) = - \log (y!)\)
An interesting property of the Poisson is that
\[ y_i \sim \text{Poisson}(\lambda_i) \\ \Downarrow \\ \sum_i y_i \sim \text{Poisson}(\sum_i \lambda_i) \]
This means we can use aggregated data if we lack individual-level data
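A quick simulation illustrates the aggregation property (the \(\lambda\) values are hypothetical):

```r
## three independent Poisson series with different means
set.seed(514)
lambdas <- c(2, 3, 7)
n <- 1e5
y_sum <- rpois(n, lambdas[1]) + rpois(n, lambdas[2]) + rpois(n, lambdas[3])
## the sum behaves like a single Poisson with lambda = 2 + 3 + 7 = 12
round(mean(y_sum), 2)
round(var(y_sum), 2)
```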
The Poisson distribution can also approximate a binomial distribution if \(n\) is large and \(p\) is small
As \(p \rightarrow 0\), \(\text{logit}(p) \rightarrow \log(p)\)
Binomial with logit link \(\rightarrow\) Poisson with log link
An example with \(p\) = 0.05 and \(n\) = 1000
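One way to see the approximation is to compare the two probability mass functions directly; a sketch using the stated values:

```r
## Binomial(n = 1000, p = 0.05) vs Poisson(lambda = n * p = 50)
n <- 1000
p <- 0.05
yy <- 20:80
pmf_binom <- dbinom(yy, n, p)
pmf_pois <- dpois(yy, n * p)
## the two pmfs are nearly indistinguishable
max(abs(pmf_binom - pmf_pois))
```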
We have been considering (log) density itself as a response
\[ \text{Density}_i = f (\text{Count}_i, \text{Area}_i) \\ \Downarrow \\ \text{Density}_i = \frac{\text{Count}_i}{\text{Area}_i} \\ \]
With GLMs, we can shift our focus to
\[ \text{Count}_i = f (\text{Area}_i) \]
Catches of spot prawns \(y_i\) as a function of bait type \(C_i\) and water temperature \(T_i\)
\[ \begin{aligned} \text{data distribution:} & ~~ y_i \sim \text{Poisson}(\lambda_i) \\ \\ \text{link function:} & ~~ \text{log}(\lambda_i) = \mu_i \\ \\ \text{linear predictor:} & ~~ \mu_i = \alpha + \beta_1 C_i + \beta_2 T_i \end{aligned} \]
## Poisson regression
cmod <- glm(catch ~ fish + temp, data = prawns,
            family = poisson(link = "log"))
faraway::sumary(cmod)
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 3.5644284  0.0906850  39.306 < 2.2e-16
## fish        0.0894061  0.0274085   3.262  0.001106
## temp        0.0256769  0.0087425   2.937  0.003314
##
## n = 113 p = 3
## Deviance = 135.32140 Null Deviance = 157.85016 (Difference = 22.52876)
We can easily estimate the CI’s on the model parameters with confint()
## CI's for prawn model
ci_prawns <- confint(cmod)
ci_tbl <- cbind(ci_prawns[,1], coef(cmod), ci_prawns[,2])
colnames(ci_tbl) <- c("Lower", "Estimate", "Upper")
signif(ci_tbl, 3)
##               Lower Estimate  Upper
## (Intercept) 3.39000   3.5600 3.7400
## fish        0.03570   0.0894 0.1430
## temp        0.00856   0.0257 0.0428
Due to possible bias in \(\text{SE}(\beta)\), we can compute CI’s based on the profile likelihood
## number of points to profile
nb <- 200
## possible beta's
beta_bait <- seq(0, 0.2, length = nb)
## calculate neg-LL of possible beta_bait
## fix beta_temp at its MLE
plb <- rep(NA, nb)
for(i in 1:nb) {
  mm <- glm(catch ~ 1 + offset(beta_bait[i] * fish + coef(cmod)[3] * temp),
            data = prawns, family = poisson(link = "log"))
  plb[i] <- -logLik(mm)
}
It’s natural to ask how well a model fits the data
As with binomial models, we can check the deviance \(D\) against a \(\chi^2\) distribution
Recall that the deviance for any model is
\[ D_i = \text{-}2 \left[ \log \mathcal{L}(M_i) - \log \mathcal{L}(M_S) \right] \]
where \(M_i\) is the model of interest and \(M_S\) is a saturated model with \(k = n\)
The log-likelihood for a Poisson is
\[ \log \mathcal{L}(y; \lambda) = \sum_{i=1}^{n} \left[ y_{i} \log (\lambda_i)- \lambda_i - \log \left( y_{i}! \right) \right] \]
The deviance for a Poisson is
\[ D = 2 \sum_{i=1}^{n} \left[ y_{i} \log (y_i / \hat{\lambda}_i) - (y_i - \hat{\lambda}_i) \right] \]
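This formula can be verified against the deviance that glm() reports; a sketch on simulated data (the prawn data aren't reproduced here, so the coefficients and seed are arbitrary):

```r
## simulate Poisson data and fit a GLM
set.seed(514)
x <- runif(50)
y <- rpois(50, exp(1 + 0.5 * x))
mod <- glm(y ~ x, family = poisson(link = "log"))
lam_hat <- fitted(mod)
## deviance by hand; the y * log(y / lambda) term is 0 when y = 0
D_hand <- 2 * sum(ifelse(y > 0, y * log(y / lam_hat), 0) - (y - lam_hat))
## matches glm()'s deviance
all.equal(D_hand, deviance(mod))
```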
\(H_0\): Our model is correctly specified
## sample size
nn <- nrow(prawns)
## deviance of prawn model
D_full <- summary(cmod)$deviance
## goodness-of-fit test with df = n - k
(p_value <- pchisq(D_full, nn - length(coef(cmod)), lower.tail = FALSE))
## [1] 0.05096932
We cannot reject the \(H_0\)
It turns out that the assumption of \(D \sim \chi^2_{n - k}\) can be violated with Poisson models unless \(\lambda\) is large
Another option is Pearson’s \(X^2\) statistic we saw for binomial models
Recall that Pearson’s \(X^2\) is
\[ X^2 = \sum_{i = 1}^n \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{(n - k)} \]
So for our Poisson model
\[ X^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\lambda}_i)^2}{\hat{\lambda}_i} \sim \chi^2_{n - k} \]
\(H_0\): Our model is correctly specified
## numerator
nm <- (prawns$catch - fitted(cmod))^2
## denominator
dm <- fitted(cmod)
## Pearson's X^2
X2 <- sum(nm / dm)
## test with df = n - k
(p_value <- pchisq(X2, nn - length(coef(cmod)), lower.tail = FALSE))
## [1] 0.07074179
We cannot reject the \(H_0\)
As with other models, it’s important to examine diagnostic checks for our fitted models
We can calculate the leverages \(h\) to look for unusual observations in predictor space
Recall we are potentially concerned about \(h > 2 \frac{k}{n}\)
We can use faraway::halfnorm()
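A minimal sketch of the leverage check on simulated data (the model and seed are illustrative, not the prawn fit):

```r
## fit a Poisson GLM to simulated data
set.seed(514)
x <- runif(50)
y <- rpois(50, exp(1 + 0.5 * x))
mod <- glm(y ~ x, family = poisson(link = "log"))
## leverages from the weighted hat matrix
hv <- hatvalues(mod)
k <- length(coef(mod))
n <- length(hv)
## flag observations above the 2k/n rule of thumb
which(hv > 2 * k / n)
## faraway::halfnorm(hv) would plot these on a half-normal scale
```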
Recall that we can use Cook’s \(D\) to identify potentially influential points
\[ D_{i}=e_{i}^{2} \frac{1}{k}\left(\frac{h_{i}}{1-h_{i}}\right) \]
In general we are concerned about \(D_i > F^{(0.5)}_{k, n - k} \approx 1\)
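R computes these directly with cooks.distance(); a sketch on simulated data (not the prawn fit):

```r
## fit a Poisson GLM to simulated data
set.seed(514)
x <- runif(50)
y <- rpois(50, exp(1 + 0.5 * x))
mod <- glm(y ~ x, family = poisson(link = "log"))
## Cook's distances combine residual size and leverage
cd <- cooks.distance(mod)
## any points near or above 1 deserve a closer look
which(cd > 1)
```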
We can use a likelihood ratio test to compare our model to an intercept-only model
## deviance of full model
D_full <- summary(cmod)$deviance
## deviance of null model
D_null <- summary(cmod)$null.deviance
## test statistic
lambda <- D_null - D_full
## LRT with df = 2
(p_value <- pchisq(lambda, 2, lower.tail = FALSE))
## [1] 1.282157e-05
We reject \(H_0\) (that the data come from the null model)