- Understand the difference between zero-truncated and zero-inflated data
- Understand how to model zero-truncated data
- Understand the differences between zero-altered and zero-inflated models
20 May 2020
Zero-truncated data cannot take a value of 0
Although somewhat rare in ecological studies, examples include
time a whale is at the surface before diving
herd size in elk
number of fin rays on a fish
Zero-truncated data are not necessarily a problem
Rather, an underlying assumption of Poisson or neg binomial may be the problem
Both the Poisson and negative binomial distributions assign non-zero probability to counts of zero
Recall that for \(y_i \sim \text{Poisson}(\lambda)\)
its probability mass function is
\[ f(y_i; \lambda_i) = \frac{\exp (\text{-} \lambda_i) \lambda_{i}^{y_i}}{y_i!} \]
\(f(y_i; \lambda_i)\) gives the probability of \(y_i \geq 0\)
The probability that \(y_i = 0\) is
\[ f(y_i; \lambda_i) = \frac{\exp (\text{-} \lambda_i) \lambda_{i}^{y_i}}{y_i!} \\ \Downarrow \\ \begin{align} f(y_i = 0; \lambda_i) &= \frac{\exp (\text{-} \lambda_i) \lambda_{i}^0}{0!} \\ &= \exp (\text{-} \lambda_i) \end{align} \]
The probability that \(y_i \neq 0\) is therefore
\[ f(y_i = 0; \lambda_i) = \exp (\text{-} \lambda_i) \\ \Downarrow \\ f(y_i \neq 0; \lambda_i) = 1 - \exp (\text{-} \lambda_i) \]
We can now exclude the probability that \(y_i = 0\) by dividing the pmf by the probability that \(y_i \neq 0\)
\[ f(y_i; \lambda_i) = \frac{\exp (\text{-} \lambda_i) \lambda_{i}^{y_i}}{y_i!} \\ \Downarrow \\ f(y_i; \lambda_i | y_i > 0) = \frac{\exp (\text{-} \lambda_i) \lambda_{i}^{y_i}}{y_i!} \cdot \frac{1}{1 - \exp (\text{-} \lambda_i)} \\ \Downarrow \\ \log \mathcal{L} = \left( y_i \log \lambda_i - \lambda_i \right) - \log \left( 1 - \exp (\text{-} \lambda_i) \right) \]
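As a sanity check, here is a minimal R sketch (not part of the original example) that builds the zero-truncated Poisson pmf by renormalizing dpois(); the value of \(\lambda\) is arbitrary

```r
## sketch: zero-truncated Poisson pmf via renormalization of dpois()
## (lambda = 2 is an arbitrary value for illustration)
dpois_zt <- function(y, lambda) {
  ## P(Y = y | Y > 0) = P(Y = y) / (1 - P(Y = 0)) for y > 0
  ifelse(y > 0, dpois(y, lambda) / (1 - exp(-lambda)), 0)
}

## compare regular and truncated pmf's for y = 1, ..., 5
round(rbind(poisson = dpois(1:5, lambda = 2),
            zt_poisson = dpois_zt(1:5, lambda = 2)), 3)
```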
Let’s consider some data presented in Zuur et al. (2009), which detail the number of days that carcasses from road-killed snakes stay on roads
The predictors are the total rainfall (mm) and an indicator of where on the pavement the snake was killed (lane or shoulder)
Let’s first consider a regular Poisson regression model
## Poisson regression
smod_pois <- glm(n_days ~ location + rain, data = snakes,
                 family = poisson(link = "log"))
## 
## Call:
## glm(formula = n_days ~ location + rain, family = poisson(link = "log"), 
##     data = snakes)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0560  -0.7981  -0.4738   0.3410   6.6474  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.439195   0.088539   4.960 7.03e-07 ***
## locationV   0.462663   0.119060   3.886 0.000102 ***
## rain        0.021707   0.003092   7.021 2.21e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 226.38  on 129  degrees of freedom
## Residual deviance: 175.80  on 127  degrees of freedom
## AIC: 497.63
## 
## Number of Fisher Scoring iterations: 5
Now let’s fit a zero-truncated Poisson regression model with vglm() from VGAM
library(VGAM)

## zero-truncated Poisson regression
smod_ztpois <- vglm(n_days ~ location + rain, data = snakes,
                    family = pospoisson)
## 
## Call:
## vglm(formula = n_days ~ location + rain, family = pospoisson, 
##     data = snakes)
## 
## Pearson residuals:
##                    Min      1Q  Median     3Q   Max
## loglink(lambda) -2.086 -0.8361 -0.7313 0.4636 12.54
## 
## Coefficients: 
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.041426   0.125111   0.331    0.741    
## locationV   0.711320   0.149272   4.765 1.89e-06 ***
## rain        0.027329   0.003326   8.217  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Name of linear predictor: loglink(lambda) 
## 
## Log-likelihood: -218.6267 on 127 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## No Hauck-Donner effect found in any of the estimates
Here are the parameter estimates and SE’s for both models
##             Poisson  Poisson SE  +Poisson  +Poisson SE
## (Intercept)   0.439       0.089     0.041        0.125
## locationV     0.463       0.119     0.711        0.149
## rain          0.022       0.003     0.027        0.003
Recall that for \(y_i \sim \text{negBinom}(r, \mu)\)
its probability mass function is
\[ f(y; \mu, r) = \frac{(y+r-1) !}{(r-1) ! y !} \left( \frac{r}{\mu + r} \right)^{r}\left( \frac{\mu}{\mu + r} \right)^{y} \]
\(f(y_i; \mu, r)\) gives the probability of \(y_i \geq 0\)
The probability that \(y_i = 0\) is
\[ f(y; r, \mu) = \frac{(y+r-1) !}{(r-1) ! y !} \left( \frac{r}{\mu + r} \right)^{r}\left( \frac{\mu}{\mu + r} \right)^{y} \\ \Downarrow \\ \begin{align} f(y_i = 0; r, \mu) &= \frac{(0+r-1) !}{(r-1) ! 0 !} \left( \frac{r}{\mu + r} \right)^{r} \left( \frac{\mu}{\mu + r} \right)^{0} \\ &= \left( \frac{r}{\mu + r} \right)^{r} \end{align} \]
The probability that \(y_i \neq 0\) is therefore
\[ f(y_i = 0; r, \mu_i) = \left( \frac{r}{\mu + r} \right)^{r} \\ \Downarrow \\ f(y_i \neq 0; r, \mu_i) = 1 - \left( \frac{r}{\mu + r} \right)^{r} \]
We can now exclude the probability that \(y_i = 0\) by dividing the pmf by the probability that \(y_i \neq 0\)
\[ f(y; r, \mu) = \frac{(y+r-1) !}{(r-1) ! y !} \left( \frac{r}{\mu + r} \right)^{r} \left( \frac{\mu}{\mu + r} \right)^{y} \\ \Downarrow \\ f(y_i; r, \mu | y_i > 0) = \frac{ \frac{(y+r-1) !}{(r-1) ! y !} \left( \frac{r}{\mu + r} \right)^{r} \left( \frac{\mu}{\mu + r} \right)^{y} }{ 1 - \left( \frac{r}{\mu + r} \right)^{r} } \\ \Downarrow \\ \log \mathcal{L} = \log \mathcal{L}(\text{NB}) - \log \left( 1 - \left( \frac{r}{\mu + r} \right)^{r} \right) \]
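The same renormalization can be sketched in R with dnbinom(); the values of \(r\) and \(\mu\) below are arbitrary

```r
## sketch: zero-truncated neg binomial pmf via renormalization of dnbinom()
## (r = 4 and mu = 2 are arbitrary values for illustration)
dnbinom_zt <- function(y, r, mu) {
  ## note: dnbinom(0, size = r, mu = mu) equals (r / (mu + r))^r
  p0 <- dnbinom(0, size = r, mu = mu)
  ifelse(y > 0, dnbinom(y, size = r, mu = mu) / (1 - p0), 0)
}

## compare regular and truncated pmf's for y = 1, ..., 5
round(rbind(neg_binom = dnbinom(1:5, size = 4, mu = 2),
            zt_neg_binom = dnbinom_zt(1:5, r = 4, mu = 2)), 3)
```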
Let’s first consider a regular negative binomial regression model
## load MASS pkg
library(MASS)

## neg binomial regression
smod_nb <- glm.nb(n_days ~ location + rain, data = snakes,
                  link = "log")
## 
## Call:
## glm.nb(formula = n_days ~ location + rain, data = snakes, link = "log", 
##     init.theta = 4.153875871)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6698  -0.7268  -0.3911   0.3106   4.4713  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.418101   0.106859   3.913 9.13e-05 ***
## locationV   0.453524   0.151708   2.989  0.00279 ** 
## rain        0.025127   0.004529   5.548 2.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.1539) family taken to be 1)
## 
##     Null deviance: 128.298  on 129  degrees of freedom
## Residual deviance:  94.918  on 127  degrees of freedom
## AIC: 469.28
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.15 
##           Std. Err.:  1.17 
## 
##  2 x log-likelihood:  -461.279
Now let’s fit a zero-truncated neg binomial regression model with vglm() from VGAM
library(VGAM)

## zero-truncated neg binomial regression
smod_ztnb <- vglm(n_days ~ location + rain, data = snakes,
                  family = posnegbinomial)
## 
## Call:
## vglm(formula = n_days ~ location + rain, family = posnegbinomial, 
##     data = snakes)
## 
## Pearson residuals:
##                    Min     1Q Median     3Q     Max
## loglink(munb)  0.05515 1.4856 3.7166 4.5336 28.4735
## loglink(size) -0.82119 0.7007 0.7756 0.8972  0.9741
## 
## Coefficients: 
##                Estimate Std. Error  z value Pr(>|z|)    
## (Intercept):1 -17.97164    1.94802   -9.226   <2e-16 ***
## (Intercept):2 -19.22278    0.08787 -218.753   <2e-16 ***
## locationV       0.96528    2.93334    0.329    0.742    
## rain            0.06482    0.10346    0.627    0.531    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: loglink(munb), loglink(size)
## 
## Log-likelihood: -186.8024 on 256 degrees of freedom
## 
## Number of Fisher scoring iterations: 3 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):1'
Here are the parameter estimates and SE’s for both models
##                 NB  NB SE     +NB  +NB SE
## (Intercept)  0.418  0.107 -17.972   1.948
## locationV    0.454  0.152   0.965   2.933
## rain         0.025  0.005   0.065   0.103
Lots of count data are zero-inflated
The data contain more zeros than would be expected under a Poisson or negative binomial distribution
In general, there are 4 different types of errors that cause zeros
Structural (an animal is absent because the habitat is unsuitable)
Design (sampling is limited temporally or spatially)
Observer error (inexperience or difficult circumstances)
Process error (habitat is suitable but unused)
Image from Blasco-Moreno et al (2019)
There are 2 general approaches for dealing with zero-inflated data
Zero-altered (“hurdle”) models
Zero-inflated (“mixture”) models
Hurdle models do not discriminate among the 4 types of zeros
The data are treated as 2 distinct groups:
Zeros
Non-zero counts
Image from Zuur et al (2009)
Hurdle models consist of 2 parts
Use a binomial model to determine the probability of a zero
If non-zero (“over the hurdle”), use a truncated Poisson or negative binomial to model the positive counts
A zero-altered Poisson (ZAP) model is given by
\[ f_{\text{ZAP}}(y; \pi, \lambda) = \left\{ \begin{array}{lc} f_{\text{binomial}}(y = 0; \pi) \\ \left[1 - f_{\text{binomial}}(y = 0; \pi) \right] \times \left( \frac{f_{\text{Poisson}}(y; \lambda)}{1 - f_{\text{Poisson}}(y = 0; \lambda)} \right) \end{array} \right. \]
\(\pi\) is the probability of finding any individuals
\(\lambda\) is the mean (and variance) of the positive counts
We can model both parameters as functions of covariates
Probability of detection
\[ \text{logit}(\pi) = \mathbf{X}_d \boldsymbol{\beta}_d \]
Mean and variance of the positive counts
\[ \log(\lambda) = \mathbf{X}_c \boldsymbol{\beta}_c \]
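To see how the two pieces fit together, here is a minimal R sketch (not from the original example) of the ZAP pmf, built from the binomial zero part and the zero-truncated Poisson count part; the values of \(\pi\) and \(\lambda\) are arbitrary

```r
## sketch: ZAP pmf built from its binomial and truncated-Poisson pieces
## (pi and lambda values are arbitrary for illustration)
dzap <- function(y, pi, lambda) {
  ## Pr(y = 0) = 1 - pi; Pr(y = k > 0) = pi * truncated Poisson pmf
  ifelse(y == 0,
         1 - pi,
         pi * dpois(y, lambda) / (1 - exp(-lambda)))
}

## the probabilities over y = 0, 1, 2, ... sum to 1
sum(dzap(0:100, pi = 0.7, lambda = 3))
```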
Let’s apply a ZAP model to survey data for hippos
We’ll assume the following
the probability of finding hippos increases with water availability
the number of hippos increases with tree density
Detection as a function of water availability \(W\)
\[ z_i \sim \text{Bernoulli}(\pi_i) \\ \text{logit}(\pi) = \gamma_0 + \gamma_1 W_i \]
Positive counts as a function of tree density \(T\)
\[ c_i \sim \text{Poisson}^+(\lambda_i) \\ \log(\lambda) = \beta_0 + \beta_1 T_i \]
Total counts as a function of detections and positive counts
\[ y_i = z_i c_i \]
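The counts analyzed below come from a simulation; as one hedged sketch, data like these could be generated from the equations above (the sample size, parameter values, and covariate ranges here are made up, so they will not reproduce the output shown below)

```r
## sketch: simulate hippo survey data under the ZAP model above
## (sample size, parameter values & covariate ranges are hypothetical)
set.seed(123)
nn <- 200

## covariates: water availability (W) and tree density (T), scaled 0-1
water <- runif(nn, 0, 1)
trees <- runif(nn, 0, 1)

## detection: logit(pi) = g0 + g1 * W
g0 <- -1; g1 <- 3
z <- rbinom(nn, 1, plogis(g0 + g1 * water))

## positive counts: log(lambda) = b0 + b1 * T, drawn from a
## zero-truncated Poisson via the inverse-CDF method
b0 <- 2; b1 <- 0.75
lambda <- exp(b0 + b1 * trees)
c_pos <- qpois(runif(nn, ppois(0, lambda), 1), lambda)

## total counts: y = z * c
y <- z * c_pos
```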
We can fit ZAP models in R with hurdle() from the pscl package
The formula for ZAP models is specified as
y ~ predictors_of_counts | predictors_for_detection
## load pscl
library(pscl)

## fit hurdle model
hippo_zap <- hurdle(y ~ trees | water)
summary(hippo_zap)
## 
## Call:
## hurdle(formula = y ~ trees | water)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2165 -0.7104 -0.4803  0.9193  2.6988 
## 
## Count model coefficients (truncated poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.05104    0.06955  29.492  < 2e-16 ***
## trees        0.74967    0.10843   6.914 4.71e-12 ***
## Zero hurdle model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.7326     0.3519  -4.924 8.48e-07 ***
## water         2.3422     0.5676   4.126 3.69e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 11 
## Log-likelihood: -326.1 on 4 Df
A zero-altered negative binomial (ZANB) model is given by
\[ f_{\text{ZANB}}(y; \pi, \mu, r) = \left\{ \begin{array}{lc} f_{\text{binomial}}(y = 0; \pi) \\ \left[1 - f_{\text{binomial}}(y = 0; \pi) \right] \times \left( \frac{f_{\text{NB}}(y; \mu, r)}{1 - f_{\text{NB}}(y = 0; \mu, r)} \right) \end{array} \right. \]
\(\pi\) is the probability of finding any individuals
\(\mu\) is the mean of the positive counts
\(r\) is the scale for the positive counts
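A ZANB can be fit with the same hurdle() function by setting its dist argument to "negbin"; for example, with the hypothetical hippo variables above

```r
## sketch: zero-altered neg binomial (ZANB) version of the hippo model
library(pscl)
hippo_zanb <- hurdle(y ~ trees | water, dist = "negbin")
```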
Zero-inflated (mixture) models treat the zeros as coming from 2 sources
observation errors (missed detections)
ecological (function of environment)
Image from Zuur et al (2009)
Zero-inflated (mixture) models consist of 2 parts
Use a binomial model to determine the probability of a zero
Use a Poisson or negative binomial to model counts, which can include zeros
Probability of a zero count comes from 2 sources:
false zeros (missed detections)
true zeros (ecological reasons)
Pr(zero) = Pr(false zero) + [1 - Pr(false zero)] \(\times\) Pr(count = 0)
A zero-inflated Poisson (ZIP) model is given by
\[ \begin{align} f_{\text{ZIP}}(y = 0) &= f_{\text{Binomial}}(\pi) + [1 - f_{\text{Binomial}}(\pi)] f_{\text{Poisson}}(y = 0; \lambda) \\ ~ \\ f_{\text{ZIP}}(y | y > 0) &= [1 - f_{\text{Binomial}}(\pi)] f_{\text{Poisson}}(y; \lambda) \\ \end{align} \]
\(\pi\) is the probability of false zeros (missed detections)
\(\lambda\) is the mean (and variance) of all counts (including zeros)
We can model both parameters as functions of covariates
Probability of detection
\[ \text{logit}(\pi) = \mathbf{X}_d \boldsymbol{\beta}_d \]
Mean and variance of the counts
\[ \log(\lambda) = \mathbf{X}_c \boldsymbol{\beta}_c \]
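For comparison with the ZAP sketch above, here is a minimal ZIP pmf in R (again with arbitrary \(\pi\) and \(\lambda\)); note that zeros now come from both the binomial and the Poisson parts

```r
## sketch: ZIP pmf as a mixture of false zeros and Poisson counts
## (pi and lambda values are arbitrary for illustration)
dzip <- function(y, pi, lambda) {
  ## Pr(y = 0) gets an extra mass of pi from missed detections
  ifelse(y == 0,
         pi + (1 - pi) * dpois(0, lambda),
         (1 - pi) * dpois(y, lambda))
}

## the probabilities over y = 0, 1, 2, ... sum to 1
sum(dzip(0:100, pi = 0.3, lambda = 3))
```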
Let’s apply a ZIP model to survey data for white tailed deer
We’ll assume the following
the probability of detecting deer decreases with tree density
the number of deer increases with tree density
Non-detection as a function of tree density \(T\)
\[ z_i \sim \text{Bernoulli}(\pi_i) \\ \text{logit}(\pi) = \gamma_0 + \gamma_1 T_i \]
Counts as a function of tree density \(T\)
\[ c_i \sim \text{Poisson}(\lambda_i) \\ \log(\lambda) = \beta_0 + \beta_1 T_i \]
Total counts as a function of detections and positive counts
\[ y_i = (1 - z_i) c_i \]
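As with the hippo example, the deer counts below come from a simulation; one hedged sketch of how such data could be generated from the equations above (the sample size and parameter values are made up, so they will not reproduce the output shown below)

```r
## sketch: simulate deer survey data under the ZIP model above
## (sample size, parameter values & covariate range are hypothetical)
set.seed(456)
nn <- 200

## covariate: tree density (T), scaled 0-1
trees <- runif(nn, 0, 1)

## non-detection (false zero): logit(pi) = g0 + g1 * T
g0 <- -1; g1 <- 2.5
z <- rbinom(nn, 1, plogis(g0 + g1 * trees))

## counts: log(lambda) = b0 + b1 * T
b0 <- 1.5; b1 <- 1
cc <- rpois(nn, exp(b0 + b1 * trees))

## observed counts: y = (1 - z) * c
y <- (1 - z) * cc
```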
We can fit ZIP models in R with zeroinfl() from the pscl package
The formula for ZIP models is specified as
y ~ predictors_of_counts | predictors_for_detection
## fit zero-inflated model
deer_zip <- zeroinfl(y ~ trees | trees)
summary(deer_zip)
## 
## Call:
## zeroinfl(formula = y ~ trees | trees)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7312 -0.5103 -0.3674 -0.2611  4.1697 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.5190     0.1183  12.841  < 2e-16 ***
## trees         1.0660     0.2272   4.693 2.69e-06 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.2625     0.3438   0.764 0.445104    
## trees         2.4158     0.7048   3.428 0.000609 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 12 
## Log-likelihood: -185.8 on 4 Df
A zero-inflated negative binomial (ZINB) model is given by
\[ \begin{align} f_{\text{ZINB}}(y = 0) &= f_{\text{Binomial}}(\pi) + [1 - f_{\text{Binomial}}(\pi)] f_{\text{NB}}(y = 0; \mu, r) \\ ~ \\ f_{\text{ZINB}}(y | y > 0) &= [1 - f_{\text{Binomial}}(\pi)] f_{\text{NB}}(y; \mu, r) \\ \end{align} \]
\(\pi\) is the probability of false zeros (missed detections)
\(\mu\) is the mean of all counts (including zeros)
\(r\) is the scale of the counts
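A ZINB can be fit with the same zeroinfl() function by setting its dist argument to "negbin"; for example, with the hypothetical deer variables above

```r
## sketch: zero-inflated neg binomial (ZINB) version of the deer model
library(pscl)
deer_zinb <- zeroinfl(y ~ trees | trees, dist = "negbin")
```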
A general approach for handling zero-inflated data is to

- Understand the system of interest
  - Formulate good hypotheses and create a robust study design
- Detect and classify zeros
  - Remove false zeros due to design or observer errors
- Identify suitable covariates for zeros & non-zeros
  - What are the causes of zeros (and non-zeros)?
- Test for overdispersion (a quick check is sketched after this list)
- Choose an appropriate model
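As a rough screen for overdispersion (one of several possible checks), we can compare the sum of squared Pearson residuals from a Poisson fit to its residual degrees of freedom; for the snake model fit earlier

```r
## rough overdispersion check for the Poisson snake model
## (a ratio well above 1 suggests overdispersion)
sum(residuals(smod_pois, type = "pearson")^2) / df.residual(smod_pois)
```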
| Source | Reason | Over-dispersion | Zero inflation | Approach |
|---|---|---|---|---|
| Random | Sampling variability | No | No | Poisson |
| | | Yes | No | Neg binomial |
| Structural | Outside count process | No | Yes | ZAP or ZIP |
| | | Yes | Yes | ZANB or ZINB |