These lab exercises focus on fitting and evaluating generalized linear mixed models (GLMM). As we saw in lecture, GLMMs are relatively new and therefore the existing literature is rather sparse and technical. The good news, though, is that GLMMs follow similar methods to GLMs and LMMs. The bad news is that there are multiple approaches for fitting models, each with different pros and cons.
| Method | Advantages | Disadvantages | R functions |
|---|---|---|---|
| Penalized quasi-likelihood (PQL) | Flexible; widely implemented | Inference may be inappropriate; potentially biased | MASS::glmmPQL |
| Laplace approximation | More accurate than PQL | Slower and less flexible than PQL | lme4::glmer, glmmsr::glmm, glmmML::glmmML |
| Gauss-Hermite quadrature | More accurate than Laplace | Slower than Laplace; limited to simple random effects | lme4::glmer, glmmsr::glmm, glmmML::glmmML |
Just like GLMs, GLMMs have three components:

1. Distribution of the data \(f(y; \theta)\)
2. Link function \(g(\eta)\)
3. Linear predictor \(\eta\)
For GLMMs, our linear predictor also includes random effects, such that
\[ \eta = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \alpha_0 + \alpha_1 z_1 + \dots + \alpha_l z_l\\ \Downarrow \\ \eta = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\alpha} \]
where the \(\beta_i\) are fixed effects of the covariates \(x_i\).
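To make the matrix form concrete, here is a small base-R sketch that builds \(\eta = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\alpha}\) by hand; all of the numbers and names here are hypothetical, chosen only for illustration.

```r
## sketch of eta = X %*% beta + Z %*% alpha with made-up numbers
set.seed(1)
nn <- 6
x1 <- rnorm(nn)                      ## one fixed-effect covariate
grp <- factor(c(1, 1, 2, 2, 3, 3))   ## grouping factor for random intercepts
## design matrices
X <- model.matrix(~ x1)              ## fixed effects: intercept + slope
Z <- model.matrix(~ grp - 1)         ## random effects: group indicators
## hypothetical coefficients
beta <- c(1, 0.5)                    ## fixed effects
alpha <- rnorm(3, 0, 0.4)            ## random intercepts
## linear predictor and mean
eta <- X %*% beta + Z %*% alpha
mu <- exp(eta)                       ## inverse of the log link for a Poisson GLMM
```

For a Poisson GLMM with a log link, exp() is the inverse link that maps the linear predictor back onto the scale of the mean count.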
Let’s consider a long-term study of invasive brown tree snakes in Guam, which were introduced to the island shortly after WWII. These snakes are voracious predators on native birds and other vertebrates, and many efforts are underway to understand their ecology and possible control measures.
Our data consist of counts of the number of eggs per female at 23 locations over 14 years. We are interested in the fixed effect of body size and the random effects of location and year. We’ll begin with only the effects of body size and location. Here are the simulated data.
## code adapted from Sarah Converse (2019)
set.seed(514)
## sample and group sizes
nn <- 234
n_loc <- 23
n_yrs <- 14
## grand mean
mu <- 1
## slope
beta <- 0.5
## random effects
re_loc <- rnorm(n_loc, 0, 0.4)
re_year <- rnorm(n_yrs, 0, 0.4)
## covariates
loc <- sample(seq(n_loc), nn, replace = TRUE)
year <- sample(seq(n_yrs), nn, replace = TRUE)
size <- runif(nn, -1, 1)
## mean number of eggs (avoid `mean` as a name; it masks base::mean)
lambda <- exp(mu + re_loc[loc] + re_year[year] + beta * size)
## number of eggs
eggs <- rpois(nn, lambda)
## data frame
df_eggs <- data.frame(eggs, size, loc, year)
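Before plotting, a quick sanity check using the objects simulated above: the location and year effects should inflate the variance of the counts beyond their mean, which is exactly the extra-Poisson variation a GLMM is designed to absorb.

```r
## random effects induce extra-Poisson variation in the counts
mean(eggs)
var(eggs)  ## should exceed the mean for a mixed Poisson
```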
## plot area
par(mai = c(0.9, 0.9, 0.1, 0.1),
omi = c(0, 0, 0, 0), bg = NA,
cex.main = 1.2, cex.lab = 1.2)
## histogram of eggs
hist(eggs, breaks = seq(0, max(eggs)), las = 1, col = "brown", border = "gray",
main = "", xlab = "Number of eggs")
We can fit PQL models with MASS::glmmPQL().
## load MASS
library(MASS)
## fit model
snakes_pql <- glmmPQL(eggs ~ size, random = ~1 | loc, data = df_eggs,
family = poisson)
## model summary
summary(snakes_pql)
## Linear mixed-effects model fit by maximum likelihood
## Data: df_eggs
## AIC BIC logLik
## NA NA NA
##
## Random effects:
## Formula: ~1 | loc
## (Intercept) Residual
## StdDev: 0.4425735 1.197006
##
## Variance function:
## Structure: fixed weights
## Formula: ~invwt
## Fixed effects: eggs ~ size
## Value Std.Error DF t-value p-value
## (Intercept) 1.1282992 0.1046315 210 10.783555 0
## size 0.4159611 0.0785370 210 5.296371 0
## Correlation:
## (Intr)
## size -0.072
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.8204199 -0.7979912 -0.1019043 0.5053257 3.8209070
##
## Number of Observations: 234
## Number of Groups: 23
We can fit Laplace models with lme4::glmer().
## load lme4
library(lme4)
## fit model
snakes_lap <- glmer(eggs ~ size + (1 | loc), data = df_eggs,
family = poisson)
## model summary
summary(snakes_lap)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: eggs ~ size + (1 | loc)
## Data: df_eggs
##
## AIC BIC logLik -2*log(L) df.resid
## 1038.5 1048.8 -516.2 1032.5 231
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1953 -0.9501 -0.0997 0.6020 4.5178
##
## Random effects:
## Groups Name Variance Std.Dev.
## loc (Intercept) 0.2185 0.4674
## Number of obs: 234, groups: loc, 23
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.09806 0.10638 10.322 < 2e-16 ***
## size 0.41662 0.06529 6.382 1.75e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## size -0.059
One question people have is, “What is the Correlation of Fixed Effects in the output from glmer()?” It is not the correlation between (or among) the predictors themselves (i.e., it does not involve multicollinearity). Rather, it is the expected correlation of the estimated model coefficients. In this case, if you repeated the study and the estimated coefficient for size came out smaller, the estimated intercept (Intr) would tend to come out slightly larger (i.e., there is a small negative correlation between size and Intr). Some correlation among coefficient estimates is expected in nearly all regression models.
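The same phenomenon shows up in ordinary regression, so here is a minimal lm() analogue with made-up data: the model-based correlation of the coefficient estimates is the off-diagonal of their (rescaled) covariance matrix, and centering the covariate largely removes it.

```r
## correlation of coefficient estimates in a simple linear regression
set.seed(514)
x <- runif(100, 0, 2)               ## covariate with a positive mean
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
## model-based correlation of the estimated intercept and slope
cov2cor(vcov(fit))["(Intercept)", "x"]  ## negative, as in the glmer() output
## centering the covariate removes (essentially all of) the correlation
fit_c <- lm(y ~ I(x - mean(x)))
cov2cor(vcov(fit_c))[1, 2]              ## essentially zero
```

This is one reason centering covariates is often recommended: it makes the intercept and slope estimates (nearly) uncorrelated and the intercept more interpretable.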
We can fit GHQ models with lme4::glmer(..., nAGQ = pts), where pts is the number of quadrature points, but note that this method only works for models with a single scalar random effect (e.g., one random intercept).
## fit model
snakes_ghq <- glmer(eggs ~ size + (1 | loc), data = df_eggs,
family = poisson, nAGQ = 20)
## model summary
summary(snakes_ghq)
## Generalized linear mixed model fit by maximum likelihood (Adaptive
## Gauss-Hermite Quadrature, nAGQ = 20) [glmerMod]
## Family: poisson ( log )
## Formula: eggs ~ size + (1 | loc)
## Data: df_eggs
##
## AIC BIC logLik -2*log(L) df.resid
## 408.9 419.3 -201.5 402.9 231
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1954 -0.9500 -0.0993 0.6018 4.5173
##
## Random effects:
## Groups Name Variance Std.Dev.
## loc (Intercept) 0.2192 0.4682
## Number of obs: 234, groups: loc, 23
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.09793 0.10667 10.292 < 2e-16 ***
## size 0.41663 0.06563 6.348 2.18e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## size -0.059
A common question is how many quadrature points one should use when fitting GHQ models. I have not seen good guidance on this, but one practical option is to fit a series of models with increasing values of nAGQ and see how the estimates change.
## quadrature points to try
ngq <- seq(2, 20)
## empty matrix for coefs
gq <- matrix(NA, length(ngq), 3)
colnames(gq) <- c("nAGQ", "beta", "SD_RE")
gq[, 1] <- ngq
## loop over increasing nAGQ
for(i in seq_along(ngq)) {
tmp <- glmer(eggs ~ size + (1 | loc), data = df_eggs,
family = poisson, nAGQ = ngq[i])
## fixed effect of size
gq[i, 2] <- fixef(tmp)["size"]
## SD of the location random effect
gq[i, 3] <- sqrt(unlist(summary(tmp)$varcor))
}
## inspect them
gq
## nAGQ beta SD_RE
## [1,] 2 0.4166241 0.4672928
## [2,] 3 0.4166296 0.4675825
## [3,] 4 0.4166318 0.4681963
## [4,] 5 0.4166277 0.4681860
## [5,] 6 0.4166288 0.4681965
## [6,] 7 0.4166296 0.4682089
## [7,] 8 0.4166296 0.4682089
## [8,] 9 0.4166296 0.4682089
## [9,] 10 0.4166296 0.4682089
## [10,] 11 0.4166296 0.4682089
## [11,] 12 0.4166296 0.4682089
## [12,] 13 0.4166296 0.4682089
## [13,] 14 0.4166296 0.4682089
## [14,] 15 0.4166296 0.4682089
## [15,] 16 0.4166296 0.4682089
## [16,] 17 0.4166296 0.4682089
## [17,] 18 0.4166296 0.4682089
## [18,] 19 0.4166296 0.4682089
## [19,] 20 0.4166296 0.4682089
It looks like the results converge for nAGQ \(\geq\) 7.
Here is a summary of the results from the three methods.
## betas and SE's
tbl_smry <- cbind(summary(snakes_pql)$tTable[, 1:2],
summary(snakes_lap)$coefficients[, 1:2],
summary(snakes_ghq)$coefficients[, 1:2])
## SD for RE's
loc_SD <- c(c(sqrt(insight::get_variance_random(snakes_pql)), NA),
c(sqrt(insight::get_variance_random(snakes_lap)), NA),
c(sqrt(insight::get_variance_random(snakes_ghq)), NA))
## table of results
tbl_smry <- rbind(tbl_smry, "location SD" = loc_SD)
colnames(tbl_smry) <- c(" PQL ", "SE ", " Laplace", "SE ", " GHQ ", "SE ")
round(tbl_smry, 3)
## PQL SE Laplace SE GHQ SE
## (Intercept) 1.128 0.105 1.098 0.106 1.098 0.107
## size 0.416 0.079 0.417 0.065 0.417 0.066
## location SD 0.443 NA 0.467 NA 0.468 NA
What if we also want to include the random effect of year? Here our options are much more limited: glmmPQL() only allows for nested random effects, and glmer(..., nAGQ = pts) only allows for one random effect. We can, however, use the Laplace approximation via glmer().
## fit model
snakes_lap_2 <- glmer(eggs ~ size + (1 | loc) + (1 | year),
data = df_eggs, family = poisson)
## model summary
summary(snakes_lap_2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: eggs ~ size + (1 | loc) + (1 | year)
## Data: df_eggs
##
## AIC BIC logLik -2*log(L) df.resid
## 966.3 980.1 -479.1 958.3 230
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.64621 -0.75136 -0.04156 0.55940 2.89614
##
## Random effects:
## Groups Name Variance Std.Dev.
## loc (Intercept) 0.2638 0.5137
## year (Intercept) 0.1397 0.3737
## Number of obs: 234, groups: loc, 23; year, 14
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.04073 0.15306 6.799 1.05e-11 ***
## size 0.46766 0.06605 7.081 1.44e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## size -0.043
The literature on model selection for GLMMs is thin at best. What does exist suggests that you should include all of the random effects you believe are important based on the system of interest, the design of the experiment or observational study, and limitations in the data. The idea is that these random effects should not be “tested” for inclusion or exclusion, but rather purposefully included to account for blocks, groups, pseudoreplication, etc. The fixed effects within a GLMM can then be evaluated via AIC.
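As a sketch of that workflow (reusing the df_eggs data from above; the model names snakes_int and snakes_size are hypothetical), one could hold the random-effects structure fixed and compare candidate fixed-effects structures via AIC:

```r
## hold the random effects fixed; compare fixed-effects structures via AIC
library(lme4)
snakes_int <- glmer(eggs ~ 1 + (1 | loc) + (1 | year),
                    data = df_eggs, family = poisson)
snakes_size <- glmer(eggs ~ size + (1 | loc) + (1 | year),
                     data = df_eggs, family = poisson)
AIC(snakes_int, snakes_size)
```

Because glmer() fits GLMMs by maximum likelihood, the AIC values here are directly comparable.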
We did not cover Bayesian methods in class, but they are an exceptionally powerful and flexible way of estimating the parameters in all kinds of linear and nonlinear models. One of the languages for doing Bayesian estimation is Stan, which uses Hamiltonian Monte Carlo to estimate model parameters. The Stan website also contains a large and growing list of tutorials.
The developers of Stan have also created a package
called rstanarm,
which uses a notation very similar to glm() and
glmer(). Jonah Gabry and Ben Goodrich have written a vignette
on using rstanarm for GLMMs, including a comparison with
glmer().
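For example, a Bayesian version of our final model could be fit with rstanarm::stan_glmer(), whose formula interface mirrors glmer(); the sampler settings shown here are illustrative, not recommendations, and this is not run as part of the lab.

```r
## Bayesian GLMM via Stan; the formula syntax mirrors glmer()
library(rstanarm)
snakes_stan <- stan_glmer(eggs ~ size + (1 | loc) + (1 | year),
                          data = df_eggs, family = poisson,
                          chains = 4, iter = 2000, seed = 514)
## posterior summaries of the fixed effects and random-effect SDs
summary(snakes_stan)
```

Rather than point estimates and standard errors, the output summarizes posterior distributions for the intercept, the size effect, and the location and year SDs.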