These lab exercises focus on fitting and evaluating generalized linear mixed models (GLMM). As we saw in lecture, GLMMs are relatively new and therefore the existing literature is rather sparse and technical. The good news, though, is that GLMMs follow similar methods to GLMs and LMMs. The bad news is that there are multiple approaches for fitting models, each with different pros and cons.
Method | Advantages | Disadvantages | R functions |
---|---|---|---|
Penalized quasi-likelihood | Flexible, widely implemented | inference may be inappropriate; potentially biased | MASS::glmmPQL |
Laplace approximation | More accurate than PQL | Slower and less flexible than PQL | lme4::glmer glmmsr::glmm glmmML::glmmML |
Gauss-Hermite quadrature | More accurate than Laplace | Slower than Laplace; limited to a single random effect | lme4::glmer glmmsr::glmm glmmML::glmmML |
Just like GLMs, GLMMs have three components:
Distribution of the data \(f(y; \theta)\)
Link function \(g(\eta)\)
Linear predictor \(\eta\)
For GLMMs, our linear predictor also includes random effects, such that
\[ \eta = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \alpha_0 + \alpha_1 z_1 + \dots + \alpha_l z_l\\ \Downarrow \\ \eta = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\alpha} \]
where the \(\beta_i\) are fixed effects of the covariates \(x_i\), and the \(\alpha_j\) are random effects associated with the \(z_j\).
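To make these pieces concrete, here is a minimal base-R sketch of the three components for a Poisson GLMM with a log link (the sample size, grouping structure, and values of \(\boldsymbol{\beta}\) and \(\boldsymbol{\alpha}\) below are made up purely for illustration):

```r
## sketch of the three GLMM components (Poisson response, log link);
## all of the values here are made up for illustration
set.seed(1)
n <- 100
x1 <- runif(n, -1, 1)                  # covariate for the fixed effect
grp <- sample(1:5, n, replace = TRUE)  # grouping factor for the random effect
beta <- c(1, 0.5)                      # fixed effects (intercept and slope)
alpha <- rnorm(5, 0, 0.3)              # random group-level intercepts
## linear predictor: eta = X %*% beta + Z %*% alpha
eta <- beta[1] + beta[2] * x1 + alpha[grp]
## link function: g() is log, so the mean is g^{-1}(eta) = exp(eta)
mu_i <- exp(eta)
## distribution of the data: y ~ Poisson(mu_i)
y <- rpois(n, mu_i)
```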
Let’s consider a long-term study of invasive brown tree snakes in Guam, which were introduced to the island shortly after WWII. These snakes are voracious predators on native birds and other vertebrates, and there are many efforts underway to understand their ecology and possible control measures.
Our data consist of counts of the number of eggs per female at 23 locations over 14 years. We are interested in the fixed effect of body size and the random effects of location and year. We’ll begin with only the effects of body size and location. Here are the simulated data.
## code adapted from Sarah Converse (2019)
set.seed(514)
## sample and group sizes
nn <- 234
n_loc <- 23
n_yrs <- 14
## grand mean
mu <- 1
## slope
beta <- 0.5
## random effects
re_loc <- rnorm(n_loc, 0, 0.4)
re_year <- rnorm(n_yrs, 0, 0.4)
## covariates
loc <- sample(seq(n_loc), nn, replace = TRUE)
year <- sample(seq(n_yrs), nn, replace = TRUE)
size <- runif(nn, -1, 1)
## mean number of eggs (renamed to avoid masking base::mean)
lambda <- exp(mu + re_loc[loc] + re_year[year] + beta * size)
## number of eggs
eggs <- rpois(nn, lambda)
## data frame
df_eggs <- data.frame(eggs, size, loc, year)
## plot area
par(mai = c(0.9, 0.9, 0.1, 0.1),
omi = c(0, 0, 0, 0), bg = NA,
cex.main = 1.2, cex.lab = 1.2)
## histogram of eggs
hist(eggs, breaks = seq(0, max(eggs)), las = 1, col = "brown", border = "gray",
main = "", xlab = "Number of eggs")
We fit PQL models with MASS::glmmPQL().
## load MASS
library(MASS)
## fit model
snakes_pql <- glmmPQL(eggs ~ size, random = ~1 | loc, data = df_eggs,
family = poisson)
## model summary
summary(snakes_pql)
## Linear mixed-effects model fit by maximum likelihood
## Data: df_eggs
## AIC BIC logLik
## NA NA NA
##
## Random effects:
## Formula: ~1 | loc
## (Intercept) Residual
## StdDev: 0.5077229 1.183238
##
## Variance function:
## Structure: fixed weights
## Formula: ~invwt
## Fixed effects: eggs ~ size
## Value Std.Error DF t-value p-value
## (Intercept) 1.1247363 0.11687536 210 9.623383 0
## size 0.5079533 0.07916825 210 6.416124 0
## Correlation:
## (Intr)
## size -0.069
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.7744344 -0.7176552 -0.2481373 0.5028263 3.3994803
##
## Number of Observations: 234
## Number of Groups: 23
We can fit Laplace models with lme4::glmer().
## load lme4
library(lme4)
## fit model
snakes_lap <- glmer(eggs ~ size + (1 | loc), data = df_eggs,
family = poisson)
## model summary
summary(snakes_lap)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: eggs ~ size + (1 | loc)
## Data: df_eggs
##
## AIC BIC logLik deviance df.resid
## 1006.7 1017.1 -500.4 1000.7 231
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1158 -0.8480 -0.2741 0.5931 4.0679
##
## Random effects:
## Groups Name Variance Std.Dev.
## loc (Intercept) 0.2753 0.5247
## Number of obs: 234, groups: loc, 23
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.09929 0.11726 9.374 < 2e-16 ***
## size 0.50619 0.06644 7.619 2.56e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## size -0.054
One question people have is, "What is the Correlation of Fixed Effects in the output from glmer()?" Here the Correlation of Fixed Effects is not the correlation between (among) the predictors (i.e., it does not involve multicollinearity). Rather, it concerns the expected correlation of the estimated model coefficients. So, in this case, if you repeated the study and the coefficient for size got smaller, there is some chance the intercept (Intr) would get bigger (i.e., there is a small negative correlation between size and Intr). In general, this phenomenon is expected for regression models.
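As a quick check on that interpretation, here is a simulation sketch (not part of the lab data; the model and values are made up for illustration): repeatedly simulate from the same Poisson regression, refit the model each time, and compute the correlation between the estimated intercept and slope across replicates.

```r
## sketch: correlation of coefficient estimates across repeated "studies";
## the model and parameter values here are made up for illustration
set.seed(42)
n_sim <- 200
ests <- matrix(NA, n_sim, 2, dimnames = list(NULL, c("Intr", "slope")))
for(i in seq(n_sim)) {
  x <- runif(50, 0, 1)                  # covariate with a positive mean
  y <- rpois(50, exp(0.5 + 1 * x))      # Poisson GLM with log link
  ests[i, ] <- coef(glm(y ~ x, family = poisson))
}
## estimates of the intercept and slope are negatively correlated
cor(ests[, "Intr"], ests[, "slope"])
```

Because size in our simulated data is centered near zero, the analogous correlation in the glmer() output is small (-0.054); a covariate with a mean far from zero would produce a much stronger negative correlation.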
We can fit GHQ models with lme4::glmer(..., nAGQ = pts), but note that this method only works with one random effect.
## fit model
snakes_ghq <- glmer(eggs ~ size + (1 | loc), data = df_eggs,
family = poisson, nAGQ = 20)
## model summary
summary(snakes_ghq)
## Generalized linear mixed model fit by maximum likelihood (Adaptive
## Gauss-Hermite Quadrature, nAGQ = 20) [glmerMod]
## Family: poisson ( log )
## Formula: eggs ~ size + (1 | loc)
## Data: df_eggs
##
## AIC BIC logLik deviance df.resid
## 397.7 408.1 -195.9 391.7 231
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1159 -0.8479 -0.2739 0.5929 4.0681
##
## Random effects:
## Groups Name Variance Std.Dev.
## loc (Intercept) 0.2761 0.5254
## Number of obs: 234, groups: loc, 23
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.09919 0.11754 9.352 < 2e-16 ***
## size 0.50618 0.06681 7.576 3.56e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## size -0.054
A common question is how many quadrature points one should use when fitting GHQ models. I have not seen good guidance on this, but one option is to fit a series of models with increasing values for nAGQ and see how the estimates change.
ngq <- seq(2, 20)
## empty matrix for coefs
gq <- matrix(NA, length(ngq), 3)
colnames(gq) <- c("nAGQ", "beta", "SD_RE")
gq[,1] <- ngq
## loop over increasing nAGQ
for(i in ngq) {
tmp <- glmer(eggs ~ size + (1 | loc), data = df_eggs,
family = poisson, nAGQ = i)
## fixed effect of size
gq[i-1, 2] <- fixef(tmp)["size"]
## SD of the location random effect (varcor holds the variance)
gq[i-1, 3] <- sqrt(unlist(summary(tmp)$varcor))
}
## inspect them
gq
## nAGQ beta SD_RE
## [1,] 2 0.5061948 0.5245482
## [2,] 3 0.5061908 0.5248403
## [3,] 4 0.5061823 0.5254263
## [4,] 5 0.5061789 0.5254034
## [5,] 6 0.5061805 0.5254234
## [6,] 7 0.5061761 0.5254274
## [7,] 8 0.5061827 0.5254323
## [8,] 9 0.5061827 0.5254323
## [9,] 10 0.5061827 0.5254323
## [10,] 11 0.5061827 0.5254323
## [11,] 12 0.5061827 0.5254323
## [12,] 13 0.5061827 0.5254323
## [13,] 14 0.5061827 0.5254323
## [14,] 15 0.5061827 0.5254323
## [15,] 16 0.5061827 0.5254323
## [16,] 17 0.5061827 0.5254323
## [17,] 18 0.5061827 0.5254323
## [18,] 19 0.5061827 0.5254323
## [19,] 20 0.5061827 0.5254323
It looks like the results have converged by nAGQ \(\geq\) 8.
Here is a summary of the results from the three methods.
## betas and SE's
tbl_smry <- cbind(summary(snakes_pql)$tTable[, 1:2],
summary(snakes_lap)$coefficients[, 1:2],
summary(snakes_ghq)$coefficients[, 1:2])
## SD for RE's
loc_SD <- c(c(sqrt(insight::get_variance_random(snakes_pql)), NA),
c(sqrt(insight::get_variance_random(snakes_lap)), NA),
c(sqrt(insight::get_variance_random(snakes_ghq)), NA))
## table of results
tbl_smry <- rbind(tbl_smry, "location SD" = loc_SD)
colnames(tbl_smry) <- c(" PQL ", "SE ", " Laplace", "SE ", " GHQ ", "SE ")
round(tbl_smry, 3)
## PQL SE Laplace SE GHQ SE
## (Intercept) 1.125 0.117 1.099 0.117 1.099 0.118
## size 0.508 0.079 0.506 0.066 0.506 0.067
## location SD 0.508 NA 0.525 NA 0.525 NA
What if we also want to include the random effect of year? If so, our options are much more limited: glmmPQL only allows for nested random effects and glmer(..., nAGQ = pts) only allows for one random effect, but we can use the Laplace approximation via glmer().
## fit model
snakes_lap_2 <- glmer(eggs ~ size + (1 | loc) + (1 | year),
data = df_eggs, family = poisson)
## model summary
summary(snakes_lap_2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: eggs ~ size + (1 | loc) + (1 | year)
## Data: df_eggs
##
## AIC BIC logLik deviance df.resid
## 928.8 942.6 -460.4 920.8 230
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.7498 -0.6251 -0.0568 0.5055 3.5431
##
## Random effects:
## Groups Name Variance Std.Dev.
## loc (Intercept) 0.2522 0.5022
## year (Intercept) 0.1557 0.3945
## Number of obs: 234, groups: loc, 23; year, 14
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.03612 0.15518 6.677 2.44e-11 ***
## size 0.51380 0.07063 7.274 3.48e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## size -0.048
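If we want to look at the random effects themselves, lme4 provides VarCorr() and ranef(); here is a sketch assuming the snakes_lap_2 fit and the simulated re_year values from above (ranef() returns the conditional modes of the random effects, which we can compare to the true simulated values).

```r
## SDs of the location and year random effects
VarCorr(snakes_lap_2)
## conditional modes of the year effects vs the simulated truth
re_year_hat <- ranef(snakes_lap_2)$year[, 1]
round(cbind(true = re_year, estimated = re_year_hat), 2)
```

Note the shrinkage of the estimated year effects toward zero relative to the truth, which is expected for the predicted random effects in a mixed model.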
The literature on model selection for GLMMs is thin at best. What does exist suggests that you should include all of the random effects that you think are important based on the system of interest, the design of the experiment or observational study, and limitations in the data. The idea is that these should not be “tested” for inclusion/exclusion, but rather purposefully included to account for blocks, groups, pseudoreplication, etc. The fixed effects within a GLMM can then be evaluated via AIC.
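As a sketch of that workflow with our simulated data (assuming the df_eggs data frame from above), we can hold the random effects constant and compare fixed-effect structures with AIC:

```r
## same random effects; fixed effects with and without body size
snakes_null <- glmer(eggs ~ 1 + (1 | loc) + (1 | year),
                     data = df_eggs, family = poisson)
snakes_size <- glmer(eggs ~ size + (1 | loc) + (1 | year),
                     data = df_eggs, family = poisson)
## compare via AIC; given the strong effect of size in the simulation,
## the model with size should be heavily favored here
AIC(snakes_null, snakes_size)
```

Because glmer() fits these models by maximum likelihood, their AIC values are directly comparable when only the fixed effects differ.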
We did not cover Bayesian methods in class, but they are an exceptionally powerful and flexible way of estimating the parameters in all kinds of linear and nonlinear models. One of the languages for doing Bayesian estimation is Stan, which uses Hamiltonian Monte Carlo to estimate model parameters. The Stan website also contains a large and growing list of tutorials.
The developers of Stan have also created a package called rstanarm, which uses a notation very similar to glm() and glmer(). Jonah Gabry and Ben Goodrich have written a vignette on using rstanarm for GLMMs, including a comparison with glmer().