Background

These lab exercises focus on fitting and evaluating generalized linear mixed models (GLMM). As we saw in lecture, GLMMs are relatively new and therefore the existing literature is rather sparse and technical. The good news, though, is that GLMMs follow similar methods to GLMs and LMMs. The bad news is that there are multiple approaches for fitting models, each with different pros and cons.


Method Advantages Disadvantages R functions
Penalized quasi-likelihood Flexible, widely implemented inference may be inappropriate; potentially biased MASS::glmmPQL
Laplace approximation More accurate than PQL Slower and less flexible than PQL lme4::glmer glmmsr::glmm glmmML::glmmML
Gauss-Hermite quadrature More accurate than Laplace Slower than Laplace; limited random effects lme4::glmer glmmsr::glmm glmmML::glmmML


Just like GLMs, GLMMs have three components:

  1. Distribution of the data \(f(y; \theta)\)

  2. Link function \(g(\eta)\)

  3. Linear predictor \(\eta\)

For GLMMs, our linear predictor also includes random effects, such that

\[ \eta = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \alpha_0 + \alpha_1 z_1 + \dots + \alpha_l z_l\\ \Downarrow \\ \eta = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\alpha} \]

where the \(\beta_i\) are fixed effects of the covariates \(x_i\).

Model for tree snakes

Let’s consider a long-term study of invasive brown tree snakes in Guam, which were introduced to the island shortly after WWII. Theses snakes are voracious predators on native birds and other vertebrates, and there are many efforts underway to understand their ecology and possible control measures.

Our data consist of counts of the number of eggs per female at 23 locations over 14 years. We are interested in the fixed effect of body size and the random effects of location and year. We’ll begin with only the effects of body size and location. Here are the simulated data.

## code adapted from Sarah Converse (2019)
set.seed(514)
## sample and group sizes
nn <- 234
n_loc <- 23
n_yrs <- 14
## grand mean
mu <- 1
## slope
beta <- 0.5
## random effects
re_loc <- rnorm(n_loc, 0, 0.4)
re_year <- rnorm(n_yrs, 0, 0.4)
## covariates
loc <- sample(seq(n_loc), nn, replace = TRUE)
year <- sample(seq(n_yrs), nn, replace = TRUE)
size <- runif(nn, -1, 1)

## mean number of eggs 
mean <- exp(mu + re_loc[loc] + re_year[year] + beta * size)
## number of eggs
eggs <- rpois(nn,mean)
## data frame
df_eggs <- data.frame(cbind(eggs, size, loc, year))

## plot area
par(mai = c(0.9, 0.9, 0.1, 0.1),
    omi = c(0, 0, 0, 0), bg = NA,
    cex.main = 1.2, cex.lab = 1.2)
## histogram of eggs
hist(eggs, breaks = seq(0, max(eggs)), las = 1, col = "brown", border = "gray",
     main = "", xlab = "Number of eggs")

Penalized quasi-likelihood

We fit PQL models with MASS::glmPQL()

## load MASS
library(MASS)
## fit model
snakes_pql <- glmmPQL(eggs ~ size, random = ~1 | loc, data = df_eggs,
                      family = poisson)
## model summary
summary(snakes_pql)
## Linear mixed-effects model fit by maximum likelihood
##  Data: df_eggs 
##   AIC BIC logLik
##    NA  NA     NA
## 
## Random effects:
##  Formula: ~1 | loc
##         (Intercept) Residual
## StdDev:   0.5077229 1.183238
## 
## Variance function:
##  Structure: fixed weights
##  Formula: ~invwt 
## Fixed effects: eggs ~ size 
##                 Value  Std.Error  DF  t-value p-value
## (Intercept) 1.1247363 0.11687536 210 9.623383       0
## size        0.5079533 0.07916825 210 6.416124       0
##  Correlation: 
##      (Intr)
## size -0.069
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.7744344 -0.7176552 -0.2481373  0.5028263  3.3994803 
## 
## Number of Observations: 234
## Number of Groups: 23

Laplace approximation

We can fit Laplace models with lme4::glmer()

## load lme4
library(lme4)
## fit model
snakes_lap <- glmer(eggs ~ size + (1 | loc), data = df_eggs,
                    family = poisson)
## model summary
summary(snakes_lap)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: eggs ~ size + (1 | loc)
##    Data: df_eggs
## 
##      AIC      BIC   logLik deviance df.resid 
##   1006.7   1017.1   -500.4   1000.7      231 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1158 -0.8480 -0.2741  0.5931  4.0679 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  loc    (Intercept) 0.2753   0.5247  
## Number of obs: 234, groups:  loc, 23
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.09929    0.11726   9.374  < 2e-16 ***
## size         0.50619    0.06644   7.619 2.56e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## size -0.054

One question people have is, "What is the Correlation of Fixed Effects in the output from glmer()? Here the Correlation of Fixed Effects is not the correlation between (among) the predictors (i.e., it does not involve multicollinearity). Rather, it concerns the expected correlation of the model coefficients. So, in this case, if you repeated the study and the coefficient for size got smaller, there is some chance the intercept (Intr) would get bigger (i.e., there is a small negative correlation between size and Intr). In general, this phenomenon is expected for regression models.

Gauss-Hermite quadrature

We can fit GHQ models with lme4::glmer(..., nAGQ = pts), but note that this method only works with one random effect.

## fit model
snakes_ghq <- glmer(eggs ~ size + (1 | loc), data = df_eggs,
                    family = poisson, nAGQ = 20)
## model summary
summary(snakes_ghq)
## Generalized linear mixed model fit by maximum likelihood (Adaptive
##   Gauss-Hermite Quadrature, nAGQ = 20) [glmerMod]
##  Family: poisson  ( log )
## Formula: eggs ~ size + (1 | loc)
##    Data: df_eggs
## 
##      AIC      BIC   logLik deviance df.resid 
##    397.7    408.1   -195.9    391.7      231 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1159 -0.8479 -0.2739  0.5929  4.0681 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  loc    (Intercept) 0.2761   0.5254  
## Number of obs: 234, groups:  loc, 23
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.09919    0.11754   9.352  < 2e-16 ***
## size         0.50618    0.06681   7.576 3.56e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## size -0.054

How many knots?

A common question is how many knots one should use when fitting GHQ models? I have not seen good guidance on this, but one option is to fit a series of models with increasing values for nAGQ and see how the estimates change.

ngq <- seq(2, 20)
## empty matrix for coefs
gq <- matrix(NA, length(ngq), 3)
colnames(gq) <- c("nAGQ", "beta", "SD_RE")
gq[,1] <- ngq
## loop over increasing nAGQ
for(i in ngq) {
  tmp <- glmer(eggs ~ size + (1 | loc), data = df_eggs,
                    family = poisson, nAGQ = i)
  gq[i-1,2] <- coef(tmp)$loc[1,2]
  gq[i-1,3] <- unlist(summary(tmp)$varcor)^2
}
## inspect them
gq
##       nAGQ      beta      SD_RE
##  [1,]    2 0.5061948 0.07570794
##  [2,]    3 0.5061908 0.07587672
##  [3,]    4 0.5061823 0.07621622
##  [4,]    5 0.5061789 0.07620287
##  [5,]    6 0.5061805 0.07621448
##  [6,]    7 0.5061761 0.07621682
##  [7,]    8 0.5061827 0.07621968
##  [8,]    9 0.5061827 0.07621968
##  [9,]   10 0.5061827 0.07621968
## [10,]   11 0.5061827 0.07621968
## [11,]   12 0.5061827 0.07621968
## [12,]   13 0.5061827 0.07621968
## [13,]   14 0.5061827 0.07621968
## [14,]   15 0.5061827 0.07621968
## [15,]   16 0.5061827 0.07621968
## [16,]   17 0.5061827 0.07621968
## [17,]   18 0.5061827 0.07621968
## [18,]   19 0.5061827 0.07621968
## [19,]   20 0.5061827 0.07621968

It looks like the results converge for nAGQ \(\geq\) 7.

Model comparison

Here is a summary of the results from the 3 methods

## betas and SE's
tbl_smry <- cbind(summary(snakes_pql)$tTable[, 1:2],
                  summary(snakes_lap)$coefficients[, 1:2],
                  summary(snakes_ghq)$coefficients[, 1:2])
## SD for RE's
loc_SD <- c(c(sqrt(insight::get_variance_random(snakes_pql)), NA),
            c(sqrt(insight::get_variance_random(snakes_lap)), NA),
            c(sqrt(insight::get_variance_random(snakes_ghq)), NA))
## table of results
tbl_smry <- rbind(tbl_smry, "location SD" = loc_SD)
colnames(tbl_smry) <- c("    PQL ", "SE ", "   Laplace", "SE ", "     GHQ ", "SE ")
round(tbl_smry, 3)
##                 PQL    SE     Laplace   SE       GHQ    SE 
## (Intercept)    1.125 0.117      1.099 0.117     1.099 0.118
## size           0.508 0.079      0.506 0.066     0.506 0.067
## location SD    0.508    NA      0.525    NA     0.525    NA

2+ random effects

What if we also want to include the random effect of year? If so, our options are much more limited. glmmPQL only allows for nested random effects and glmer(..., nAGQ = pts) only allows for one random effect, but we can use the Laplace approximation via glmer.

## fit model
snakes_lap_2 <- glmer(eggs ~ size + (1 | loc) + (1 | year),
                      data = df_eggs, family = poisson)
## model summary
summary(snakes_lap_2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: eggs ~ size + (1 | loc) + (1 | year)
##    Data: df_eggs
## 
##      AIC      BIC   logLik deviance df.resid 
##    928.8    942.6   -460.4    920.8      230 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.7498 -0.6251 -0.0568  0.5055  3.5431 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  loc    (Intercept) 0.2522   0.5022  
##  year   (Intercept) 0.1557   0.3945  
## Number of obs: 234, groups:  loc, 23; year, 14
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.03612    0.15518   6.677 2.44e-11 ***
## size         0.51380    0.07063   7.274 3.48e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## size -0.048

Model selection

The literature on model selection for GLMMs is thin at best. What does exist suggests that you should include all of the random effects that you think are important based on the system of interest, the design of the experiment or observational study, and limitations in the data. The idea is that these should not be “tested” for inclusion/exclusion, but rather purposefully included to account for blocks, groups, pseudorepliction, etc. The fixed effects within a GLMM can then be evaluated via AIC.

Bayesian estimation

We did not cover Bayesian methods in class, but they are an exceptionally powerful and flexible way of estimating the parameters in all kinds of linear and nonlinear models. One of the languages for doing Bayesian estimation is Stan, which uses Hamiltonian Monte Carlo to estimate model parameters. The Stan website also contains a large and growing list of tutorials.

The developers of Stan have also created a package called rstanarm, which uses a notation very similar to glm() and glmer(). Jonah Gabry and Ben Goodrich have written a vignette on using rstanarm for GLMMs including a comparison with glmer().