Background

These lab exercises focus on fitting and evaluating linear mixed models. We will be working with some hypothetical survey data on the abundance of seabirds and waterfowl in Puget Sound with the intention of estimating trends among locations over time. The data consist of an aggregate index of abundance for 50+ species, which is based upon counts of birds per unit effort at each of 9 sites (ie, 3 sites in each of the north, central, and south basins). The surveys were conducted annually from 2000 to 2018.

Data ingest

The data are contained in the accompanying file seabirds.csv, which consists of a column for year plus 9 more columns with the abundance index at each site. Thus, we will need to reshape the data from its “wide” format into a “long” format where each row of the data frame consists of a single record for year, site, and index.

## get raw data
raw_data <- read.csv("seabirds.csv")

## seq of years of data
years <- raw_data$year

## number of years of data
n_yrs <- length(years)

## number of sites
n_sites <- ncol(raw_data) - 1

## take a peek
head(raw_data)
##   year  N_1  N_2  N_3  C_1  C_2  C_3  S_1  S_2  S_3
## 1 2000 26.3 20.0 14.9 26.2 20.0 15.9 14.9 14.5 18.0
## 2 2001 16.4 23.2 22.7 24.3 20.8 19.9 28.7 18.2 21.0
## 3 2002 23.7 18.8 18.5 27.4 21.2 21.5 27.2 17.7 21.8
## 4 2003 23.5 14.8 20.7 25.2 20.0 21.1 17.8 26.5 20.9
## 5 2004 25.6 19.8 19.9 32.2 22.4 17.1 21.9 18.2 23.3
## 6 2005 25.2 21.1 17.9 34.0 21.9 20.1 21.1 13.8 18.7

Let’s plot the index for each of the sites over time.

## set color palette
clr <- viridis::plasma(n_sites, begin = 0.2, end = 0.8)

## plot the raw data
par(mai = c(0.9, 0.9, 0.6, 0.1),
    omi = c(0, 0, 0, 0))
matplot(raw_data[,1], raw_data[,-1], pch = 16, las = 1, col = clr,
        ylab = "Seabird abundance index", xlab = "Year")
Figure 1. Time series of abundance of Puget Sound seabirds at 9 different sites.


Data formatting

To reshape the data, we’ll use pivot_longer() from the tidyr package (see ?tidyr::pivot_longer for more info). Before doing so, however, we’ll replace the actual years with a simple time index that runs from 1 to 19, which will adjust our model intercept term to be relative to the year 2000 instead of the year 0 AD.

## tidy the data
## replace years with time index from 1:n_yrs
raw_data[,"year"] <- seq(n_yrs)

## convert from wide to long
df <- tidyr::pivot_longer(as.data.frame(raw_data), -year,
                          names_to = "site", values_to = "index")

## convert the tibble to a data frame
df <- as.data.frame(df)

## inspect our tidy data
head(df, 10)
##    year site index
## 1     1  N_1  26.3
## 2     1  N_2  20.0
## 3     1  N_3  14.9
## 4     1  C_1  26.2
## 5     1  C_2  20.0
## 6     1  C_3  15.9
## 7     1  S_1  14.9
## 8     1  S_2  14.5
## 9     1  S_3  18.0
## 10    2  N_1  16.4

Fitting models with lmer()

As we saw briefly in lecture, we cannot fit mixed models with our standard lm() function. Instead, we will use the lmer() function in the lme4 package by Bates et al. (2015).

As with other statistical models, lmer() uses a standard formula notation with the response and expression separated by ~.

response ~ expression

The expression includes any fixed effects (FE) and random effects (RE) separated by +. Each RE term takes the form (RE_expr | factor), where the vertical bar | indicates that the random effects given by RE_expr vary among the levels of factor.
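For reference, here is a summary of the common random-effects specifications (a sketch; y, x, g, g1, and g2 are hypothetical placeholders, not objects from this lab):

```r
## common random-effects terms in lme4 formulas
## (y, x, g, g1, g2 are hypothetical placeholders)
y ~ x + (1 | g)              ## random intercept for each level of g
y ~ x + (x | g)              ## correlated random intercepts & slopes
y ~ x + (x || g)             ## uncorrelated random intercepts & slopes
y ~ x + (1 | g1) + (1 | g2)  ## random intercepts for 2 grouping factors
```

Note that formulas in R do not evaluate their symbols, so these expressions are valid even before any data exist.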

Random effects for group means

Consider a simple one-way ANOVA model for an observation \(i\) from group \(j\) \((y_{ij})\) that includes a global mean \(\mu\) plus the group-level means \(\alpha_j\), such that

\[ y_{ij} = \mu + \alpha_j + \epsilon_{ij} \\ \alpha_{j} \sim \text{N}(0, \sigma_\alpha^2) \\ \epsilon_{ij} \sim \text{N}(0, \sigma_\epsilon^2) \]

Assuming the data y and group ID group are contained in data_frame, we would fit this model using

lmer(y ~ 1 + (1 | group), data = data_frame)

where the first 1 represents \(\mu\) and (1 | group) represents the \(\alpha_j\) (ie, a random effect for each group). Just as with lm(), we can drop the leading 1 + to make the global mean implicit, such that

lmer(y ~ (1 | group), data = data_frame)
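Before turning to the seabird data, here is a quick sketch that simulates data from the one-way model above and checks that lmer() recovers the variance components (all object names here are hypothetical):

```r
library(lme4)

set.seed(42)

## simulate 30 groups of 10 observations each
n_grp <- 30
n_obs <- 10
mu <- 5             ## global mean
sigma_alpha <- 2    ## SD of group-level effects
sigma_eps <- 1      ## residual SD

group <- factor(rep(seq(n_grp), each = n_obs))
alpha <- rnorm(n_grp, 0, sigma_alpha)
y <- mu + alpha[group] + rnorm(n_grp * n_obs, 0, sigma_eps)

## fit the one-way random effects model
sim_fit <- lmer(y ~ (1 | group))

## estimated SDs should be near 2 (group) and 1 (residual)
as.data.frame(VarCorr(sim_fit))
```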

Random intercepts & fixed slope

Now consider an ANCOVA model for an observation \(i\) from group \(j\) \((y_{ij})\) that includes a fixed effect (slope) \(\beta_1\) of a predictor \(x\) plus group-level offsets \(\alpha_j\) to the overall intercept \(\beta_0\), such that

\[ y_{ij} = (\beta_0 + \alpha_j) + \beta_1 x_{ij} + \epsilon_{ij} \\ \alpha_{j} \sim \text{N}(0, \sigma_\alpha^2) \\ \epsilon_{ij} \sim \text{N}(0, \sigma_\epsilon^2) \]

That is, each group has a common slope \(\beta_1\) plus a y-intercept equal to \(\beta_0 + \alpha_j\). To see the relationship between this model and its implementation in lmer(), let’s separate the fixed effects and the random effect for group, such that

\[ y_{ij} = \beta_0 + \beta_1 x_{ij} + \alpha_j + \epsilon_{ij} \\ \]

Assuming the data y, predictor x and group ID group are contained in data_frame, we would fit this model using

lmer(y ~ 1 + x + (1 | group), data = data_frame)

where the leading 1 + x represents the fixed intercept and slope \((\beta_0, \beta_1)\) and (1 | group) represents the group-level offset \(\alpha_j\). Just as with lm(), we can drop the leading 1 + to make the global intercept implicit, such that

lmer(y ~ x + (1 | group), data = data_frame)

Random intercepts & slopes

Now consider a model for an observation \(i\) from group \(j\) \((y_{ij})\) that includes random effects for both the effect (slope) of \(x\) \((\beta_1 + \delta_j)\) and the group-level intercepts \((\beta_0 + \alpha_j)\), such that

\[ y_{ij} = (\beta_0 + \alpha_j) + (\beta_1 + \delta_j) x_{ij} + \epsilon_{ij} \\ \alpha_{j} \sim \text{N}(0, \sigma_\alpha^2) \\ \delta_{j} \sim \text{N}(0, \sigma_\delta^2) \\ \epsilon_{ij} \sim \text{N}(0, \sigma_\epsilon^2) \]

Here again, rewriting this model by separating the fixed and random terms makes it easier to see how it aligns with the lmer() notation.

\[ \begin{aligned} y_{ij} &= (\beta_0 + \alpha_j) + (\beta_1 + \delta_j) x_{ij} + \epsilon_{ij} \\ &= \beta_0 + \alpha_j + \beta_1 x_{ij} + \delta_j x_{ij} + \epsilon_{ij} \\ &= \beta_0 + \beta_1 x_{ij} + (\alpha_j + \delta_j x_{ij}) + \epsilon_{ij} \\ \end{aligned} \]

Assuming the data y, predictor x and group ID group are contained in data_frame, we would fit this model using

lmer(y ~ 1 + x + (1 + x || group), data = data_frame)

where the 1 + x represents the fixed intercept and slope and (1 + x || group) represents the \(\alpha_j\) and \(\delta_j\) (ie, random intercepts and slopes for each group). Here we use the double pipe || in the random effect because we assume the random intercepts and slopes are uncorrelated with one another; a single pipe | would also estimate their correlation. Just as with lm(), we can drop the 1 + for both the FE and RE terms, such that

lmer(y ~ x + (x || group), data = data_frame)
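As a concrete illustration of the difference between | and ||, here is a sketch with simulated data (all names hypothetical) showing that the double-pipe model drops the correlation parameter:

```r
library(lme4)

set.seed(1)

## simulate 20 groups of 15 observations each
n_grp <- 20
n_obs <- 15
g <- factor(rep(seq(n_grp), each = n_obs))
x <- rep(seq(n_obs), times = n_grp)

## group-level random intercepts & slopes (simulated as independent)
a <- rnorm(n_grp, 0, 1)
d <- rnorm(n_grp, 0, 0.2)
y <- 2 + a[g] + (0.5 + d[g]) * x + rnorm(n_grp * n_obs, 0, 1)

## single pipe: also estimates the intercept-slope correlation
m_corr <- lmer(y ~ x + (x | g))
## double pipe: intercept & slope assumed uncorrelated
m_ind <- lmer(y ~ x + (x || g))

## the double-pipe model has one fewer variance parameter
nrow(as.data.frame(VarCorr(m_corr)))  ## 4 rows: int, slope, corr, resid
nrow(as.data.frame(VarCorr(m_ind)))   ## 3 rows: int, slope, resid
```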

Model fitting & evaluation

Before we fit any models, let’s begin by thinking about the data. In this case we have surveys from different locations around Puget Sound, and those locations are but a few of the possible sites that might have been surveyed (ie, they do not represent the entirety of available habitat). Thus, it sounds like we should treat site as a random effect.

These data were also collected over time, which suggests that perhaps a model with an explicit accounting for time would be good. One option would be to treat year as a random effect, as we’re not particularly interested in the specific effect of any one year.

Random effect of site

Let’s first fit a model with a random effect for site \(s\) only, ignoring the temporal aspect of the data. This is akin to the ANOVA model above, such that

\[ y_{s,t} = \mu + \alpha_s + \epsilon_{s,t} \\ \alpha_{s} \sim \text{N}(0, \sigma_\alpha^2) \\ \epsilon_{s,t} \sim \text{N}(0, \sigma_\epsilon^2) \]

and \(\alpha_s\) is the random effect of site.

## load lme4 package
library(lme4)

## fit model with RE for site
mod_site <- lmer(index ~ 1 + (1 | site), data = df)

## model summary
summary(mod_site)
## Linear mixed model fit by REML ['lmerMod']
## Formula: index ~ 1 + (1 | site)
##    Data: df
## 
## REML criterion at convergence: 987.2
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.59392 -0.70832 -0.06334  0.65978  2.72398 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  site     (Intercept)  4.166   2.041   
##  Residual             17.437   4.176   
## Number of obs: 171, groups:  site, 9
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  24.0351     0.7516   31.98

Notice that this model summary info is somewhat different from what we’ve been seeing from lm(). In particular, there are 2 different blocks of results: one for the random effect of site and the residual errors, and another for the fixed effect of a global mean (Intercept).

Also notice that there are no \(p\)-values in this output. As we discussed in lecture, this is because any assumptions about the distributional form of our null hypothesis tests are likely to be violated when including random effects in the model.

Estimates of random effects

Normally we would use coef() to extract the estimated coefficients, but for a mixed model it returns the sum \(\mu + \alpha_s\) for each site. If we just want the \(\alpha_s\), we can instead use ranef().

## mu + alpha
coef(mod_site)$site
##     (Intercept)
## C_1    28.06491
## C_2    24.13141
## C_3    22.72536
## N_1    24.39020
## N_2    22.98414
## N_3    22.56146
## S_1    25.73155
## S_2    22.46226
## S_3    23.26449
## alpha only
ranef(mod_site)$site
##     (Intercept)
## C_1  4.02982110
## C_2  0.09632466
## C_3 -1.30972780
## N_1  0.35510732
## N_2 -1.05094514
## N_3 -1.47362349
## S_1  1.69646411
## S_2 -1.57282351
## S_3 -0.77059726
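To confirm the relationship between the two functions, note that coef() is simply fixef() plus ranef(). Here is a sketch with simulated data (the identity holds for any fitted model; all names are hypothetical):

```r
library(lme4)

set.seed(7)

## hypothetical grouped data: 10 groups of 8 observations
group <- factor(rep(1:10, each = 8))
y <- rnorm(10, 20, 3)[group] + rnorm(80)
fit <- lmer(y ~ 1 + (1 | group))

## coef() = fixef() + ranef(), element by element
all.equal(coef(fit)$group[, 1],
          fixef(fit)[1] + ranef(fit)$group[, 1],
          check.attributes = FALSE)
## [1] TRUE
```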

Correlation among random effects

As we learned in lecture, we can estimate the correlation among the random effects as

\[ \rho = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\epsilon^2} \]

which we can compute from the estimates in the above table. The function VarCorr() will return the results, but they are a bit hard to access directly, so we can embed that call inside of as.data.frame().

## get var(RE)
(var_re_site <- as.data.frame(VarCorr(mod_site)))
##        grp        var1 var2      vcov    sdcor
## 1     site (Intercept) <NA>  4.166043 2.041089
## 2 Residual        <NA> <NA> 17.436888 4.175750
## variance of random effects
sigma2_alpha <- var_re_site$vcov[1]
## variance of residuals
sigma2_epsilon <- var_re_site$vcov[2]
## calculate the correlation among RE's
rho <- sigma2_alpha / (sigma2_alpha + sigma2_epsilon)
round(rho, 2)
## [1] 0.19

It looks like there is not a lot of correlation among the random effects. We should also note that \(\rho\) (often called the intraclass correlation) provides us with an estimate of the proportion of the variance explained by the random effects (similar to \(R^2\) for the fixed effects).

Random effect of year

Now let’s fit another model that only includes the random effect of year \(t\), where

\[ y_{s,t} = \mu + \alpha_t + \epsilon_{s,t} \\ \alpha_{t} \sim \text{N}(0, \sigma_\alpha^2) \\ \epsilon_{s,t} \sim \text{N}(0, \sigma_\epsilon^2) \]

and \(\alpha_t\) is now the random effect of year.

## fit model with RE for year
mod_year <- lmer(index ~ 1 + (1 | year), data = df)

## model summary
summary(mod_year)
## Linear mixed model fit by REML ['lmerMod']
## Formula: index ~ 1 + (1 | year)
##    Data: df
## 
## REML criterion at convergence: 989.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0624 -0.6200 -0.1094  0.5171  2.8591 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  year     (Intercept)  4.528   2.128   
##  Residual             16.847   4.104   
## Number of obs: 171, groups:  year, 19
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  24.0351     0.5804   41.41

Estimates of random effects

Here are the estimates of the \(\alpha_t\).

## alpha only
ranef(mod_year)$year
##     (Intercept)
## 1  -3.586079585
## 2  -1.660015149
## 3  -1.455616474
## 4  -2.029505061
## 5  -1.251217799
## 6  -1.770075974
## 7   0.077373587
## 8   0.879245312
## 9  -1.266940774
## 10 -0.009102775
## 11  0.266049287
## 12  0.674846637
## 13 -0.873866399
## 14  1.083643987
## 15  3.269137510
## 16  2.435819836
## 17  1.154397374
## 18  1.846208274
## 19  2.215698186

Do you notice anything about these random effects for year? They appear to be increasing over time, such that the estimate for 2000 is -3.59 and the estimate for 2018 is 2.22.

Correlation among random effects

Here is the correlation among the random effects for year.

## get var(RE)
(var_re_year <- as.data.frame(VarCorr(mod_year)))
##        grp        var1 var2      vcov    sdcor
## 1     year (Intercept) <NA>  4.528355 2.127993
## 2 Residual        <NA> <NA> 16.846564 4.104457
## variance of random effects
sigma2_alpha <- var_re_year$vcov[1]
## variance of residuals
sigma2_epsilon <- var_re_year$vcov[2]
## calculate the correlation among RE's
rho <- sigma2_alpha / (sigma2_alpha + sigma2_epsilon)
round(rho, 2)
## [1] 0.21

Here, too, there is not a lot of correlation among the random effects.

Random effects for site & year

Now let’s fit a model that includes random effects for both site and year, where

\[ y_{s,t} = \mu + \alpha_s + \alpha_t + \epsilon_{s,t} \\ \alpha_{s} \sim \text{N}(0, \sigma_{\alpha_s}^2) \\ \alpha_{t} \sim \text{N}(0, \sigma_{\alpha_t}^2) \\ \epsilon_{s,t} \sim \text{N}(0, \sigma_\epsilon^2) \]

and \(\alpha_s\) and \(\alpha_t\) are the random effects of site and year, respectively.

## model with RE's for both site and year
mod_site_year <- lmer(index ~ 1 + (1 | site) + (1 | year), df)

## model summary
summary(mod_site_year)
## Linear mixed model fit by REML ['lmerMod']
## Formula: index ~ 1 + (1 | site) + (1 | year)
##    Data: df
## 
## REML criterion at convergence: 959.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.97613 -0.70823 -0.08333  0.58907  2.70628 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  year     (Intercept)  5.021   2.241   
##  site     (Intercept)  4.430   2.105   
##  Residual             12.416   3.524   
## Number of obs: 171, groups:  year, 19; site, 9
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  24.0351     0.9106    26.4

Here we can see that the residual variance \(\sigma_\epsilon^2\) has decreased quite a bit compared to our other models. We can check how much of the total variance is explained by the combination of the two random effects as follows:

## get var(RE)
var_re_site_year <- as.data.frame(VarCorr(mod_site_year))
## variance of site random effects
sigma2_alpha_s <- var_re_site_year$vcov[1]
## variance of year random effects
sigma2_alpha_t <- var_re_site_year$vcov[2]
## variance of residuals
sigma2_epsilon <- var_re_site_year$vcov[3]
## calculate the correlation among RE's
rho <- (sigma2_alpha_s + sigma2_alpha_t) / 
       (sigma2_alpha_s + sigma2_alpha_t + sigma2_epsilon)
round(rho, 2)
## [1] 0.43

Estimates of random effects

Here are the estimates of the \(\alpha_s\) and \(\alpha_t\).

## alpha only
ranef(mod_site_year)
## $year
##    (Intercept)
## 1  -3.97590361
## 2  -1.84046674
## 3  -1.61384895
## 4  -2.25012198
## 5  -1.38723116
## 6  -1.96249171
## 7   0.08578447
## 8   0.97482349
## 9  -1.40466330
## 10 -0.01009229
## 11  0.29497012
## 12  0.74820570
## 13 -0.96885986
## 14  1.20144128
## 15  3.62450841
## 16  2.70060512
## 17  1.27988590
## 18  2.04689995
## 19  2.45655519
## 
## $site
##     (Intercept)
## C_1   4.2854237
## C_2   0.1024343
## C_3  -1.3928009
## N_1   0.3776310
## N_2  -1.1176043
## N_3  -1.5670922
## S_1   1.8040671
## S_2  -1.6725842
## S_3  -0.8194745

Likelihood ratio tests

We can use likelihood ratio tests to determine whether there is support for any given random effect in a model. These tests require a simulation-based approach that involves a search over a large gridded space of ratios of \(\sigma_\alpha^2 / \sigma_\epsilon^2\). For more information, see Crainiceanu & Ruppert (2004).

To conduct these tests, we’ll use the exactRLRT() function from the RLRsim package.

Single random effect

The first option compares a model with a single random effect to an identical model that lacks the random effect. The null hypothesis for this test is \(H_0 : \text{Var}(\alpha) = \sigma_\alpha^2 = 0\). The test statistic is given by

\[ \lambda = 2 \log \mathcal{L}(\sigma_\alpha^2 \neq 0) - 2 \log \mathcal{L}(\sigma_\alpha^2 = 0) \]

Let’s go ahead and test our 2 models with random effects for site and year. To test each single random effect, we will compare the model containing it against the full model with both random effects. To do so, we need 3 different models:

  1. model with single RE of interest to be tested (model_A)

  2. full model with 2+ RE’s (model_AB)

  3. full model minus the RE in model (1) (model_B)

To conduct the test we use exactRLRT(model_A, model_AB, model_B).

## load RLRsim package
library(RLRsim)

## test `site` model
exactRLRT(mod_site, mod_site_year, mod_year)
## 
##  simulated finite sample distribution of RLRT.
##  
##  (p-value based on 10000 simulated values)
## 
## data:  
## RLRT = 29.969, p-value < 2.2e-16

We can clearly reject \(H_0 : \sigma_{\alpha_s}^2 = 0\) and conclude that there is support for inclusion of the site random effect.

Now let’s go ahead and test the model with a single random effect for year.

## test `year` model
exactRLRT(mod_year, mod_site_year, mod_site)
## 
##  simulated finite sample distribution of RLRT.
##  
##  (p-value based on 10000 simulated values)
## 
## data:  
## RLRT = 27.39, p-value < 2.2e-16

We can clearly reject \(H_0 : \sigma_{\alpha_t}^2 = 0\) and conclude that there is support for inclusion of the year random effect.

Multiple random effects

We can use a bootstrapping approach to test for evidence against including multiple random effects in the same model. Our approach follows this pseudo-code:

  1. fit null model with no random effects

  2. calculate likelihood ratio statistic

  3. simulate data from the null model in (1)

  4. fit simple & full model to data from (3)

  5. calculate likelihood ratio (difference in log-likelihood)

  6. repeat steps 3-5 many times

  7. see where statistic from (2) falls within estimated distribution from (3-6)

Importantly, we need to specify that the random effects model be fit using maximum likelihood (ML) instead of the default restricted maximum likelihood (REML).

## set random seed to make it reproducible
set.seed(514)

## Step 1: fit null model with no RE's using `lm()`
null_model <- lm(index ~ 1, data = df)

## Step 2: calculate likelihood ratio (ie, difference in log-likelihood)
lambda <- 2 * (logLik(mod_site_year) - logLik(null_model))

## number of bootstrapped samples
nb <- 1000
## empty vector for storing LRT statistics
LRT_boot <- rep(NA, nb)
## do bootstrapping
for(i in 1:nb) {
  ## Step 3: simulate data from null model
  sim_data <- unlist(simulate(null_model))
  ## Step 4: fit null and RE models to sim data
  m_null <- lm(sim_data ~ 1)
  m_alt <- lmer(sim_data ~ 1 + (1 | site) + (1 | year), data = df, REML = FALSE)
  ## Step 5: calculate likelihood ratio
  LRT_boot[i] <- as.numeric(2*(logLik(m_alt) - logLik(m_null)))
  ## Step 6: repeat this `nb` times
}
## Step 7: calculate approximate p-value
mean(LRT_boot > lambda)
## [1] 0

Here it appears as though none of the 1000 bootstrapped samples had a test statistic larger than the original value of 46.4.
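Because a bootstrapped p-value of exactly 0 is an artifact of the finite number of simulations, a common convention (an add-one adjustment, not part of this lab's original analysis) is to report a value bounded away from zero:

```r
## add-one adjusted bootstrap p-value; with 0 exceedances out of
## 1000 simulations this gives an upper bound rather than an exact zero
n_exceed <- 0   ## hypothetical count of bootstrap statistics > lambda
nb <- 1000
(p_boot <- (n_exceed + 1) / (nb + 1))
```

This says only that p < 0.001, which is all the bootstrap can tell us here.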

Model diagnostics

We would be remiss if we did not do some diagnostic checks on our model. One of our assumptions is that the residuals \(\epsilon\) and random effects \(\alpha\) are normally distributed, which we can check with Q-Q plots. Let’s do some diagnostic checks for the model with random effects for both site and year.

Q-Q plots

## set plot area
par(mai = c(0.9, 0.9, 0.6, 0.1),
    omi = c(0, 0, 0, 0),
    mfrow = c(1,2), cex.lab = 1.2)

## qq resids
qqnorm(residuals(mod_site_year), main = "QQ plot (residuals)", las = 1, pch = 16)
qqline(residuals(mod_site_year))

## qq RE's
qqnorm(unlist(ranef(mod_site_year)), main = "QQ plot (RE's)", las = 1, pch = 16)
qqline(unlist(ranef(mod_site_year)))

These plots both look pretty good. Perhaps there is some deviation from normality for the random effects, but there are not many values from which to ascertain whether or not the assumption of normality has been violated.

Residuals versus fitted

We can also plot the model residuals against the fitted values to look for evidence of heteroscedasticity or non-linearity in the residuals.

## resids vs fitted
plot(fitted(mod_site_year), residuals(mod_site_year), las = 1, pch = 16,
     xlab = "Fitted", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = "dashed")

This residual plot looks promising in that there are no obvious patterns to the residuals.

Autocorrelation

Because these data were collected over time, we should be aware of possible autocorrelation among the residuals. It would be a bit messy to create plots for all 9 of the time series, so we’ll just get a table of the results from acf() and see whether any of them exceed the critical value given by

\[ 0 \pm \frac{z_{\alpha/2}}{\sqrt{n}} \]

where \(z_{\alpha/2}\) is the \(1-\alpha/2\) quantile of the standard normal distribution. For example, if \(\alpha\) = 0.05, then \(z_{\alpha/2}\) = 1.96. Here we’ll only examine correlations out to a lag of 5 years because it’s unlikely that counts in this year would be related to counts 6 or more years in the past (and ideally they would not be related to counts in any past year).

## Type-I error
alpha_crit <- 0.05

## threshold value for rho (correlation)
(rho_crit <- qnorm(1 - alpha_crit/2) / sqrt(n_yrs))
## [1] 0.4496466
## rearrange residuals into matrix (rows = years, columns = sites)
## note `byrow = TRUE` b/c `df` is ordered by year, then site
rr <- matrix(residuals(mod_site_year), n_yrs, n_sites, byrow = TRUE)

## get ACF
ACF <- apply(rr, 2, acf, lag.max = 5, plot = FALSE)
ACF <- lapply(ACF, function(x) x$acf)
## convert list to matrix; don't need row 1 b/c rho_0 = 1
ACF <- do.call(cbind, ACF)[-1,]

## check if any |values| > rho_crit
any(abs(ACF) > rho_crit)
## [1] FALSE

It looks like the random effect for year \(\alpha_t\) must have done an adequate job of accounting for any autocorrelation in the data.

Mixed model with a trend

Both our plot of the raw data and the random effects for year showed evidence of an increasing trend in the index of abundance. Let’s now fit a model that includes a fixed effect for year to account for this apparent trend, plus random effects for both site and year. At first glance, it might seem problematic to include both a fixed and random effect for year, but we have multiple sites sampled per year, so the fixed effect will give us the overall linear trend across all sites, and the random effect will allow for additional variation among years (ie, all sites in a given year may be greater or less than the long-term average). Specifically, the model we want to fit is

\[ y_{s,t} = \beta_0 + \beta_1 t + \alpha_s + \alpha_t + \epsilon_{s,t} \\ \alpha_{s} \sim \text{N}(0, \sigma_{\alpha_s}^2) \\ \alpha_{t} \sim \text{N}(0, \sigma_{\alpha_t}^2) \\ \epsilon_{s,t} \sim \text{N}(0, \sigma_\epsilon^2) \]

Let’s fit this model and compare it to the comparable model we fit earlier without a fixed effect of year.

## model with RE's for both site and year
mod_site_year_f <- lmer(index ~ 1 + year + (1 | site) + (1 | year), df)

## model summary
summary(mod_site_year_f)
## Linear mixed model fit by REML ['lmerMod']
## Formula: index ~ 1 + year + (1 | site) + (1 | year)
##    Data: df
## 
## REML criterion at convergence: 938.5
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.03292 -0.76397 -0.07034  0.58756  2.60509 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  year     (Intercept)  0.2758  0.5252  
##  site     (Intercept)  4.4303  2.1048  
##  Residual             12.4163  3.5237  
## Number of obs: 171, groups:  year, 19; site, 9
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 20.12690    0.93263  21.581
## year         0.39082    0.05389   7.252
## 
## Correlation of Fixed Effects:
##      (Intr)
## year -0.578

You can see from the summary info above that the fixed effect of year is ~0.39, which means the index of abundance is increasing by ~4 units per decade.
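Because our time index increments by 1 per calendar year, the slope converts directly to other horizons. A quick check using the estimate reported above:

```r
## fixed-effect slope for year (from the model summary above)
beta_1 <- 0.39082

## implied change in the abundance index over longer horizons
beta_1 * 10  ## ~3.9 units per decade
beta_1 * 18  ## ~7.0 units over the 19-year study (18 annual steps)
```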

Model comparison

In this case we can use AIC to directly compare these models because they only differ with respect to the fixed effect for year. However, before doing so, we need to refit both of the models using ML rather than REML.

## refit site & year model
mod_sy_ml <- lmer(index ~ 1 + (1 | site) + (1 | year), df, REML = FALSE)
## refit 
mod_syf_ml <- lmer(index ~ 1 + year + (1 | site) + (1 | year), df, REML = FALSE)
## get AIC for both models
AIC(mod_sy_ml)
AIC(mod_syf_ml)
## [1] 969.4638
## [1] 945.701

The AIC for the model with a fixed effect of year is ~24 units lower than that of the simpler model, which is indeed strong evidence for the inclusion of year as a fixed effect. Another option for comparing nested models with the same random effects structure is a \(\chi^2\) test via anova(), which has the added advantage of providing AIC and BIC values as well. To do so, we use anova(reduced model, full model).

## LRT for fixed effect of `year`
anova(mod_sy_ml, mod_syf_ml)
## Data: df
## Models:
## mod_sy_ml: index ~ 1 + (1 | site) + (1 | year)
## mod_syf_ml: index ~ 1 + year + (1 | site) + (1 | year)
##            Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## mod_sy_ml   4 969.46 982.03 -480.73   961.46                             
## mod_syf_ml  5 945.70 961.41 -467.85   935.70 25.763      1  3.861e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

General considerations

Here are some general guidelines to follow when fitting and evaluating mixed effects models.

Fixed effects

You should use maximum likelihood (ML) when comparing models with different fixed effects, because REML likelihoods depend on the fixed effects structure and therefore are not comparable across models with different fixed effects. That said, you should ultimately report the parameter estimates from your final “best” model as obtained via REML, because ML estimates of the variances of the random effects tend to be biased low.

Random effects

You can use model selection to help you decide which random effects to retain, but in general, random effects should be based upon your knowledge of the system. For example, is there pseudoreplication? Are the data part of a time series or spatial sampling design? As we saw above, we use REML estimators to compare models with different random effects.

Entire model selection

The most important thing is to not vary both random and fixed effects at the same time. You should address one of the components and then the other, often in a back-and-forth manner. Here is the model selection process recommended by Zuur et al. (2009):

  • fit the most complex model you can envision based on your possible covariates and random effects

  • sort out the random effects structure using REML-based inference via AIC or likelihood ratio tests

  • sort out the fixed effects structure while keeping your random effects constant using REML and \(F\)- or \(t\)-tests, or compare nested ML models via AIC

  • present the results of your final model using REML estimation