Background

These lab exercises will demonstrate some of the options for model selection that we discussed in lecture, which include both in-sample and out-of-sample methods. We will return to the plant species richness data from the Galapagos Archipelago contained in the gala dataset in the faraway package. As a reminder, here are the variables:

  • Species: the number of plant species found on the island

  • Endemics: the number of endemic species

  • Area: the area of the island (km\(^2\))

  • Elevation: the highest elevation of the island (m)

  • Nearest: the distance from the nearest island (km)

  • Scruz: the distance from Santa Cruz island (km)

  • Adjacent: the area of the adjacent island (km\(^2\))

We will ignore the number of endemic species (Endemics) and focus on possible models to explain the total number of species (Species) as a function of the 5 remaining covariates (predictors). Before diving right in, let’s consider the possible correlations among these covariates, as we want to avoid including any two covariates in the same model if they are highly correlated.

## get data
data(gala, package = "faraway")
dat <- gala
head(dat)
##              Species Endemics  Area Elevation Nearest Scruz Adjacent
## Baltra            58       23 25.09       346     0.6   0.6     1.84
## Bartolome         31       21  1.24       109     0.6  26.3   572.33
## Caldwell           3        3  0.21       114     2.8  58.7     0.78
## Champion          25        9  0.10        46     1.9  47.4     0.18
## Coamano            2        1  0.05        77     1.9   1.9   903.82
## Daphne.Major      18       11  0.34       119     8.0   8.0     1.84
## check correlations
round(cor(dat[,3:7]), 2)
##            Area Elevation Nearest Scruz Adjacent
## Area       1.00      0.75   -0.11 -0.10     0.18
## Elevation  0.75      1.00   -0.01 -0.02     0.54
## Nearest   -0.11     -0.01    1.00  0.62    -0.12
## Scruz     -0.10     -0.02    0.62  1.00     0.05
## Adjacent   0.18      0.54   -0.12  0.05     1.00

There appears to be high correlation between Area and Elevation (\(\rho\) = 0.75), and reasonably high correlation between Nearest and Scruz (\(\rho\) = 0.62), so let’s be careful to keep those variables separate from one another in our models.

Creating a model set

We need to create a set of all the models we’d like to fit and evaluate for their data support. To do so, we can make use of the expand.grid() function in R, which creates a data frame from all combinations of the supplied vectors (or of the columns of a supplied data frame). We can then use the rows of this data frame as a set of indices for which covariates to include/exclude from our models.

For example, say we had 2 covariates (\(A\) and \(B\)) that we wanted to use as predictors, and we wanted all possible combinations of them in our models. To do so, we create a Boolean indicator (T/F) for the inclusion/exclusion of each covariate.

## possible covariates
## listing FALSE first builds from fewest to most
df <- data.frame(A = c(FALSE, TRUE), B = c(FALSE, TRUE))
## all possible combinations
expand.grid(df)
##       A     B
## 1 FALSE FALSE
## 2  TRUE FALSE
## 3 FALSE  TRUE
## 4  TRUE  TRUE

Here you can see that there are 4 possible combinations of the 2 predictors: none (intercept-only), only A, only B, both A & B.

Model set for species diversity

Let’s go ahead and create the set of all possible combinations of our 5 covariates in the gala data. Afterwards, we can remove those rows where both Area and Elevation, or both Nearest and Scruz, occur in the same model.

## data frame specifying predictors to include
df <- as.data.frame(matrix(c(FALSE, TRUE), 2, 5))
## add col names
cov_names <- colnames(df) <- colnames(gala)[3:7]

## create set of all possible combinations
full_set <- expand.grid(df)

## rows where both `Area` & `Elevation` or both `Nearest` & `Scruz` are TRUE
ii <- which(full_set$Area + full_set$Elevation == 2 |
              full_set$Nearest + full_set$Scruz == 2)
## create reduced set of models
## converting to a matrix for easier indexing
use_set <- as.matrix(full_set[-ii,])

## number of models in our set
(n_mods <- nrow(use_set))
## [1] 18

In-sample selection

We saw in class that we could use AIC, BIC, or both for in-sample model selection. We can set up a routine to do the following:

  • loop over all possible model combinations
    • fit each model
    • calculate and store the information criteria

To fit each model, we need to pass lm() a formula object, which we can build from the covariate names in our data frame of possible models.

Fit models

## empty matrix for storing results
mod_res <- matrix(NA, n_mods, 2)
colnames(mod_res) <- c("AIC", "BIC")

## fit models & store AIC & BIC
for(i in 1:n_mods) {
  if(i == 1) {
    fmla <- "Species ~ 1"
  } else {
    fmla <- paste("Species ~", paste(cov_names[use_set[i,]], collapse = " + "))
  }
  mod_fit <- lm(as.formula(fmla), data = gala)
  mod_res[i,"AIC"] <- AIC(mod_fit)
  mod_res[i,"BIC"] <- BIC(mod_fit)
}

Calculate \(\Delta\)IC

We saw in class that it’s easier to interpret the information criteria if we adjust them to \(\Delta\)IC values, such that

\[ \Delta IC = IC - \min IC \]

## empty matrix for storing results
delta_res <- matrix(NA, n_mods, 2)
colnames(delta_res) <- c("deltaAIC", "deltaBIC")

## convert IC to deltaIC
delta_res[,"deltaAIC"] <- mod_res[,"AIC"] - min(mod_res[,"AIC"])
delta_res[,"deltaBIC"] <- mod_res[,"BIC"] - min(mod_res[,"BIC"])

## round them for easier viewing
(delta_res <- round(delta_res, 2))
##       deltaAIC deltaBIC
##  [1,]    36.13    33.33
##  [2,]    23.71    22.31
##  [3,]    14.49    13.09
##  [4,]    38.13    36.73
##  [5,]    25.56    25.56
##  [6,]    16.48    16.48
##  [7,]    37.24    35.84
##  [8,]    25.12    25.12
##  [9,]    14.75    14.75
## [10,]    38.11    36.71
## [11,]    25.34    25.34
## [12,]     0.00     0.00
## [13,]    40.11    40.11
## [14,]    27.24    28.64
## [15,]     1.53     2.93
## [16,]    39.20    39.20
## [17,]    26.81    28.21
## [18,]     0.04     1.44

Based on AIC, model 12 is the best of the set, but model 18 has nearly identical support from the data, and model 15 is within 2 units as well. Based on BIC, model 12 is the best of the set, with model 18 being within 2 units as well. Let’s see which predictors are in these models.

## "best" models from our set
cov_names[use_set[12,]]
cov_names[use_set[15,]]
cov_names[use_set[18,]]
## [1] "Elevation" "Adjacent" 
## [1] "Elevation" "Nearest"   "Adjacent" 
## [1] "Elevation" "Scruz"     "Adjacent"

All 3 of these models include both Elevation and Adjacent as predictors, so they are clearly important. There is some evidence that Nearest and Scruz are also important, but they were pretty highly correlated, so they appear to be trading off with one another. Although models 15 and 18 are within 2-3 units of model 12, here we would choose model 12 over the others because it has only 2 predictors, making it the most parsimonious.
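
If we want to see that trade-off directly, we could fit models 15 and 18 on their own and compare their coefficients. Here is a quick sketch (the object names m15 and m18 are just ours):

## fit the two “runner-up” models directly
m15 <- lm(Species ~ Elevation + Nearest + Adjacent, data = gala)
m18 <- lm(Species ~ Elevation + Scruz + Adjacent, data = gala)
## compare how `Nearest` and `Scruz` enter their respective models
coef(m15)
coef(m18)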

Best model

Let’s take a look at the summary information for model 12 with Elevation and Adjacent as predictors.

m12 <- lm(Species ~ Elevation + Adjacent, data = gala)
faraway::sumary(m12)
##              Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  1.432872  15.024687  0.0954 0.9247269
## Elevation    0.276568   0.031763  8.7074 2.533e-09
## Adjacent    -0.068886   0.015490 -4.4471 0.0001344
## 
## n = 30, p = 3, Residual SE = 60.85898, R-Squared = 0.74

This looks like a pretty good model based upon its \(R^2\) value, but we should really check our model residuals for possible violations of our model assumptions, check leverages, look for outliers, etc.
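
Here is a hedged sketch of what some of those checks might look like, using only base R functions (the 2p/n cutoff for leverages and the ±2 cutoff for studentized residuals are just common rules of thumb):

## residuals vs fitted values (look for non-constant variance or curvature)
plot(fitted(m12), residuals(m12),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
## Q-Q plot of the residuals (check the normality assumption)
qqnorm(residuals(m12))
qqline(residuals(m12))
## leverages (hat values); flag any above ~2p/n
hv <- hatvalues(m12)
hv[hv > 2 * length(coef(m12)) / nrow(gala)]
## studentized residuals as a rough check for outliers
rstudent(m12)[abs(rstudent(m12)) > 2]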

Akaike weights

We saw in lecture that we can compute the likelihood of a given model as

\[ \mathcal{L}(y; f_i) \propto \exp \left( - \frac{1}{2} \Delta_i \right) \]

Because the model likelihoods are all relative (just as with other likelihoods), we can create a set of normalized Akaike weights that sum to 1

\[ w_i = \frac{\exp \left( - \frac{1}{2} \Delta_i \right)}{\sum_{s = 1}^S \exp \left( - \frac{1}{2} \Delta_s \right)} \]

We can then compare the support for any model \(i\) over another model \(j\) via evidence ratios

\[ ER_{ij} = \frac{\mathcal{L}(y; f_i)}{\mathcal{L}(y; f_j)} = \frac{w_i}{w_j} \]

Most often, we want to compare lower-ranked models to the best, which we can simplify to

\[ \begin{aligned} ER_{1j} &= \frac{\exp \left( - \frac{1}{2} 0 \right)}{\exp \left( - \frac{1}{2} \Delta_j \right)} \\ &= \frac{1}{\exp \left( - \frac{1}{2} \Delta_j \right)} \\ &= \exp \left( \frac{1}{2} \Delta_j \right) \end{aligned} \]

Let’s go ahead and calculate our model weights and the evidence ratios compared to our best model (#12).

## numerator
num <- exp(-0.5 * delta_res[,"deltaAIC"])
## denominator
dem <- sum(num)
## Akaike weights
wts <- num / dem
## evidence ratios
ER <- exp(0.5 * delta_res[,"deltaAIC"])
## data frame with our results
data.frame(model = seq(n_mods),
           weights = round(wts, 3),
           ER = floor(ER))
##    model weights        ER
## 1      1   0.000  70069628
## 2      2   0.000    140786
## 3      3   0.000      1401
## 4      4   0.000 190468998
## 5      5   0.000    355045
## 6      6   0.000      3789
## 7      7   0.000 122057157
## 8      8   0.000    284930
## 9      9   0.000      1595
## 10    10   0.000 188573799
## 11    11   0.000    318061
## 12    12   0.409         1
## 13    13   0.000 512596733
## 14    14   0.000    822414
## 15    15   0.190         2
## 16    16   0.000 325215956
## 17    17   0.000    663311
## 18    18   0.401         1

Clearly the evidence against most of the models is very strong, as their weights are ~0 and the evidence ratios in favor of model #12 are enormous.

Corrected AIC

We learned in lecture that AIC is biased when the sample size is small, but we can account for this by using a corrected form (AICc) given by

\[ AICc = AIC + \frac{2 k (k + 1)}{n - k - 1} \]

Our sample size for this analysis is 30, so it’s worth seeing how the results would differ if we used AICc instead.

## number of parameters in each model
k <- 1 + apply(use_set, 1, sum)
## calculate penalty term
pterm <- (2 * k * (k + 1)) / (nrow(gala) - k - 1)
## calculate AICc
aicc <- mod_res[,"AIC"] + pterm
## compare delta-values for both
data.frame(deltaAIC = round(delta_res[,"deltaAIC"], 1),
           deltaAICc = round(aicc - min(aicc), 1))
##    deltaAIC deltaAICc
## 1      36.1      35.4
## 2      23.7      23.2
## 3      14.5      14.0
## 5      38.1      37.6
## 6      25.6      25.6
## 7      16.5      16.5
## 9      37.2      36.8
## 10     25.1      25.1
## 11     14.8      14.8
## 17     38.1      37.6
## 18     25.3      25.3
## 19      0.0       0.0
## 21     40.1      40.1
## 22     27.2      27.9
## 23      1.5       2.2
## 25     39.2      39.2
## 26     26.8      27.5
## 27      0.0       0.7

Here we see very little difference between the \(\Delta\)-values. (Note that the row labels in this table are carried over from the rows of the original full_set data frame, so they don’t match our model numbers 1-18, but the ordering of the models is the same as above.)

Out-of-sample selection

We saw in lecture that we can use different forms of cross-validation for out-of-sample model selection, which include exhaustive and non-exhaustive methods. Let’s repeat our evaluation of the models in our candidate set using both of these approaches. Before doing so, we also need to choose one of the options for evaluating our predictions. Here we’ll use the mean squared prediction error (MSPE), which is perhaps the most common. Recall that for a model fit to \(n - q\) data points, the MSPE for the remaining \(q\) data points is

\[ MSPE = \frac{\sum_{i = 1}^q (y_i - \hat{y}_i)^2}{q} \]
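
As a small sketch, this calculation could be wrapped in a helper function (the name mspe is ours, not from any package):

## simple MSPE helper
## `y` holds the held-out observations; `y_hat` holds their predictions
mspe <- function(y, y_hat) {
  sum((y - y_hat)^2) / length(y)
}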

Exhaustive cross-validation

Recall that exhaustive cross-validation works via a “leave-\(q\)-out” procedure, where we treat \(n - q\) data points as the “training” (fitting) data and use the remaining \(q\) data points for evaluating the predictions. If \(q > 1\) and \(n\) is even somewhat large, this can be prohibitively slow because there are \(\left( \begin{matrix} n \\ q \end{matrix} \right)\) combinations. For example, if \(q = 3\) and \(n = 20\) there are \(\left( \begin{matrix} 20 \\ 3 \end{matrix} \right)\) = 1140 different combinations. Therefore, here we’ll use \(q = 1\), which gives us the familiar “leave-one-out” (LOO) method and results in \(n\) different fits for each model in our set.
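
We can check these counts directly with R’s choose() function. For example:

## number of leave-q-out combinations for a few cases
choose(20, 3)         ## the example above: 1140 combinations
choose(nrow(gala), 1) ## leave-one-out with n = 30: 30 fits per model
choose(nrow(gala), 3) ## leave-three-out with n = 30: 4060 fits per model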

Leave-one-out

The general idea here is to set up a routine to do the following:

  • loop over all possible model combinations
    • for each model form, loop over the number of observations
      • drop one observation and fit the model
      • predict the missing value
      • calculate the MSPE for the predictions

## sample size
nn <- nrow(gala)
## empty vector for predictions
loo_res <- rep(NA, nn)
## empty vector for MSPE
mspe_l <- rep(NA, n_mods)

## loop over all possible model combinations
for(i in 1:n_mods) {
  ## create model formula
  if(i == 1) {
    fmla <- "Species ~ 1"
  } else {
    fmla <- paste("Species ~", paste(cov_names[use_set[i,]], collapse = " + "))
  }
  ## loop over number of observations
  for(j in 1:nn) {
    ## drop one observation and fit the model
    fm <- lm(as.formula(fmla), gala[-j,])
    ## predict the missing value
    loo_res[j] <- predict(fm, newdata = data.frame(gala[j,]))
  }
  ## calculate MSPE for the predictions
  mspe_l[i] <- sum((gala$Species - loo_res)^2) / nn
}

Let’s check to see if these results match those from our in-sample comparisons via AIC and BIC.

data.frame(AIC = order(mod_res[,"AIC"]),
           BIC = order(mod_res[,"BIC"]),
           LOO = order(mspe_l))
##    AIC BIC LOO
## 1   12  12  12
## 2   18  18  18
## 3   15  15  15
## 4    3   3   9
## 5    9   9   3
## 6    6   6   6
## 7    2   2   7
## 8    8   8   1
## 9   11  11  16
## 10   5   5  10
## 11  17  17   4
## 12  14  14  13
## 13   1   1   8
## 14   7   7   2
## 15  10  10   5
## 16   4   4  11
## 17  16  16  17
## 18  13  13  14

It looks like all 3 methods agree on the top 3 models, but then the results from the leave-one-out cross-validation start to diverge from the in-sample results.

Non-exhaustive cross-validation

Let’s now try a non-exhaustive method where we don’t use every combination of the \(n - q\) training data.

\(k\)-fold

Here we’ll use \(k\)-fold cross-validation where the data are randomly partitioned into \(k\) equal sized groups. We’ll retain one of the \(k\) sub-samples for validation while the remaining \(k − 1\) groups are used for fitting. This process is then repeated \(k\) times, with each of the \(k\) sub-samples used exactly once for validation.

We have 30 observations in our dataset, so let’s use 5 groups of 6 observations each. For simplicity, we’ll assign the folds in the order the islands appear in the data rather than at random. The rest of the analysis will proceed as above for the leave-one-out method.

## number of groups
kk <- 5
## group size
gs <- nn / kk
## empty vector for predictions
kf_res <- rep(NA, nn)
## empty vector for MSPE
mspe_k <- rep(NA, n_mods)

## loop over all possible model combinations
for(i in 1:n_mods) {
  ## create model formula
  if(i == 1) {
    fmla <- "Species ~ 1"
  } else {
    fmla <- paste("Species ~", paste(cov_names[use_set[i,]], collapse = " + "))
  }
  ## loop over folds
  for(fold in 1:kk) {
    ## group index
    grp <- seq(gs) + gs*(fold - 1)
    ## drop one group and fit the model
    fm <- lm(as.formula(fmla), gala[-grp,])
    ## predict the missing values
    kf_res[grp] <- predict(fm, newdata = data.frame(gala[grp,]))
  }
  ## calculate MSPE for the predictions
  mspe_k[i] <- sum((gala$Species - kf_res)^2) / nn
}
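
Note that the loop above simply uses the islands in the order they appear in the data. If we wanted the random partition described earlier, we could assign the folds with sample() instead, as in this sketch (the seed value is arbitrary):

## sketch of a random assignment of the 30 islands to 5 folds
set.seed(123)
fold_id <- sample(rep(1:kk, each = gs))
## inside the loop over folds, the group index would then be
## grp <- which(fold_id == fold)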

Now we can compare these results to the other model selection results from above.

data.frame(AIC = order(mod_res[,"AIC"]),
           BIC = order(mod_res[,"BIC"]),
           LOO = order(mspe_l),
           kfold = order(mspe_k))
##    AIC BIC LOO kfold
## 1   12  12  12    12
## 2   18  18  18    18
## 3   15  15  15    15
## 4    3   3   9     9
## 5    9   9   3     3
## 6    6   6   6     6
## 7    2   2   7     1
## 8    8   8   1    10
## 9   11  11  16     7
## 10   5   5  10    16
## 11  17  17   4     4
## 12  14  14  13    13
## 13   1   1   8     8
## 14   7   7   2     2
## 15  10  10   5     5
## 16   4   4  11    11
## 17  16  16  17    14
## 18  13  13  14    17

Here we see that all four methods come up with the same rankings for the top 3 models, and that both of the out-of-sample methods are nearly identical, too. It’s also important to recognize that we likely would have found a somewhat different result if we had chosen a different number of groups for the \(k\)-fold cross-validation.

Multi-model inference

We saw in lecture that we can use the Akaike weights to average parameters from different models as a means of addressing uncertainty in our model structures. Recall that for a given parameter \(\theta\), its model-averaged estimate is

\[ \bar{\hat{\theta}} = \sum_{i = 1}^S w_i \hat{\theta}_i \]

where \(S\) is the total number of models in the set. Usually a given parameter \(\theta\) does not appear in all models, so we can use an indicator function to compute the average estimate

\[ \bar{\hat{\theta}} = \frac{\sum_{i = 1}^S I(f_i) w_i \hat{\theta}_i}{\sum_{i = 1}^S I(f_i) w_i} \\ ~ \\ I(f_i) = \left\{ \begin{array} {ll} 1 & \text{if} ~ \theta ~\text{is in} ~ f_i \\ 0 & \text{otherwise} \end{array} \right. \]

Here we saw that 2 or 3 of our models had very similar support from the data, so it might be worth averaging our coefficients across all of the models. In this case, however, the top 3 models contain 99.9% of the total weights, so the remaining models will have very little influence on the results. (Also, it would have been more efficient to do all of this the first time we fit our models, but that’s okay.)
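
As a quick check on that claim, we can sum the Akaike weights for models 12, 15, and 18 that we computed earlier:

## total Akaike weight for the top 3 models
sum(wts[c(12, 15, 18)])

With that in mind, let’s now refit each model, store its coefficients, and compute the weighted averages.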

## empty matrix for storing coefficients
## we'll fill it with 0's and replace them with the param estimates
mod_coef <- matrix(0, n_mods, 1 + ncol(df))
colnames(mod_coef) <- c("Intercept", colnames(df))

## fit models & store their coefficients
for(i in 1:n_mods) {
  if(i == 1) {
    fmla <- "Species ~ 1"
  } else {
    fmla <- paste("Species ~", paste(cov_names[use_set[i,]], collapse = " + "))
  }
  mod_fit <- lm(as.formula(fmla), data = gala)
  mod_coef[i, c(TRUE, use_set[i,])] <- coef(mod_fit)
}

## calculate weighted coefs
wtd_coef <- mod_coef * wts
(avg_coef <- colSums(wtd_coef))
##     Intercept          Area     Elevation       Nearest         Scruz 
##  7.544112e+00  6.494906e-07  2.759027e-01 -9.823119e-02 -8.732447e-02 
##      Adjacent 
## -6.851311e-02

It’s important to note here that using this averaged model in practice isn’t a good idea because of the high correlation between Area and Elevation and between Nearest and Scruz. An alternative approach would be to use each of the 18 models to make a prediction for a given \(\mathbf{X}\) and then weight those predictions to come up with a single model-averaged prediction.
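
Here is a hedged sketch of that prediction-averaging idea, reusing the objects we created above; the hypothetical island in new_X and its covariate values are purely illustrative.

## covariate values for a hypothetical island (illustrative only)
new_X <- data.frame(Area = 1, Elevation = 400, Nearest = 5,
                    Scruz = 50, Adjacent = 2)
## refit each model in the set & store its prediction for `new_X`
mod_pred <- rep(NA, n_mods)
for(i in 1:n_mods) {
  if(i == 1) {
    fmla <- "Species ~ 1"
  } else {
    fmla <- paste("Species ~", paste(cov_names[use_set[i,]], collapse = " + "))
  }
  mod_pred[i] <- predict(lm(as.formula(fmla), data = gala), newdata = new_X)
}
## model-averaged prediction using the Akaike weights
sum(wts * mod_pred)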