Background

This week we will see how to fit linear models in R, and estimate parameter uncertainty and confidence intervals. The data come from a paper by Johnson & Raven (1973, Science 179:893-895) and consist of the diversity of plants species across 30 islands in the Galapagos Archipelago. The data set also contains measurements of 5 possible explanatory variables:

  • Area: the area of the island (km2)
  • Elevation: the highest elevation of the island (m)
  • Nearest: the distance from the nearest island (km)
  • Scruz: the distance from Santa Cruz Island (km)
  • Adjacent: the area of the adjacent island (km2)

The gala dataset is included in the faraway package. Let’s load the package and inspect the data.

library(faraway)
head(gala)
##              Species Endemics  Area Elevation Nearest Scruz Adjacent
## Baltra            58       23 25.09       346     0.6   0.6     1.84
## Bartolome         31       21  1.24       109     0.6  26.3   572.33
## Caldwell           3        3  0.21       114     2.8  58.7     0.78
## Champion          25        9  0.10        46     1.9  47.4     0.18
## Coamano            2        1  0.05        77     1.9   1.9   903.82
## Daphne.Major      18       11  0.34       119     8.0   8.0     1.84

The second column Endemics is the number of plant species that are endemic to each island, but we will just use the total number of plant species recorded in Species.

A simple model for plant diversity

We seek a linear model that would help us explain the diversity (number) of plant species as a function of some predictors, which are sometimes referred to as covariates. From the theory of island biogeography, it would seem reasonable that plant diversity is related to the area of an island, such that

species diversity = \(f\)(area)


More formally, we can write our explicit linear regression model as

\[ \text{species}_i = \beta_0 + \beta_1 \text{area}_i + e_i \]

where \(i\) indicates an island.

Let’s plot the repsonse and predictor to see what the relationship looks like.

Estimating \(\hat{\boldsymbol{\beta}}\)

We saw in lecture that we can estimate the regression parameters as

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \]

Let’s begin by creating the matrix of response values \(\mathbf{y}\), which should have dimensions \(n \times 1\).

nn <- nrow(gala)
yy <- matrix(gala$Species, nrow = nn, ncol = 1)
head(yy)
##      [,1]
## [1,]   58
## [2,]   31
## [3,]    3
## [4,]   25
## [5,]    2
## [6,]   18

Next let’s create the matrix of predictors \(\mathbf{X}\). There is only one predictor in the model (Area), so our matrix should consist of a column of 1’s and a column with the area of each island.

Intercept <- rep(1, nn)
Area <- gala$Area
XX <- cbind(Intercept, Area)
head(XX)
##      Intercept  Area
## [1,]         1 25.09
## [2,]         1  1.24
## [3,]         1  0.21
## [4,]         1  0.10
## [5,]         1  0.05
## [6,]         1  0.34

Matrix multiplication in R uses the %*% operator. You can transpose a matrix with t() and invert a matrix with solve(). So our estimate of the regression parameters in is

beta_hat <- solve(t(XX) %*% XX) %*% t(XX) %*% yy
beta_hat
##                  [,1]
## Intercept 63.78286147
## Area       0.08196317

We can interpret these coefficients to mean that an island with an area of 0 km2 would be expected to have ~64 species of plants(!), and that you would add ~8 species per 100 km2 of island.

Estimating \(\hat{\mathbf{y}}\)

Recall that the fitted values are given by

\[ \begin{align} \hat{\mathbf{y}} &= \mathbf{X} \hat{\boldsymbol{\beta}} \\ &= \mathbf{X} \left( (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \right) \\ &= \underbrace{\mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}}_{\mathbf{H}} \mathbf{y} \\ &= \mathbf{H} \mathbf{y}. \end{align} \]

So we can calculate the hat matrix \(\mathbf{H}\) and solve for \(\hat{\mathbf{y}}\)

HH <- XX %*% solve(t(XX) %*% XX) %*% t(XX)
y_hat <- HH %*% yy

Let’s add the predicted values to our plot of the data.

par(mai = c(0.9, 0.9, 0.3, 0.1), omi = rep(0, 4))
## plot the data
plot(gala$Area, gala$Species, pch = 16, col = "dark green",
     ylab = "Number of plant species", xlab = expression(paste("Area of island (", km^2, ")")))
## add the fitted values (line)
abline(a = beta_hat[1], b = beta_hat[2])

This is not a great fit, but it’s a decent start. We’ll expand upon our model later.

A shortcut to model fitting

As we saw briefly in lecture, it’s easy to fit a linear regression model using the function lm(), which takes arguments like

lm(respone ~ predictor(s), data_to_use)

To do so for our model above, we have

simple_model <- lm(Species ~ Area, gala)
simple_model
## 
## Call:
## lm(formula = Species ~ Area, data = gala)
## 
## Coefficients:
## (Intercept)         Area  
##    63.78286      0.08196

Note that these are the same estimates that we obtained earlier

We can obtain the estimates of the fitted values \(\hat{\mathbf{y}}\) via one of two functions in R: fitted() and predict(). We’ll see later that predict() has additional functionality that we can use.

## via fitted
y_hat_f <- fitted(simple_model)
## via predict
y_hat_p <- predict(simple_model, type = "response")
## compare these to each other
all.equal(y_hat_f, y_hat_p)
## [1] TRUE

Goodness-of-fit

We saw in lecture that we can use the coefficient of determination \((R^2)\) to measure how well a model fits the data. The formula is

\[ R^2 = 1 - \frac{SSE}{SSTO} \]

We can easly compute these quantities from our model fit and data.

## SSE
resids <- yy - y_hat
SSE <- sum(resids^2) # = t(resids) %*% resids
## SSTO
SSTO <- sum((yy - mean(yy))^2)
## R^2
1 - SSE / SSTO
## [1] 0.3817301

Let’s compare this to the value in the ANOVA table that R produces

summary(simple_model)
## 
## Call:
## lm(formula = Species ~ Area, data = gala)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -99.495 -53.431 -29.045   3.423 306.137 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 63.78286   17.52442   3.640 0.001094 ** 
## Area         0.08196    0.01971   4.158 0.000275 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.73 on 28 degrees of freedom
## Multiple R-squared:  0.3817, Adjusted R-squared:  0.3596 
## F-statistic: 17.29 on 1 and 28 DF,  p-value: 0.0002748

The estimates are indeed the same.

Adjusted \(R^2\)

The ANOVA table above also reports the Adjusted R-squared, which we haven’t discussed yet. It turns out that one can always increase the \(SSR\) of a model by including more predictors, even if they are completed unrelated to the response. Recall our objective in least squares estimation is

\[ \min \sum SSE = \min \sum_i (y_i - X_i \beta). \]

Increasing the dimension of \(\mathbf{X}\) by adding more predictors (columns) necessitates an increased space within which to minimize this objective function, and hence you can get a better fit o the data. We can demonstrate this with a simple example. Suppose we added another predictor to our simple model that was nothing more than a vector of Gaussian white noise, which is unrelated to the response.

set.seed(514)
## generate a vector of Gaussian white noise
WN <- rnorm(nn)
## add this to our Galapagos data frame
gala$WN <- WN 
## fit a model with Area & WN
summary(lm(Species ~ Area + WN, gala))
## 
## Call:
## lm(formula = Species ~ Area + WN, data = gala)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -93.86 -44.67 -21.73  15.93 308.84 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  70.68115   17.82480   3.965 0.000485 ***
## Area          0.07892    0.01944   4.059 0.000378 ***
## WN          -20.27151   13.91996  -1.456 0.156843    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.95 on 27 degrees of freedom
## Multiple R-squared:  0.4268, Adjusted R-squared:  0.3843 
## F-statistic: 10.05 on 2 and 27 DF,  p-value: 0.0005465

Notice that our \(R^2\) value has indeed increased.

To account for this, we can use an adjusted \(R^2\), which corrects for these additional degrees of freedom.

\[ \bar{R}^2 = 1 - \frac{SSE ~ / ~ df_{error}}{SSTO ~ / ~ df_{total}} \]

Let’s calculate \(\bar{R}^2\) for our simple model and confirm that it matches the value in the ANOVA table for the model with only Area as a predictor.

1 - (SSE / (nn - 2)) / (SSTO / (nn - 1))
## [1] 0.359649

A better model for plant diversity

Let’s expand our model from above to include some more predictors. In particular, we’ll add in the effects of an island’s elevation and the distance to the nearest island, such that

plant diversity = \(f\)(area, elevation, distance to nearest island)


More formally, we can write our explicit linear regression model as

\[ \text{species}_i = \beta_0 + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + e_i \]

where \(i\) indicates an island.

Fit the model

We’ll use lm() to estimate this model.

## larger model
full_mod <- lm(Species ~ Area + Elevation + Nearest, gala)
full_mod
## 
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest, data = gala)
## 
## Coefficients:
## (Intercept)         Area    Elevation      Nearest  
##    16.46471      0.01908      0.17134      0.07123

Notice that our estimates for the intercept \(\beta_0\) and the effect of island area \(\beta_1\) changed when we accounted for additional predictors in the model.

Hypothesis tests to compare models

We saw in lecture a number of different hypothesis tests we might wish to undertake. Let’s run through examples of those here.

Test of all predictors in a model

Suppose we wanted to test whether this collection of 3 predictors in our model were better than simply estimating the data by their mean.

\[ \Theta: \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ \theta: \mathbf{y} = \boldsymbol{\mu} + \mathbf{e} \\ \]

We write the null hypothesis as

\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \]

and we would reject \(H_0\) if \(F > F^{(\alpha)}_{k_{\Theta} - k_{\theta}, n - k_{\Theta}}\)

Recall that we wil base this test on an \(F\)-distribution, such that

\[ SSE_{\Theta} = \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\top} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right) = \mathbf{e}^{\top} \mathbf{e} = SSE \\ SSE_{\theta} = \left( \mathbf{y} - \bar{y} \right)^{\top} \left( \mathbf{y} - \bar{y} \right) = SSTO \\ \Downarrow \\ F = \frac{ \left( SSTO - SSE \right) / (k - 1) } { SSE / (n - k)} \]

\(F\)-test by hand

Calculating the sums-of-squares in R is straightforward with matric operations.

## get matrix of predictors
XX <- model.matrix(full_mod)
## estimate beta
beta_hat <- solve(t(XX) %*% XX) %*% t(XX) %*% yy
## total sum of squares
SSE <- t(yy - XX %*% beta_hat) %*% (yy - XX %*% beta_hat)
## error sum of squares
SSTO <- t(yy - mean(yy)) %*% (yy - mean(yy))
## F statistic
(F_stat <- ((SSTO - SSE) / (4 - 1)) / (SSE / (nn - 4)))
##          [,1]
## [1,] 10.77043
## F test
pf(F_stat, 4-1, nn-4, lower.tail = F)
##              [,1]
## [1,] 8.816845e-05

This \(F\)-statistic is quite large and the \(p\)-value is very small, so we would reject the null hypothesis that we would be justified in dropping the 3 predictors from this model in favor of a mean-only model.

\(F\)-test with built-in functions

We can use the anova() function in R to conduct this \(F\)-test

## null model; the '1' indicates an intercept-only model
null_mod <- lm(Species ~ 1, gala)
## use `anova('simple', 'complex')` to get the F-test results
anova(null_mod, full_mod)
## Analysis of Variance Table
## 
## Model 1: Species ~ 1
## Model 2: Species ~ Area + Elevation + Nearest
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1     29 381081                                 
## 2     26 169918  3    211164 10.77 8.817e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that we get the same \(F\)-statistic and \(p\)-value as earlier

Testing one predictor

We might ask whether any one predictor could be dropped from our model. For example, can nearest be dropped from this model? One option is to fit this reduced model and compare to it to the full model via our \(F\)-test with \(H_0: \beta_3 = 0\)

\[ \begin{aligned} \Theta: \text{species}_i &= \beta_0 + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + e_i \\ ~ \\ \theta: \text{species}_i &= \beta_0 + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + e_i \end{aligned} \]

Here’s our test in R.

## full model as before
full_mod <- lm(Species ~ Area + Elevation + Nearest, gala)
## reduced model without `nearest`
reduced_mod <- lm(Species ~ Area + Elevation, gala)
## use `anova('reduced', 'full')` to get the F-test results
anova(reduced_mod, full_mod)
## Analysis of Variance Table
## 
## Model 1: Species ~ Area + Elevation
## Model 2: Species ~ Area + Elevation + Nearest
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     27 169947                           
## 2     26 169918  1    29.243 0.0045 0.9472

The \(F\)-statistic is small and the \(p\)-value is large so we cannot reject \(H_0\), and hence we conclude that we would be justified in dropping Nearest from our model.

Using a \(t\)-test

Another option for testing the signficance of one predictor is to estimate a \(t\)-statistic as

\[ t_i = \frac{\beta_i}{\text{SE} \left( \beta_i \right)} \]

and compare it to a \(t\)-distribution with \(n - k\) degrees of freedom. We can get the results of the \(t\)-test with summary() from base R or sumary() from the faraway pkg, which is much less verbose.

## via `base::summary()`
summary(full_mod)
## 
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest, data = gala)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -191.856  -33.111  -18.626    5.673  262.209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 16.46471   23.38884   0.704  0.48772   
## Area         0.01908    0.02676   0.713  0.48216   
## Elevation    0.17134    0.05452   3.143  0.00415 **
## Nearest      0.07123    1.06481   0.067  0.94718   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.84 on 26 degrees of freedom
## Multiple R-squared:  0.5541, Adjusted R-squared:  0.5027 
## F-statistic: 10.77 on 3 and 26 DF,  p-value: 8.817e-05
## via `faraway::sumary()`
sumary(full_mod)
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.464711  23.388841  0.7040 0.487718
## Area         0.019085   0.026764  0.7131 0.482158
## Elevation    0.171336   0.054519  3.1427 0.004151
## Nearest      0.071227   1.064806  0.0669 0.947179
## 
## n = 30, p = 4, Residual SE = 80.84116, R-Squared = 0.55

Note that we get \(t^2 = F\) and the same \(p\)-value as before

Testing 2+ predictors

Sometimes we might want to know whether we can drop 2+ predictors from a model. For example, could we drop both \(\text{area}\) and \(\text{nearest}\) from our full model?

\[ \begin{aligned} \Theta: \text{species}_i &= \beta_0 + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + e_i \\ ~ \\ \theta: \text{species}_i &= \beta_0 + \beta_2 \text{elevation}_i + e_i \end{aligned} \]

\(H_0 : \beta_1 = \beta_3 = 0\)

Again we can conduct this test by fitting both models and calculating \(F\).

## reduced model without `area` and `nearest`
reduced_mod <- lm(Species ~ Elevation, gala)
## use `anova('reduced', 'full')` to get the F-test results
anova(reduced_mod, full_mod)
## Analysis of Variance Table
## 
## Model 1: Species ~ Elevation
## Model 2: Species ~ Area + Elevation + Nearest
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     28 173254                           
## 2     26 169918  2    3336.3 0.2552 0.7766

The \(F\)-statistic is rather small and the \(p\)-value is big so we cannot reject \(H_0\), suggesting we would be justified in dropping area and elevation from the model.

Multiple \(t\)-tests versus one \(F\)-test

Could we instead evaluate our null hypothesis by examining our table of \(t\)-statistics and associated \(p\)-values?

## t-statistics from our full model
sumary(full_mod)
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.464711  23.388841  0.7040 0.487718
## Area         0.019085   0.026764  0.7131 0.482158
## Elevation    0.171336   0.054519  3.1427 0.004151
## Nearest      0.071227   1.064806  0.0669 0.947179
## 
## n = 30, p = 4, Residual SE = 80.84116, R-Squared = 0.55

This creates the problem of having to make our decision based on 2 \(p\)-values rather than one

  • Each \(p\)-value corresponds to a separate \(t\)-test with the other predictor in the model

  • Therefore we must use the single \(F\)-test

Testing a subspace

Some tests cannot be expressed in terms of the inclusion or exclusion of predictors.

Consider a test of whether the areas of the current and adjacent island could be added together and used in place of the two separate predictors

\[ \text{species}_i = \beta_0 + \beta_1 \text{area}_i + \beta_2 \text{adjacent}_i + \dots + e_i \\ ~ \\ \text{species}_i = \beta_0 + \beta_1 \text{(area + adjacent)}_i + \dots + e_i \]

\(H_0 : \beta_{\text{area}} = \beta_{\text{adjacent}}\)

We can do this in R by using the I() function, which tells R to treat things inside the () as an operation.

## full model as before
full_mod <- lm(Species ~ Area + Adjacent + Elevation + Nearest, gala)
## reduced model without `elevation + nearest`
comb_mod <- lm(Species ~ I(Area + Adjacent) + Elevation + Nearest, gala)
## use `anova('combined', 'full')` to get the F-test results
anova(comb_mod, full_mod)
## Analysis of Variance Table
## 
## Model 1: Species ~ I(Area + Adjacent) + Elevation + Nearest
## Model 2: Species ~ Area + Adjacent + Elevation + Nearest
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     26 116865                              
## 2     25  93867  1     22998 6.1252 0.02046 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The \(F\)-statistic is large and the \(p\)-value is reasonably small so we reject \(H_0\).


What if we wanted to test whether a predictor had a specific (non-zero) value? For example, is there a 1:1 relationship between \(\text{species}\) and \(\text{elevation}\) after controlling for the other predictors?

\[ \text{species}_i = \beta_0 + \beta_1 \text{area}_i + \underline{1} \text{elevation}_i + \beta_3 \text{nearest}_i + e_i \]

\(H_0 : \beta_2 = 1\)

We can do this in R by using the offset() function, which allows us to set a fixed beta.

## full model
full_mod <- lm(Species ~ Area + Elevation + Nearest, gala)
## model with effect of `elevation` = 1
fixed_mod <- lm(Species ~ Area + offset(1 * Elevation) + Nearest, gala)
## use `anova('comb', 'full')` to get the F-test results
anova(fixed_mod, full_mod)
## Analysis of Variance Table
## 
## Model 1: Species ~ Area + offset(1 * Elevation) + Nearest
## Model 2: Species ~ Area + Elevation + Nearest
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     27 1679731                                  
## 2     26  169918  1   1509813 231.02 1.891e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The \(F\)-statistic is very large and the \(p\)-value is tiny so we reject \(H_0\).

Using a \(t\)-test

We can also modify our \(t\)-test from before and use it for our comparison by including the hypothesized \(\hat{\beta}_{H_0}\) as an offset

\[ t_i = \frac{(\hat{\beta_i} - \beta_{H_0})}{\text{SE} \left( \hat{\beta}_i \right)} \]

We can get \(\hat{\beta}_{\text{elevation}}\) and \(\text{SE}(\hat{\beta}_{\text{elevation}})\) from our model summary

## full model
sumary(full_mod)
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.464711  23.388841  0.7040 0.487718
## Area         0.019085   0.026764  0.7131 0.482158
## Elevation    0.171336   0.054519  3.1427 0.004151
## Nearest      0.071227   1.064806  0.0669 0.947179
## 
## n = 30, p = 4, Residual SE = 80.84116, R-Squared = 0.55

This leaves us with \(\hat{\beta}_{\text{elevation}} \approx 0.17\) and \(\text{SE}(\hat{\beta}_{\text{elevation}}) \approx 0.055\).

We can calculate the \(t\)-test by hand and compare the results to the above \(F\)-test

## t statistic
(t_value <- (0.171336 - 1) / 0.054519)
## p-value = t_alpha * Pr(t_value, df); `pt()` is the pdf for a t-dist
(p_value <- 1.96 * pt(t_value, 26))
## verify t^2 = F
all.equal(t_value^2, anova(fixed_mod, full_mod)$F[2], tolerance = 0.0001)
## [1] -15.19955
## [1] 1.853024e-14
## [1] TRUE

The \(p\)-value is the same and \(t^2 = F\) as before

Confidence intervals for \(\beta\)

We can also use confidence intervals (CI’s) to express uncertainty in \(\hat{\beta}_i\), which take the form

\[ 100(1 - \alpha)\% ~ \text{CI}: \hat{\beta}_{i} \pm t_{n-p}^{(\alpha / 2)} \operatorname{SE}(\hat{\beta}) \]

where here \(\alpha\) is our predetermined Type-I error rate. To do so, we need the information from the model summary.

## t-statistics from our full model
sumary(full_mod)
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.464711  23.388841  0.7040 0.487718
## Area         0.019085   0.026764  0.7131 0.482158
## Elevation    0.171336   0.054519  3.1427 0.004151
## Nearest      0.071227   1.064806  0.0669 0.947179
## 
## n = 30, p = 4, Residual SE = 80.84116, R-Squared = 0.55

We can construct a 95% CI on \(\hat{\beta}_{\text{area}}\) using the 2.5% and 97.5% percentiles of the \(t\)-distribution with \(df = 30 - 4 = 26\).

## critical value for the t-dist
## `qt()` is the quantile function for the t-dist; `p` is the (1-alpha/2) value 
t_crit <- qt(p = 0.975, df = 30-4)
## 95% CI
CI95_beta <- 0.019085 + c(-1,1) * t_crit * 0.026764
round(CI95_beta, 3)
## [1] -0.036  0.074

Confidence intervals for all \(\beta\)

We can quickly get the 95% CI for all the \(\beta_i\) using the function confint().

## all of the 95% CI's
confint(full_mod)
##                    2.5 %      97.5 %
## (Intercept) -31.61174063 64.54116288
## Area         -0.03593020  0.07409948
## Elevation     0.05927051  0.28340204
## Nearest      -2.11751249  2.25996697

Bootstrap confidence intervals

The above \(F\)- and \(t\)-based CI’s depend on the assumption of normality. Recall that the bootstrap method provides a way to construct CI’s without this assumption

Bootstrap procedure

Step 1: Fit your model to the data

## we already fit `full_mod` above

Step 2: Calculate \(\mathbf{e} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}\)

## residuals from our full model
resids <- residuals(full_mod)

Step 3a: Generate \(\mathbf{e}^*\) by sampling with replacement from \(\mathbf{e}\)
Step 3b: Calculate \(\mathbf{y}^* = \mathbf{X} \hat{\boldsymbol{\beta}} + \mathbf{e}^*\)
Step 3c: Estimate \(\hat{\boldsymbol{\beta}}^*\) from \(\mathbf{X}\) & \(\mathbf{y}^*\)

## number of boostrap samples
nb <- 1000
## empty matrix for beta estimates
beta_est <- matrix(NA, nb, 4)
## fitted values from our full model = X*beta
Xbeta <- fitted(full_mod)
## sample many times
for(i in 1:nb) {
  ## 3a: sample w/ replacement from e
  e_star <- sample(resids, rep = TRUE)
  ## 3b: calculate y_star
  y_star <- Xbeta + e_star
  ## 3c: re-estimate beta_star from X & y_star
  beta_star <- update(full_mod, y_star ~ .)
  ## save estimated betas
  beta_est[i,] <- coef(beta_star)
}

Step 4: Select the \(\tfrac{\alpha}{2}\) and \((1 - \tfrac{\alpha}{2})\) percentiles from the saved \(\hat{\boldsymbol{\beta}}^*\)

## extract 2.5% and 97.5% values
CI95 <- apply(beta_est, 2, quantile, c(0.025, 0.975))
colnames(CI95) <- c("Intercept", colnames(gala[,3:5]))
t(round(CI95, 3))
##              2.5%  97.5%
## Intercept -23.222 62.445
## Area       -0.034  0.077
## Elevation   0.091  0.300
## Nearest    -1.796  2.165

Confidence interval for new predictions

Given a fitted model \(\mathbf{y} = \mathbf{X} \hat{\boldsymbol{\beta}} + \mathbf{e}\), we might want to know the uncertainty around a new estimate \(\mathbf{y}^*\) given some new predictor \(\mathbf{X}^*\). Recall that this CI is given by

\[ \hat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{ {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]

CI for the mean response

Think back to our simple model for plant diversity as a function of island area, such that

\[ \text{species}_i = \beta_0 + \beta_1 \text{area}_i + e_i \]

Suppose we wanted to estimate the uncertainty in the average diversity for an island with an area of 2000 km2. Recall that our CI on the mean response is given by

and here \(\mathbf{X}^* = [1 ~~ 2000]\) (ie, a \(1 \times 2\) row matrix).

CI by hand

We can calculate this by hand using matrix operations in R.

## matrix of predictors
XX <- model.matrix(simple_model)
## new X; vector for now
X_star <- c(Intercept = 1, Area = 2000)
## inside sqrt
inner_X <- t(X_star) %*% solve(t(XX) %*% XX) %*% X_star
## critical t-value
t_crit <- qt(0.975, df = nn-2)
## estimated SD
sigma <- summary(simple_model)$sigma
## predicted y
y_star <- sum(X_star * coef(simple_model))
## 95% CI
y_star + c(-1,1) * t_crit * sigma * sqrt(inner_X)
## [1] 149.5818 305.8366

CI with predict()

It’s much easier to use the predict() function to give us a CI.

predict(simple_model, new = data.frame(t(X_star)),
        level = 0.95, interval = "confidence")
##        fit      lwr      upr
## 1 227.7092 149.5818 305.8366

The estimated lower and upper bounds are the same as those we obtained by hand.

Prediction interval for a new response

We saw in lecture that we can also estimate the uncertainty for a specific prediction rather than the average response. Let’s now imagine we wanted to predict how many plant species we might find on a hypothetical island that might one day emerge from an underwater volcano and is ultimately 2000 km2. Recall that a PI is given by

\[ \hat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{1 + {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]

To calculate the PI by hand simply requires a different calculation for inner_X above.

## new X_star
X_star <- c(Intercept = 1, Area = 2000)
## inside sqrt
inner_X <- 1 + t(X_star) %*% solve(t(XX) %*% XX) %*% X_star
## 95% CI
y_star + c(-1,1) * t_crit * sigma * sqrt(inner_X)
## [1]  24.21065 431.20776

Notice how much wider this interval is than the CI we estimated above. We could instead use predict() as we did above for the CI by switching the interval argument to "prediction".

predict(simple_model, new = data.frame(t(X_star)),
        level = 0.95, interval = "prediction")
##        fit      lwr      upr
## 1 227.7092 24.21065 431.2078