This week we will see how to fit linear models in R, and estimate parameter uncertainty and confidence intervals. The data come from a paper by Johnson & Raven (1973, Science 179:893-895) and consist of the diversity of plant species across 30 islands in the Galapagos Archipelago. The data set also contains measurements of 5 possible explanatory variables:

`Area`
: the area of the island (km²)

`Elevation`
: the highest elevation of the island (m)

`Nearest`
: the distance from the nearest island (km)

`Scruz`
: the distance from Santa Cruz Island (km)

`Adjacent`
: the area of the adjacent island (km²)

The `gala` dataset is included in the faraway package. Let’s load the package and inspect the data.
library(faraway)
head(gala)
## Species Endemics Area Elevation Nearest Scruz Adjacent
## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84
The second column `Endemics` is the number of plant species that are endemic to each island, but we will just use the total number of plant species recorded in `Species`.
We seek a linear model that would help us explain the diversity (number) of plant species as a function of some predictors, which are sometimes referred to as covariates. From the theory of island biogeography, it would seem reasonable that plant diversity is related to the area of an island. More formally, we can write our explicit linear regression model as
\[ \text{species}_i = \beta_0 + \beta_1 \text{area}_i + e_i \]
where \(i\) indicates an island.
Let’s plot the response and predictor to see what the relationship looks like.
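The plotting code is not shown above, but a minimal sketch in base R (matching the style of the fitted-line plot below) might look like this:
## scatter plot of species richness against island area
plot(gala$Area, gala$Species, pch = 16, col = "dark green",
     ylab = "Number of plant species",
     xlab = expression(paste("Area of island (", km^2, ")")))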
We saw in lecture that we can estimate the regression parameters as
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \]
Let’s begin by creating the matrix of response values \(\mathbf{y}\), which should have dimensions \(n \times 1\).
nn <- nrow(gala)
yy <- matrix(gala$Species, nrow = nn, ncol = 1)
head(yy)
## [,1]
## [1,] 58
## [2,] 31
## [3,] 3
## [4,] 25
## [5,] 2
## [6,] 18
Next let’s create the matrix of predictors \(\mathbf{X}\). There is only one predictor in the model (`Area`), so our matrix should consist of a column of 1’s and a column with the area of each island.
Intercept <- rep(1, nn)
Area <- gala$Area
XX <- cbind(Intercept, Area)
head(XX)
## Intercept Area
## [1,] 1 25.09
## [2,] 1 1.24
## [3,] 1 0.21
## [4,] 1 0.10
## [5,] 1 0.05
## [6,] 1 0.34
Matrix multiplication in R uses the `%*%` operator. You can transpose a matrix with `t()` and invert a matrix with `solve()`. So our estimate of the regression parameters is
beta_hat <- solve(t(XX) %*% XX) %*% t(XX) %*% yy
beta_hat
## [,1]
## Intercept 63.78286147
## Area 0.08196317
We can interpret these coefficients to mean that an island with an area of 0 km² would be expected to have ~64 species of plants (!), and that species richness increases by ~8 species for every 100 km² of island area.
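As a quick sanity check on that interpretation, we can plug a couple of areas into the fitted equation using the `beta_hat` values computed above (a sketch, not part of the original analysis):
## expected species richness for islands of 0 and 100 km^2
beta_hat[1] + beta_hat[2] * c(0, 100)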
Recall that the fitted values are given by
\[ \begin{align} \hat{\mathbf{y}} &= \mathbf{X} \hat{\boldsymbol{\beta}} \\ &= \mathbf{X} \left( (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \right) \\ &= \underbrace{\mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}}_{\mathbf{H}} \mathbf{y} \\ &= \mathbf{H} \mathbf{y}. \end{align} \]
So we can calculate the hat matrix \(\mathbf{H}\) and solve for \(\hat{\mathbf{y}}\)
HH <- XX %*% solve(t(XX) %*% XX) %*% t(XX)
y_hat <- HH %*% yy
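As an aside, we can verify a defining property of the hat matrix, namely that it is idempotent (\(\mathbf{H}\mathbf{H} = \mathbf{H}\)); this is just a sanity check on our calculations:
## sanity check: the hat matrix is idempotent
all.equal(HH %*% HH, HH)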
Let’s add the predicted values to our plot of the data.
par(mai = c(0.9, 0.9, 0.3, 0.1), omi = rep(0, 4))
## plot the data
plot(gala$Area, gala$Species, pch = 16, col = "dark green",
ylab = "Number of plant species", xlab = expression(paste("Area of island (", km^2, ")")))
## add the fitted values (line)
abline(a = beta_hat[1], b = beta_hat[2])
This is not a great fit, but it’s a decent start. We’ll expand upon our model later.
As we saw briefly in lecture, it’s easy to fit a linear regression model using the function `lm()`, which takes arguments of the general form
lm(response ~ predictor(s), data_to_use)
To do so for our model above, we have
simple_model <- lm(Species ~ Area, gala)
simple_model
##
## Call:
## lm(formula = Species ~ Area, data = gala)
##
## Coefficients:
## (Intercept) Area
## 63.78286 0.08196
Note that these are the same estimates that we obtained earlier.
We can obtain the estimates of the fitted values \(\hat{\mathbf{y}}\) via one of two functions in R: `fitted()` and `predict()`. We’ll see later that `predict()` has additional functionality that we can use.
## via fitted
y_hat_f <- fitted(simple_model)
## via predict
y_hat_p <- predict(simple_model, type = "response")
## compare these to each other
all.equal(y_hat_f, y_hat_p)
## [1] TRUE
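We can also confirm that these match the fitted values we calculated by hand from the hat matrix (a quick sketch; `fitted()` returns a named vector and our `y_hat` is a matrix, so we coerce both to plain vectors before comparing):
## compare `fitted()` to our hand-calculated y_hat
all.equal(unname(y_hat_f), as.numeric(y_hat))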
We saw in lecture that we can use the coefficient of determination \((R^2)\) to measure how well a model fits the data. The formula is
\[ R^2 = 1 - \frac{SSE}{SSTO} \]
We can easily compute these quantities from our model fit and data.
## SSE
resids <- yy - y_hat
SSE <- sum(resids^2) # = t(resids) %*% resids
## SSTO
SSTO <- sum((yy - mean(yy))^2)
## R^2
1 - SSE / SSTO
## [1] 0.3817301
Let’s compare this to the value in the summary table that R produces
summary(simple_model)
##
## Call:
## lm(formula = Species ~ Area, data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.495 -53.431 -29.045 3.423 306.137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.78286 17.52442 3.640 0.001094 **
## Area 0.08196 0.01971 4.158 0.000275 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91.73 on 28 degrees of freedom
## Multiple R-squared: 0.3817, Adjusted R-squared: 0.3596
## F-statistic: 17.29 on 1 and 28 DF, p-value: 0.0002748
The estimates are indeed the same.
The summary table above also reports the `Adjusted R-squared`, which we haven’t discussed yet. It turns out that one can always increase the \(SSR\) of a model by including more predictors, even if they are completely unrelated to the response. Recall that our objective in least squares estimation is
\[ \min SSE = \min \sum_i (y_i - \mathbf{X}_i \boldsymbol{\beta})^2. \]
Increasing the dimension of \(\mathbf{X}\) by adding more predictors (columns) expands the space within which we minimize this objective function, and hence you can always obtain a better fit to the data. We can demonstrate this with a simple example. Suppose we added another predictor to our simple model that was nothing more than a vector of Gaussian white noise, which is unrelated to the response.
set.seed(514)
## generate a vector of Gaussian white noise
WN <- rnorm(nn)
## add this to our Galapagos data frame
gala$WN <- WN
## fit a model with Area & WN
summary(lm(Species ~ Area + WN, gala))
##
## Call:
## lm(formula = Species ~ Area + WN, data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.86 -44.67 -21.73 15.93 308.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.68115 17.82480 3.965 0.000485 ***
## Area 0.07892 0.01944 4.059 0.000378 ***
## WN -20.27151 13.91996 -1.456 0.156843
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89.95 on 27 degrees of freedom
## Multiple R-squared: 0.4268, Adjusted R-squared: 0.3843
## F-statistic: 10.05 on 2 and 27 DF, p-value: 0.0005465
Notice that our \(R^2\) value has indeed increased.
To account for this, we can use an adjusted \(R^2\), which corrects for these additional degrees of freedom.
\[ \bar{R}^2 = 1 - \frac{SSE ~ / ~ df_{error}}{SSTO ~ / ~ df_{total}} \]
Let’s calculate \(\bar{R}^2\) for our simple model and confirm that it matches the value in the summary table for the model with only `Area` as a predictor.
1 - (SSE / (nn - 2)) / (SSTO / (nn - 1))
## [1] 0.359649
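Alternatively, R stores \(\bar{R}^2\) in the object returned by `summary()`, so we can extract it directly:
## adjusted R^2 as reported by `summary()`
summary(simple_model)$adj.r.squared
## [1] 0.359649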
Let’s expand our model from above to include some more predictors. In particular, we’ll add in the effects of an island’s elevation and the distance to the nearest island. More formally, we can write our explicit linear regression model as
\[ \text{species}_i = \beta_0 + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + e_i \]
where \(i\) indicates an island.
We’ll use `lm()` to estimate this model.
## larger model
full_mod <- lm(Species ~ Area + Elevation + Nearest, gala)
full_mod
##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest, data = gala)
##
## Coefficients:
## (Intercept) Area Elevation Nearest
## 16.46471 0.01908 0.17134 0.07123
Notice that our estimates for the intercept \(\beta_0\) and the effect of island area \(\beta_1\) changed when we accounted for additional predictors in the model.
We saw in lecture a number of different hypothesis tests we might wish to undertake. Let’s run through examples of those here.
Suppose we wanted to test whether this collection of 3 predictors in our model was better than simply describing the data by their mean.
\[ \Theta: \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ \theta: \mathbf{y} = \boldsymbol{\mu} + \mathbf{e} \\ \]
We write the null hypothesis as
\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \]
and we would reject \(H_0\) if \(F > F^{(\alpha)}_{k_{\Theta} - k_{\theta}, n - k_{\Theta}}\).
Recall that we will base this test on an \(F\)-distribution, such that
\[ SSE_{\Theta} = \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\top} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right) = \mathbf{e}^{\top} \mathbf{e} = SSE \\ SSE_{\theta} = \left( \mathbf{y} - \bar{y} \right)^{\top} \left( \mathbf{y} - \bar{y} \right) = SSTO \\ \Downarrow \\ F = \frac{ \left( SSTO - SSE \right) / (k - 1) } { SSE / (n - k)} \]
Calculating the sums-of-squares in R is straightforward with matrix operations.
## get matrix of predictors
XX <- model.matrix(full_mod)
## estimate beta
beta_hat <- solve(t(XX) %*% XX) %*% t(XX) %*% yy
## error sum of squares
SSE <- t(yy - XX %*% beta_hat) %*% (yy - XX %*% beta_hat)
## total sum of squares
SSTO <- t(yy - mean(yy)) %*% (yy - mean(yy))
## F statistic
(F_stat <- ((SSTO - SSE) / (4 - 1)) / (SSE / (nn - 4)))
## [,1]
## [1,] 10.77043
## F test
pf(F_stat, 4-1, nn-4, lower.tail = F)
## [,1]
## [1,] 8.816845e-05
This \(F\)-statistic is quite large and the \(p\)-value is very small, so we reject the null hypothesis; we would not be justified in dropping the 3 predictors from this model in favor of a mean-only model.
We can use the `anova()` function in R to conduct this \(F\)-test.
## null model; the '1' indicates an intercept-only model
null_mod <- lm(Species ~ 1, gala)
## use `anova('simple', 'complex')` to get the F-test results
anova(null_mod, full_mod)
## Analysis of Variance Table
##
## Model 1: Species ~ 1
## Model 2: Species ~ Area + Elevation + Nearest
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 381081
## 2 26 169918 3 211164 10.77 8.817e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that we get the same \(F\)-statistic and \(p\)-value as earlier.
We might ask whether any one predictor could be dropped from our model. For example, can `Nearest` be dropped from this model? One option is to fit this reduced model and compare it to the full model via our \(F\)-test with \(H_0: \beta_3 = 0\).
\[ \begin{aligned} \Theta: \text{species}_i &= \beta_0 + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + e_i \\ ~ \\ \theta: \text{species}_i &= \beta_0 + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + e_i \end{aligned} \]
Here’s our test in R.
## full model as before
full_mod <- lm(Species ~ Area + Elevation + Nearest, gala)
## reduced model without `nearest`
reduced_mod <- lm(Species ~ Area + Elevation, gala)
## use `anova('reduced', 'full')` to get the F-test results
anova(reduced_mod, full_mod)
## Analysis of Variance Table
##
## Model 1: Species ~ Area + Elevation
## Model 2: Species ~ Area + Elevation + Nearest
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 27 169947
## 2 26 169918 1 29.243 0.0045 0.9472
The \(F\)-statistic is small and the \(p\)-value is large so we cannot reject \(H_0\), and hence we conclude that we would be justified in dropping `Nearest` from our model.
Another option for testing the significance of one predictor is to estimate a \(t\)-statistic as
\[ t_i = \frac{\hat{\beta}_i}{\text{SE} \left( \hat{\beta}_i \right)} \]
and compare it to a \(t\)-distribution with \(n - k\) degrees of freedom. We can get the results of the \(t\)-test with `summary()` from base R or `sumary()` from the faraway package, which is much less verbose.
## via `base::summary()`
summary(full_mod)
##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest, data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -191.856 -33.111 -18.626 5.673 262.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.46471 23.38884 0.704 0.48772
## Area 0.01908 0.02676 0.713 0.48216
## Elevation 0.17134 0.05452 3.143 0.00415 **
## Nearest 0.07123 1.06481 0.067 0.94718
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.84 on 26 degrees of freedom
## Multiple R-squared: 0.5541, Adjusted R-squared: 0.5027
## F-statistic: 10.77 on 3 and 26 DF, p-value: 8.817e-05
## via `faraway::sumary()`
sumary(full_mod)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.464711 23.388841 0.7040 0.487718
## Area 0.019085 0.026764 0.7131 0.482158
## Elevation 0.171336 0.054519 3.1427 0.004151
## Nearest 0.071227 1.064806 0.0669 0.947179
##
## n = 30, p = 4, Residual SE = 80.84116, R-Squared = 0.55
Note that we get \(t^2 = F\) and the same \(p\)-value as before.
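As an aside, we could also compute these \(t\)-statistics and \(p\)-values by hand from the estimated coefficients and their standard errors (a sketch; `vcov()` returns the estimated covariance matrix of \(\hat{\boldsymbol{\beta}}\)):
## t statistics by hand: beta_hat / SE(beta_hat)
beta_se <- sqrt(diag(vcov(full_mod)))
coef(full_mod) / beta_se
## two-sided p-values from a t-distribution with n - k = 26 df
2 * pt(abs(coef(full_mod) / beta_se), df = 26, lower.tail = FALSE)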
Sometimes we might want to know whether we can drop 2+ predictors from a model. For example, could we drop both \(\text{area}\) and \(\text{nearest}\) from our full model?
\[ \begin{aligned} \Theta: \text{species}_i &= \beta_0 + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + e_i \\ ~ \\ \theta: \text{species}_i &= \beta_0 + \beta_2 \text{elevation}_i + e_i \end{aligned} \]
\(H_0 : \beta_1 = \beta_3 = 0\)
Again we can conduct this test by fitting both models and calculating \(F\).
## reduced model without `area` and `nearest`
reduced_mod <- lm(Species ~ Elevation, gala)
## use `anova('reduced', 'full')` to get the F-test results
anova(reduced_mod, full_mod)
## Analysis of Variance Table
##
## Model 1: Species ~ Elevation
## Model 2: Species ~ Area + Elevation + Nearest
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 173254
## 2 26 169918 2 3336.3 0.2552 0.7766
The \(F\)-statistic is rather small and the \(p\)-value is big so we cannot reject \(H_0\), suggesting we would be justified in dropping both area and nearest from the model.
Could we instead evaluate our null hypothesis by examining our table of \(t\)-statistics and associated \(p\)-values?
## t-statistics from our full model
sumary(full_mod)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.464711 23.388841 0.7040 0.487718
## Area 0.019085 0.026764 0.7131 0.482158
## Elevation 0.171336 0.054519 3.1427 0.004151
## Nearest 0.071227 1.064806 0.0669 0.947179
##
## n = 30, p = 4, Residual SE = 80.84116, R-Squared = 0.55
This creates the problem of having to make our decision based on 2 \(p\)-values rather than one, and each \(p\)-value corresponds to a separate \(t\)-test that assumes the other predictor remains in the model. Therefore, we must use the single \(F\)-test.
Some tests cannot be expressed in terms of the inclusion or exclusion of predictors.
Consider a test of whether the areas of the current and adjacent island could be added together and used in place of the two separate predictors.
\[ \text{species}_i = \beta_0 + \beta_1 \text{area}_i + \beta_2 \text{adjacent}_i + \dots + e_i \\ ~ \\ \text{species}_i = \beta_0 + \beta_1 \text{(area + adjacent)}_i + \dots + e_i \]
\(H_0 : \beta_{\text{area}} = \beta_{\text{adjacent}}\)
We can do this in R by using the `I()` function, which tells R to evaluate the expression inside the parentheses arithmetically (i.e., "as is") rather than as part of the model formula.
## full model as before
full_mod <- lm(Species ~ Area + Adjacent + Elevation + Nearest, gala)
## model with `Area` and `Adjacent` combined into a single predictor
comb_mod <- lm(Species ~ I(Area + Adjacent) + Elevation + Nearest, gala)
## use `anova('combined', 'full')` to get the F-test results
anova(comb_mod, full_mod)
## Analysis of Variance Table
##
## Model 1: Species ~ I(Area + Adjacent) + Elevation + Nearest
## Model 2: Species ~ Area + Adjacent + Elevation + Nearest
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 116865
## 2 25 93867 1 22998 6.1252 0.02046 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The \(F\)-statistic is large and the \(p\)-value is reasonably small so we reject \(H_0\).
What if we wanted to test whether a predictor had a specific (non-zero) value? For example, is there a 1:1 relationship between \(\text{species}\) and \(\text{elevation}\) after controlling for the other predictors?
\[ \text{species}_i = \beta_0 + \beta_1 \text{area}_i + \underline{1} \text{elevation}_i + \beta_3 \text{nearest}_i + e_i \]
\(H_0 : \beta_2 = 1\)
We can do this in R by using the `offset()` function, which allows us to fix a coefficient at a specified value.
## full model
full_mod <- lm(Species ~ Area + Elevation + Nearest, gala)
## model with effect of `elevation` = 1
fixed_mod <- lm(Species ~ Area + offset(1 * Elevation) + Nearest, gala)
## use `anova('fixed', 'full')` to get the F-test results
anova(fixed_mod, full_mod)
## Analysis of Variance Table
##
## Model 1: Species ~ Area + offset(1 * Elevation) + Nearest
## Model 2: Species ~ Area + Elevation + Nearest
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 27 1679731
## 2 26 169918 1 1509813 231.02 1.891e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The \(F\)-statistic is very large and the \(p\)-value is tiny so we reject \(H_0\).
We can also modify our \(t\)-test from before and use it for this comparison by subtracting the hypothesized value \(\beta_{H_0}\) from the estimate
\[ t_i = \frac{\hat{\beta}_i - \beta_{H_0}}{\text{SE} \left( \hat{\beta}_i \right)} \]
We can get \(\hat{\beta}_{\text{elevation}}\) and \(\text{SE}(\hat{\beta}_{\text{elevation}})\) from our model summary.
## full model
sumary(full_mod)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.464711 23.388841 0.7040 0.487718
## Area 0.019085 0.026764 0.7131 0.482158
## Elevation 0.171336 0.054519 3.1427 0.004151
## Nearest 0.071227 1.064806 0.0669 0.947179
##
## n = 30, p = 4, Residual SE = 80.84116, R-Squared = 0.55
This leaves us with \(\hat{\beta}_{\text{elevation}} \approx 0.17\) and \(\text{SE}(\hat{\beta}_{\text{elevation}}) \approx 0.055\).
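Rather than copying these numbers over by hand, we could extract them directly from the coefficient table (a sketch; `coef()` applied to a model summary returns the full table of estimates and standard errors):
## extract the estimate and SE for `Elevation`
coef(summary(full_mod))["Elevation", c("Estimate", "Std. Error")]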
We can calculate the \(t\)-test by hand and compare the results to the above \(F\)-test.
## t statistic
(t_value <- (0.171336 - 1) / 0.054519)
## two-sided p-value = 2 * Pr(T <= t_value); `pt()` is the CDF for a t-dist
(p_value <- 2 * pt(t_value, 26))
## verify t^2 = F
all.equal(t_value^2, anova(fixed_mod, full_mod)$F[2], tolerance = 0.0001)
## [1] -15.19955
## [1] 1.890841e-14
## [1] TRUE
The \(p\)-value is the same and \(t^2 = F\), as before.
We can also use confidence intervals (CI’s) to express uncertainty in \(\hat{\beta}_i\), which take the form
\[ 100(1 - \alpha)\% ~ \text{CI}: \hat{\beta}_{i} \pm t_{n-p}^{(\alpha / 2)} \operatorname{SE}(\hat{\beta}_i) \]
where here \(\alpha\) is our predetermined Type-I error rate. To do so, we need the information from the model summary.
## t-statistics from our full model
sumary(full_mod)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.464711 23.388841 0.7040 0.487718
## Area 0.019085 0.026764 0.7131 0.482158
## Elevation 0.171336 0.054519 3.1427 0.004151
## Nearest 0.071227 1.064806 0.0669 0.947179
##
## n = 30, p = 4, Residual SE = 80.84116, R-Squared = 0.55
We can construct a 95% CI on \(\hat{\beta}_{\text{area}}\) using the 2.5% and 97.5% percentiles of the \(t\)-distribution with \(df = 30 - 4 = 26\).
## critical value for the t-dist
## `qt()` is the quantile function for the t-dist; `p` is the (1-alpha/2) value
t_crit <- qt(p = 0.975, df = 30-4)
## 95% CI
CI95_beta <- 0.019085 + c(-1,1) * t_crit * 0.026764
round(CI95_beta, 3)
## [1] -0.036 0.074
We can quickly get the 95% CI for all the \(\beta_i\) using the function `confint()`.
## all of the 95% CI's
confint(full_mod)
## 2.5 % 97.5 %
## (Intercept) -31.61174063 64.54116288
## Area -0.03593020 0.07409948
## Elevation 0.05927051 0.28340204
## Nearest -2.11751249 2.25996697
The above \(F\)- and \(t\)-based CI’s depend on the assumption of normality. Recall that the bootstrap method provides a way to construct CI’s without this assumption.
Step 1: Fit your model to the data
## we already fit `full_mod` above
Step 2: Calculate \(\mathbf{e} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}\)
## residuals from our full model
resids <- residuals(full_mod)
Step 3a: Generate \(\mathbf{e}^*\) by sampling with replacement from \(\mathbf{e}\)
Step 3b: Calculate \(\mathbf{y}^* = \mathbf{X} \hat{\boldsymbol{\beta}} + \mathbf{e}^*\)
Step 3c: Estimate \(\hat{\boldsymbol{\beta}}^*\) from \(\mathbf{X}\) & \(\mathbf{y}^*\)
## number of bootstrap samples
nb <- 1000
## empty matrix for beta estimates
beta_est <- matrix(NA, nb, 4)
## fitted values from our full model = X*beta
Xbeta <- fitted(full_mod)
## sample many times
for(i in 1:nb) {
## 3a: sample w/ replacement from e
e_star <- sample(resids, replace = TRUE)
## 3b: calculate y_star
y_star <- Xbeta + e_star
## 3c: re-estimate beta_star from X & y_star
beta_star <- update(full_mod, y_star ~ .)
## save estimated betas
beta_est[i,] <- coef(beta_star)
}
Step 4: Select the \(\tfrac{\alpha}{2}\) and \((1 - \tfrac{\alpha}{2})\) percentiles from the saved \(\hat{\boldsymbol{\beta}}^*\)
## extract 2.5% and 97.5% values
CI95 <- apply(beta_est, 2, quantile, c(0.025, 0.975))
colnames(CI95) <- c("Intercept", colnames(gala[,3:5]))
t(round(CI95, 3))
## 2.5% 97.5%
## Intercept -23.222 62.445
## Area -0.034 0.077
## Elevation 0.091 0.300
## Nearest -1.796 2.165
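These percentile intervals are similar to, though not identical to, the normal-theory intervals from `confint()` above. To see where they come from, we could also plot the bootstrapped estimates for one of the coefficients, say `Elevation` (a sketch):
## distribution of the bootstrapped estimates for the effect of `Elevation`
hist(beta_est[, 3], breaks = 30, main = "",
     xlab = expression(paste("Bootstrapped ", hat(beta)[elevation])))
## add the 2.5% and 97.5% limits
abline(v = CI95[, 3], lty = 2)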
Given a fitted model \(\mathbf{y} = \mathbf{X} \hat{\boldsymbol{\beta}} + \mathbf{e}\), we might want to know the uncertainty around a new estimate \(\mathbf{y}^*\) given some new predictor \(\mathbf{X}^*\). Recall that this CI is given by
\[ \hat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{ {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]
Think back to our simple model for plant diversity as a function of island area, such that
\[ \text{species}_i = \beta_0 + \beta_1 \text{area}_i + e_i \]
Suppose we wanted to estimate the uncertainty in the average diversity for an island with an area of 2000 km². Recall that our CI on the mean response is given by
\[ \hat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{ {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]
and here \(\mathbf{X}^* = [1 ~~ 2000]^{\top}\) (i.e., a \(2 \times 1\) column vector).
We can calculate this by hand using matrix operations in R.
## matrix of predictors
XX <- model.matrix(simple_model)
## new X; vector for now
X_star <- c(Intercept = 1, Area = 2000)
## inside sqrt
inner_X <- t(X_star) %*% solve(t(XX) %*% XX) %*% X_star
## critical t-value
t_crit <- qt(0.975, df = nn-2)
## estimated SD
sigma <- summary(simple_model)$sigma
## predicted y
y_star <- sum(X_star * coef(simple_model))
## 95% CI
y_star + c(-1,1) * t_crit * sigma * sqrt(inner_X)
## [1] 149.5818 305.8366
It’s much easier to use the `predict()` function to give us a CI.
predict(simple_model, newdata = data.frame(t(X_star)),
level = 0.95, interval = "confidence")
## fit lwr upr
## 1 227.7092 149.5818 305.8366
The estimated lower and upper bounds are the same as those we obtained by hand.
We saw in lecture that we can also estimate the uncertainty for a specific prediction rather than the average response. Let’s now imagine we wanted to predict how many plant species we might find on a hypothetical island that might one day emerge from an underwater volcano and eventually grow to an area of 2000 km². Recall that a prediction interval (PI) is given by
\[ \hat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{1 + {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]
To calculate the PI by hand simply requires a different calculation for `inner_X` above.
## new X_star
X_star <- c(Intercept = 1, Area = 2000)
## inside sqrt
inner_X <- 1 + t(X_star) %*% solve(t(XX) %*% XX) %*% X_star
## 95% PI
y_star + c(-1,1) * t_crit * sigma * sqrt(inner_X)
## [1] 24.21065 431.20776
Notice how much wider this interval is than the CI we estimated above. We could instead use `predict()` as we did above for the CI by switching the `interval` argument to `"prediction"`.
predict(simple_model, newdata = data.frame(t(X_star)),
level = 0.95, interval = "prediction")
## fit lwr upr
## 1 227.7092 24.21065 431.2078
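To visualize the difference between the two types of intervals, we could compute both over a grid of island areas and add them to our earlier scatter plot (a sketch):
## grid of new areas spanning the observed range
new_area <- data.frame(Area = seq(0, max(gala$Area), length.out = 100))
## 95% confidence and prediction intervals along the grid
ci_band <- predict(simple_model, newdata = new_area, interval = "confidence")
pi_band <- predict(simple_model, newdata = new_area, interval = "prediction")
## plot the data, the fitted line, and both sets of intervals
plot(gala$Area, gala$Species, pch = 16, col = "dark green",
     ylab = "Number of plant species",
     xlab = expression(paste("Area of island (", km^2, ")")))
lines(new_area$Area, ci_band[, "fit"])
lines(new_area$Area, ci_band[, "lwr"], lty = 2)
lines(new_area$Area, ci_band[, "upr"], lty = 2)
lines(new_area$Area, pi_band[, "lwr"], lty = 3)
lines(new_area$Area, pi_band[, "upr"], lty = 3)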