- Understand the use of F-tests for hypothesis testing
- Understand how to estimate confidence intervals
10 April 2026
We’ve seen how to estimate model parameters and their variances; now we might want to draw conclusions from our analysis
We might also want to try several forms of models and ask which of them best describes the data
Imagine we had 2 linear models of varying complexity:

1. a model with one predictor
2. a model with two predictors

It would seem logical to ask whether the additional complexity of (2) is necessary
Recall our partitioning of sums-of-squares, where
\[ SSTO = SSR + SSE \]
We might prefer the more complex model (call it \(C\)) over the simple model (call it \(S\)) if
\[ SSE_{C} < SSE_{S} \]
or, more formally, if
\[ \frac{SSE_{S} - SSE_{C}}{SSE_{C}} > \text{constant} \]
If \(C\) has \(k\) parameters and \(S\) has \(j\), we can scale this ratio to arrive at an \(F\)-statistic that follows an \(F\) distribution
\[ F = \frac{ \left( SSE_{S} - SSE_{C} \right) / (k - j)}{ SSE_{C} / (n - k)} \sim F_{(k - j),(n - k)} \]
The \(F\)-distribution arises as the ratio of two independent \(\chi^2\) random variates, each scaled by its degrees of freedom
If \(A \sim \chi_{df_{A}}^{2}\) and \(B \sim \chi_{df_{B}}^{2}\) are independent, then
\[ \frac{\left( \frac{A}{df_{A}} \right) }{ \left( \frac{B}{df_{B}} \right) } \sim F _{df_{A},df_{B}} \]
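As a quick sanity check (a simulation sketch, not part of the lecture; the degrees of freedom are arbitrary), we can verify this relationship in R by simulating scaled \(\chi^2\) variates and comparing their ratio to the theoretical \(F\) quantile:

```r
set.seed(123)

## number of simulated draws
n_sim <- 100000

## degrees of freedom for the two chi-squared variates
df_A <- 3
df_B <- 20

## independent chi-squared variates, each scaled by its df
A <- rchisq(n_sim, df = df_A) / df_A
B <- rchisq(n_sim, df = df_B) / df_B

## their ratio should follow an F distribution
ratio <- A / B

## empirical 95th percentile vs the theoretical F quantile
quantile(ratio, 0.95)
qf(0.95, df1 = df_A, df2 = df_B)
```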
| Test | \(H_0\) | Ex full model | Ex reduced model |
|---|---|---|---|
| All predictors | \(\beta_1 = \cdots = \beta_p = 0\) | \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i}\) | \(y_i = \beta_0\) |
| One predictor | \(\beta_i = 0 ~\vert~ \text{other predictors}\) | \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i}\) | \(y_i = \beta_0 + \beta_1 x_{1,i}\) |
| Equality of two predictors | \(\beta_i = \beta_j\) | \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i}\) | \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_1 x_{2,i}\) |
Note that we can only test nested models
For example, we can’t test
\(y_i = \beta_0 + \beta_1 x_{1,i}\) versus \(y_i = \beta_0 + \beta_2 x_{2,i}\)
We will address model selection more fully later in the quarter
Suppose we wanted to test whether the collection of predictors in a model were better than simply estimating the data by their mean
\[ C: \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ S: \mathbf{y} = \boldsymbol{\mu} + \mathbf{e} \\ \]
We would write the null hypothesis as
\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \]
and we would reject \(H_0\) if \(F > F^{(\alpha)}_{k - j, n - k}\)
Our \(F\) statistic is given by
\[ F = \frac{ \left( SSE_{S} - SSE_{C} \right) / (k - j)}{ SSE_{C} / (n - k)} \sim F_{(k - j),(n - k)} \]
and the sum-of-squares terms are given by
\[ SSE_{S} = \mathbf{e}^{\top} \mathbf{e} = \left( \mathbf{y} - \bar{y} \right)^{\top} \left( \mathbf{y} - \bar{y} \right) \\ ~ \\ SSE_{C} = \mathbf{e}^{\top} \mathbf{e} = \left( \mathbf{y} - \mathbf{X} \widehat{\boldsymbol{\beta}} \right)^{\top} \left( \mathbf{y} - \mathbf{X} \widehat{\boldsymbol{\beta}} \right) \]
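As a sketch of how this works in practice (simulated data; the variable names are illustrative), we can compute the \(F\) statistic by hand from these sums of squares and check it against the overall F-test that R reports:

```r
set.seed(42)

## simulate data for a model with two predictors
n <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

## complex model C (k = 3 parameters) vs simple model S (j = 1; the mean)
fit_C <- lm(y ~ x1 + x2)
SSE_C <- sum(residuals(fit_C)^2)
SSE_S <- sum((y - mean(y))^2)

## F statistic from the partitioned sums of squares
k <- 3
j <- 1
F_stat <- ((SSE_S - SSE_C) / (k - j)) / (SSE_C / (n - k))

## this matches the overall F-test reported by summary(fit_C)
unname(summary(fit_C)$fstatistic["value"])
```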
Later in lab we will work with the gala dataset\(\dagger\) in the faraway package, which contains data on the diversity of plant species across 30 Galapagos islands
For now, let’s consider some hypotheses we might test with these data
\(\dagger\)From Johnson & Raven (1973) Science 179:893-895
We might ask whether any one predictor could be dropped from a model
For example, can \(\text{nearest}\) be dropped from our full model?
\[ \text{species}_i = \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \]
One option is to fit these two models and compare them via our \(F\)-test
\[ H_0: \beta_3 = 0 \\ ~ \\ \begin{aligned} \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \\ ~ \\ \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \epsilon_i \end{aligned} \]
Another option is to estimate a \(t\)-statistic as
\[ t_i = \frac{\widehat{\beta}_i}{\text{SE} \left( \widehat{\beta}_i \right)} \]
and compare it to a \(t\)-distribution with \(n - k\) degrees of freedom
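The two approaches agree: for a single predictor, the squared \(t\)-statistic equals the \(F\) statistic from the nested-model comparison. A sketch with simulated data (`anova()` carries out the nested F-test):

```r
set.seed(1)

## simulate data where x2 has no effect
n <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)

## full and reduced models for H0: beta_2 = 0
full <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)

## F statistic from the nested-model comparison
F_stat <- anova(reduced, full)$F[2]

## t-statistic for x2 from the full model
t_stat <- summary(full)$coefficients["x2", "t value"]

## t^2 equals F
c(t_stat^2, F_stat)
```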
We might also want to know whether we can drop two or more predictors from a model
For example, can we drop both \(\text{elevation}\) and \(\text{nearest}\) from our full model?
\[ H_0 : \beta_2 = \beta_3 = 0 \\ ~ \\ \begin{aligned} \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \\ ~ \\ \text{species}_i &= \alpha + \beta_1 \text{area}_i + \epsilon_i \end{aligned} \]
Some tests cannot be expressed in terms of the inclusion or exclusion of predictors
Consider a test of whether the areas of the current and adjacent island could be added together and used in place of the two separate predictors
\[ H_0 : \beta_{\text{area}} = \beta_{\text{adjacent}} \\ ~ \\ \text{species}_i = \alpha + \beta_1 \text{area}_i + \beta_2 \text{adjacent}_i + \dots + \epsilon_i \\ ~ \\ \text{species}_i = \alpha + \beta_1 \text{(area + adjacent)}_i + \dots + \epsilon_i \]
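In R this amounts to comparing the full model against a reduced model that uses the summed predictor. A sketch with simulated data standing in for the gala variables (the names and coefficients are illustrative):

```r
set.seed(7)

## simulated stand-ins for the gala variables
n <- 30
area <- rexp(n)
adjacent <- rexp(n)
species <- 5 + 2 * area + 2 * adjacent + rnorm(n)

## full model with separate slopes
full <- lm(species ~ area + adjacent)

## reduced model with a common slope on the summed predictor
reduced <- lm(species ~ I(area + adjacent))

## F-test of H0: beta_area = beta_adjacent
anova(reduced, full)
```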
What if we wanted to test whether a predictor had a specific (non-zero) value?
For example, is there a 1:1 relationship between \(\text{species}\) and \(\text{elevation}\) after controlling for the other predictors?
\[ H_0 : \beta_2 = 1 \\ ~ \\ \text{species}_i = \alpha + \beta_1 \text{area}_i + \underline{1} \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \]
We can also modify our \(t\)-test from before and use it for our comparison by including the hypothesized \(\beta_{H_0}\) as an offset
\[ t_i = \frac{\widehat{\beta}_i - \beta_{H_0}}{\text{SE} \left( \widehat{\beta}_i \right)} \]
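A sketch of this test by hand in R (simulated data; the true slope on `x2` is set to 1, matching the null):

```r
set.seed(11)

## simulate data with a true slope of 1 on x2
n <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + 1 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

## t-statistic for H0: beta_2 = 1
beta_hat <- coef(fit)["x2"]
se_hat <- summary(fit)$coefficients["x2", "Std. Error"]
t_stat <- (beta_hat - 1) / se_hat

## two-sided p-value with n - k degrees of freedom
k <- length(coef(fit))
p_val <- 2 * pt(abs(t_stat), df = n - k, lower.tail = FALSE)
```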
Null hypothesis testing (NHT) is a slippery slope, however, as it reduces our inference to a binary reject/fail-to-reject decision
We can also use confidence intervals (CI’s) to express uncertainty in \(\widehat{\beta}_i\)
They take the form
\[ 100(1 - \alpha)\% ~ \text{CI}: \widehat{\beta}_{i} \pm t_{n-k}^{(\alpha / 2)} \operatorname{SE}(\widehat{\beta}_i) \]
where \(\alpha\) is our predetermined Type-I error rate
The \(t\)-distribution is symmetrical like the normal distribution, but its heavier tails make it more conservative at smaller sample sizes, where \(\sigma\) must be estimated from the data
As \(n\) approaches \(\infty\) the \(t\)-distribution approaches the normal distribution
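R’s `confint()` computes exactly this t-based interval; a sketch with simulated data checking it against the formula by hand:

```r
set.seed(5)

## simulate simple linear regression data
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

## 95% CI from the formula: beta_hat +/- t * SE
beta_hat <- coef(fit)["x"]
se_hat <- summary(fit)$coefficients["x", "Std. Error"]
k <- length(coef(fit))
t_crit <- qt(0.975, df = n - k)
ci_manual <- c(beta_hat - t_crit * se_hat, beta_hat + t_crit * se_hat)

## matches R's built-in confint()
ci_builtin <- confint(fit, "x", level = 0.95)
```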
The \(F\)- and \(t\)-based CI’s we have described depend on the assumption of normality
The bootstrap\(\dagger\) method provides a way to construct CI’s without this assumption
\(\dagger\)Efron (1979) The Annals of Statistics 7:1–26
1. Fit your model to the data
2. Calculate \(\mathbf{e} = \mathbf{y} - \mathbf{X} \widehat{\boldsymbol{\beta}}\)
3. Do the following many times:
    - Generate \(\mathbf{e}^*\) by sampling with replacement from \(\mathbf{e}\)
    - Calculate \(\mathbf{y}^* = \mathbf{X} \widehat{\boldsymbol{\beta}} + \mathbf{e}^*\)
    - Estimate \(\widehat{\boldsymbol{\beta}}^*\) from \(\mathbf{X}\) & \(\mathbf{y}^*\)
4. Select the \(\tfrac{\alpha}{2}\) and \((1 - \tfrac{\alpha}{2})\) percentiles from the saved \(\widehat{\boldsymbol{\beta}}^*\)
```r
## number of bootstrap samples
n_boot <- 1000

## empty vector for results
betas <- rep(NA, n_boot)

## fit initial model: y = alpha + beta*x + e
mod_fit <- lm(y ~ x)

## get residuals and parameter estimates from the initial fit
e <- residuals(mod_fit)
alpha <- coef(mod_fit)[1]
beta <- coef(mod_fit)[2]

## iterate many times
for(i in 1:n_boot) {
  ## sample residuals with replacement
  e_star <- sample(x = e, size = length(e), replace = TRUE)
  ## calculate new y
  y_star <- alpha + beta * x + e_star
  ## refit model with new y and save the slope estimate
  betas[i] <- coef(lm(y_star ~ x))[2]
}

## pick out the 2.5th and 97.5th percentiles for a 95% CI
CI_95 <- quantile(betas, probs = c(0.025, 0.975))
```
Given a fitted model \(\mathbf{y} = \mathbf{X} \widehat{\boldsymbol{\beta}} + \mathbf{e}\), we might want to know the uncertainty around a new estimate \(\mathbf{y}^*\) given some new predictor \(\mathbf{X}^*\)
Suppose we wanted to estimate the uncertainty in the average response given by
\[ \widehat{\mathbf{y}}^* = \mathbf{X}^* \widehat{\boldsymbol{\beta}} \]
Recall that the general formula for a CI on a quantity \(z\) is
\[ 100(1 - \alpha)\% ~ \text{CI}: \text{E}(z) ~ \pm ~ t^{(\alpha / 2)}_{df}\text{SD}(z) \]
So we would have
\[ \widehat{\mathbf{y}}^* ~ \pm ~ t^{(\alpha / 2)}_{df} \sqrt{\text{Var} \left( \widehat{\mathbf{y}}^* \right)} \]
We can calculate the SD of our expectation as
\[ \begin{aligned} \text{Var} \left( \widehat{\mathbf{y}}^* \right) &= \text{Var} \left( \mathbf{X}^* \widehat{\boldsymbol{\beta}} \right) \\ &= {\mathbf{X}^*}^{\top} \text{Var}\left( \widehat{\boldsymbol{\beta}} \right) \mathbf{X}^* \\ &= {\mathbf{X}^*}^{\top} \left[ \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \right] \mathbf{X}^* \\ &\Downarrow \\ \text{SD} \left( \widehat{\mathbf{y}}^* \right) &= \sigma \sqrt{ {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \end{aligned} \]
So our CI on the mean response is given by
\[ \widehat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{ {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]
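A sketch verifying this formula against R’s `predict()` with `interval = "confidence"` (simulated data; the new predictor value 0.5 is arbitrary):

```r
set.seed(3)

## simulate simple linear regression data
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

## design matrix and residual standard error
X <- model.matrix(fit)
sigma_hat <- summary(fit)$sigma

## new predictor vector (intercept and x = 0.5)
x_star <- c(1, 0.5)
y_star <- sum(x_star * coef(fit))

## SD of the mean response from the formula
sd_mean <- sigma_hat * sqrt(drop(t(x_star) %*% solve(t(X) %*% X) %*% x_star))

## 95% CI on the mean response
t_crit <- qt(0.975, df = n - 2)
ci_manual <- c(y_star - t_crit * sd_mean, y_star + t_crit * sd_mean)

## matches predict() with interval = "confidence"
ci_builtin <- predict(fit, newdata = data.frame(x = 0.5),
                      interval = "confidence", level = 0.95)
```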
What about the uncertainty in a specific prediction?
In that case we need to account for our additional uncertainty owing to the error in our relationship, which is given by
\[ \widehat{\mathbf{y}}^* = \mathbf{X}^* \widehat{\boldsymbol{\beta}} + \mathbf{e} \]
The SD of the new prediction is given by
\[ \begin{aligned} \text{Var} \left( \widehat{\mathbf{y}}^* \right) &= {\mathbf{X}^*}^{\top} \text{Var}\left( \widehat{\boldsymbol{\beta}} \right) \mathbf{X}^* + \text{Var} \left( \mathbf{e} \right) \\ &= {\mathbf{X}^*}^{\top} \left[ \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \right] \mathbf{X}^* + \sigma^2\\ &= \sigma^2 \left( {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* + 1 \right) \\ &\Downarrow \\ \text{SD} \left( \widehat{\mathbf{y}}^* \right) &= \sigma \sqrt{1 + {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \end{aligned} \]
So our CI on the new prediction is given by
\[ \widehat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{1 + {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]
This is typically referred to as the prediction interval
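The same check works for the prediction interval, where the extra \(+1\) inside the square root widens the interval; `predict()` computes it with `interval = "prediction"` (simulated data; the new predictor value 0.5 is arbitrary):

```r
set.seed(3)

## simulate simple linear regression data
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

## design matrix and residual standard error
X <- model.matrix(fit)
sigma_hat <- summary(fit)$sigma

## new predictor vector (intercept and x = 0.5)
x_star <- c(1, 0.5)
y_star <- sum(x_star * coef(fit))

## SD of a new prediction includes the extra "+ 1" error term
sd_pred <- sigma_hat * sqrt(1 + drop(t(x_star) %*% solve(t(X) %*% X) %*% x_star))

## 95% prediction interval
t_crit <- qt(0.975, df = n - 2)
pi_manual <- c(y_star - t_crit * sd_pred, y_star + t_crit * sd_pred)

## matches predict() with interval = "prediction"
pi_builtin <- predict(fit, newdata = data.frame(x = 0.5),
                      interval = "prediction", level = 0.95)
```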