10 April 2026

Goals for today

  • Understand the use of F-tests for hypothesis testing
  • Understand how to estimate confidence intervals

Inferential methods

Once we’ve estimated the model parameters and their variance, we might want to draw conclusions from our analysis

Comparing models

Imagine we had 2 linear models of varying complexity:

  1. a model with one predictor

  2. a model with five predictors

It would seem logical to ask whether the added complexity of (2) is necessary.

Hypothesis test to compare models

Recall our partitioning of sums-of-squares, where

\[ SSTO = SSR + SSE \]

We might prefer the more complex model (call it \(\Theta\)) over the simple model (call it \(\theta\)) if

\[ SSE_{\Theta} < SSE_{\theta} \]

or, more formally, if

\[ \frac{SSE_{\theta} - SSE_{\Theta}}{SSE_{\Theta}} > \text{a constant} \]

Hypothesis test to compare models

If \(\Theta\) has \(k_{\Theta}\) parameters and \(\theta\) has \(k_{\theta}\), we can scale this ratio to arrive at an \(F\)-statistic that follows an \(F\) distribution

\[ F = \frac{ \left( SSE_{\theta} - SSE_{\Theta} \right) / (k_{\Theta} - k_{\theta})}{ SSE_{\Theta} / (n - k_{\Theta})} \sim F_{k_{\Theta} - k_{\theta}, n - k_{\Theta}} \]
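This comparison is straightforward to compute directly. A minimal sketch in Python with simulated stand-in data (the lab itself uses R's faraway package; `sse` and `f_test` are illustrative helpers, not library routines):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated stand-in data: intercept plus three predictors,
# with the third slope truly zero
n = 30
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([5.0, 2.0, -1.0, 0.0])
y = X_full @ beta_true + rng.normal(size=n)

def sse(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return resid @ resid

def f_test(X_small, X_big, y):
    """F-statistic and p-value for comparing two nested models."""
    k_small, k_big = X_small.shape[1], X_big.shape[1]
    sse_small, sse_big = sse(X_small, y), sse(X_big, y)
    F = ((sse_small - sse_big) / (k_big - k_small)) / (sse_big / (len(y) - k_big))
    return F, stats.f.sf(F, k_big - k_small, len(y) - k_big)

# Does dropping the last predictor cost us anything?
F, p = f_test(X_full[:, :3], X_full, y)
```

Note that the bigger model can never have a larger SSE than a model nested within it, so \(F \geq 0\) by construction.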

\(F\)-distribution

The \(F\)-distribution arises as the ratio of two independent \(\chi^2\) random variates, each scaled by its degrees of freedom

If \(A \sim \chi_{df_{A}}^{2}\) and \(B \sim \chi_{df_{B}}^{2}\) are independent, then

\[ \frac{\left( \frac{A}{df_{A}} \right) }{ \left( \frac{B}{df_{B}} \right) } \sim F _{df_{A},df_{B}} \]
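This relationship is easy to verify by simulation; a short Python sketch (the degrees of freedom here are chosen arbitrarily):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
df_A, df_B = 3, 20

# Draw independent chi-squared variates, scale each by its df,
# and form their ratio
A = rng.chisquare(df_A, size=100_000)
B = rng.chisquare(df_B, size=100_000)
ratio = (A / df_A) / (B / df_B)

# The theoretical F(df_A, df_B) mean is df_B / (df_B - 2) for df_B > 2
theoretical_mean = df_B / (df_B - 2)
```

The empirical mean and upper quantiles of `ratio` should match those of \(F_{df_A, df_B}\) closely.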

Test of all predictors in a model

Suppose we wanted to test whether the collection of predictors in a model does better than simply estimating the data by their mean.

\[ \Theta: \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ \theta: \mathbf{y} = \boldsymbol{\mu} + \mathbf{e} \\ \]

We write the null hypothesis as

\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \]

and we would reject \(H_0\) if \(F > F^{(\alpha)}_{k_{\Theta} - k_{\theta}, n - k_{\Theta}}\)

Hypothesis test to compare models

\[ SSE_{\Theta} = \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\top} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right) = \mathbf{e}^{\top} \mathbf{e} = SSE \\ SSE_{\theta} = \left( \mathbf{y} - \bar{y} \right)^{\top} \left( \mathbf{y} - \bar{y} \right) = SSTO \\ \Downarrow \\ F = \frac{ \left( SSTO - SSE \right) / (k - 1) } { SSE / (n - k)} \]
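Putting the pieces together, the test of all predictors can be sketched in Python with simulated data (numbers below are illustrative, not from the gala data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, k = 30, 4                     # k parameters, including the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([3.0, 1.5, -2.0, 0.5]) + rng.normal(size=n)

# SSE from the full fit; SSTO is the SSE of the mean-only model
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
SSE = np.sum((y - X @ beta_hat) ** 2)
SSTO = np.sum((y - y.mean()) ** 2)

F = ((SSTO - SSE) / (k - 1)) / (SSE / (n - k))
p = stats.f.sf(F, k - 1, n - k)
```

Because the simulated slopes are large relative to the noise, this overall test should reject \(H_0\) decisively.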

Predictors of plant diversity

Later in lab we will work with the gala dataset\(\dagger\) in the faraway package, which contains data on the diversity of plant species across 30 Galapagos islands

For now let’s hypothesize that

diversity = \(f\)(area, elevation, distance to nearest island)



\(\dagger\)From Johnson & Raven (1973) Science 179:893–895

Testing one predictor

We might ask whether any one predictor could be dropped from a model

For example, can \(\text{nearest}\) be dropped from our full model?

\[ \text{species}_i = \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \]

Testing one predictor

One option is to fit these two models and compare them via our \(F\)-test with \(H_0: \beta_3 = 0\)


\[ \begin{aligned} \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \\ ~ \\ \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \epsilon_i \end{aligned} \]

Testing one predictor

Another option is to estimate a \(t\)-statistic as

\[ t_i = \frac{\widehat{\beta}_i}{\text{SE} \left( \widehat{\beta}_i \right)} \]

and compare it to a \(t\)-distribution with \(n - k\) degrees of freedom
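The two options agree exactly: when a single predictor is dropped, \(t_i^2\) equals the \(F\)-statistic from the model comparison, and the two p-values coincide. A Python sketch with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
k = X.shape[1]
y = X @ np.array([2.0, 1.0, -0.5, 0.3]) + rng.normal(size=n)

# OLS estimates and their standard errors
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

# Option 2: t-statistic for the last predictor
t3 = beta_hat[3] / se[3]
p_t = 2 * stats.t.sf(abs(t3), n - k)

# Option 1: F-test from refitting without that predictor
b_red, *_ = np.linalg.lstsq(X[:, :3], y, rcond=None)
sse_red = np.sum((y - X[:, :3] @ b_red) ** 2)
sse_full = resid @ resid
F = (sse_red - sse_full) / (sse_full / (n - k))
p_F = stats.f.sf(F, 1, n - k)
```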

Testing 2+ predictors

Sometimes we might want to know whether we can drop 2+ predictors from a model

For example, can we drop both \(\text{elevation}\) and \(\text{nearest}\) from our full model?

\[ \begin{aligned} \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \\ ~ \\ \text{species}_i &= \alpha + \beta_1 \text{area}_i + \epsilon_i \end{aligned} \]

\(H_0 : \beta_2 = \beta_3 = 0\)

Testing a subspace

Some tests cannot be expressed in terms of the inclusion or exclusion of predictors

Consider a test of whether the areas of the current and adjacent island could be added together and used in place of the two separate predictors

\[ \text{species}_i = \alpha + \beta_1 \text{area}_i + \beta_2 \text{adjacent}_i + \dots + \epsilon_i \\ ~ \\ \text{species}_i = \alpha + \beta_1 \text{(area + adjacent)}_i + \dots + \epsilon_i \]

\(H_0 : \beta_{\text{area}} = \beta_{\text{adjacent}}\)
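This subspace test still reduces to an \(F\)-test between nested models: the reduced design simply replaces the two columns with their sum. A Python sketch with simulated data whose slopes are equal by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 30
area = rng.exponential(scale=50, size=n)
adjacent = rng.exponential(scale=50, size=n)
y = 10 + 0.05 * area + 0.05 * adjacent + rng.normal(size=n)  # equal slopes

def sse(X, y):
    """Residual sum of squares from an OLS fit."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r

# Full model: separate slopes; reduced model: one slope on the summed column
X_full = np.column_stack([np.ones(n), area, adjacent])
X_red = np.column_stack([np.ones(n), area + adjacent])

F = (sse(X_red, y) - sse(X_full, y)) / (sse(X_full, y) / (n - 3))
p = stats.f.sf(F, 1, n - 3)
```

A large \(p\) here would mean the single summed predictor is an adequate substitute for the two separate ones.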

Testing a subspace

What if we wanted to test whether a predictor had a specific (non-zero) value?

For example, is there a 1:1 relationship between \(\text{species}\) and \(\text{elevation}\) after controlling for the other predictors?

\[ \text{species}_i = \alpha + \beta_1 \text{area}_i + \underline{1} \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \]

\(H_0 : \beta_2 = 1\)

Testing a subspace

We can also modify our \(t\)-test from before and use it for our comparison by including the hypothesized \(\beta_{H_0}\) as an offset

\[ t_i = \frac{\widehat{\beta}_i - \beta_{H_0}}{\text{SE} \left( \widehat{\beta}_i \right)} \]
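A sketch of this offset \(t\)-test in Python, simulating data whose true slope is 1 so that \(H_0\) holds:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.0]) + rng.normal(scale=0.5, size=n)  # true slope is 1

# OLS fit and standard errors
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
se = np.sqrt(resid @ resid / (n - 2) * np.diag(XtX_inv))

# Offset t-test of H0: beta_1 = 1
beta_H0 = 1.0
t = (beta_hat[1] - beta_H0) / se[1]
p = 2 * stats.t.sf(abs(t), n - 2)
```

Setting `beta_H0 = 0` recovers the ordinary \(t\)-test from before.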

Caveats about hypothesis testing

Null hypothesis testing (NHT) is a slippery slope

  • \(p\)-values are simply the probability of obtaining a test statistic at least as extreme as that observed, assuming the null hypothesis is true
  • \(p\)-values are not weights of evidence
  • “Critical” or “threshold” values against which to compare \(p\)-values must be chosen a priori
  • Be aware of “\(p\) hacking”, where researchers run many tests in search of significance

QUESTIONS?

Confidence intervals for \(\beta\)

We can also use confidence intervals (CI’s) to express uncertainty in \(\widehat{\beta}_i\)

They take the form

\[ 100(1 - \alpha)\% ~ \text{CI}: \widehat{\beta}_{i} \pm t_{n-k}^{(\alpha / 2)} \operatorname{SE}(\widehat{\beta}_{i}) \]

where here \(\alpha\) is our predetermined Type-I error rate
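A sketch of these coefficient CIs in Python with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
k = X.shape[1]
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# OLS estimates and standard errors
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
se = np.sqrt(resid @ resid / (n - k) * np.diag(XtX_inv))

# 95% CI for each coefficient
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
```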

Bootstrap confidence intervals

The \(F\)- and \(t\)-based CI’s we have described depend on the assumption of normality

The bootstrap\(\dagger\) method provides a way to construct CI’s without this assumption



\(\dagger\)Efron (1979) The Annals of Statistics 7:1–26

Bootstrap procedure

  1. Fit your model to the data

  2. Calculate \(\mathbf{e} = \mathbf{y} - \mathbf{X} \widehat{\boldsymbol{\beta}}\)

  3. Do the following many times:

    • Generate \(\mathbf{e}^*\) by sampling with replacement from \(\mathbf{e}\)
    • Calculate \(\mathbf{y}^*\) = \(\mathbf{X} \widehat{\boldsymbol{\beta}} + \mathbf{e}^*\)
    • Estimate \(\widehat{\boldsymbol{\beta}}^*\) from \(\mathbf{X}\) & \(\mathbf{y}^*\) and save it
  4. Select the \(\tfrac{\alpha}{2}\) and \((1 - \tfrac{\alpha}{2})\) percentiles from the saved \(\widehat{\boldsymbol{\beta}}^*\)
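The four steps above can be sketched in Python with simulated data:

```python
import numpy as np

rng = np.random.default_rng(13)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# Steps 1-2: fit the model and save the residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat
fitted = X @ beta_hat

# Step 3: resample residuals with replacement, rebuild y*, refit, repeat
boot = np.empty((2000, 2))
for b in range(2000):
    e_star = rng.choice(e, size=n, replace=True)
    y_star = fitted + e_star
    boot[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)

# Step 4: percentile interval for the slope
alpha = 0.05
lower, upper = np.quantile(boot[:, 1], [alpha / 2, 1 - alpha / 2])
```

Note that nothing here requires the residuals to be normal, which is the whole point of the bootstrap.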

Confidence interval for new predictions

Given a fitted model \(\mathbf{y} = \mathbf{X} \widehat{\boldsymbol{\beta}} + \mathbf{e}\), we might want to know the uncertainty around a new estimate \(\mathbf{y}^*\) given some new predictor \(\mathbf{X}^*\)

CI for the mean response

Suppose we wanted to estimate the uncertainty in the average response given by

\[ \widehat{\mathbf{y}}^* = \mathbf{X}^* \widehat{\boldsymbol{\beta}} \]

Recall that the general formula for a CI on a quantity \(z\) is

\[ 100(1 - \alpha)\% ~ \text{CI}: \text{E}(z) ~ \pm ~ t^{(\alpha / 2)}_{df}\text{SD}(z) \]

So we would have

\[ \widehat{\mathbf{y}}^* ~ \pm ~ t^{(\alpha / 2)}_{df} \sqrt{\text{Var} \left( \widehat{\mathbf{y}}^* \right)} \]

CI for the mean response

We can calculate the SD of our expectation as

\[ \begin{aligned} \text{Var} \left( \widehat{\mathbf{y}}^* \right) &= \text{Var} \left( \mathbf{X}^* \widehat{\boldsymbol{\beta}} \right) \\ &= {\mathbf{X}^*}^{\top} \text{Var}\left( \widehat{\boldsymbol{\beta}} \right) \mathbf{X}^* \\ &= {\mathbf{X}^*}^{\top} \left[ \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \right] \mathbf{X}^* \\ &\Downarrow \\ \text{SD} \left( \widehat{\mathbf{y}}^* \right) &= \sigma \sqrt{ {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \end{aligned} \]

CI for the mean response

So our CI on the mean response is given by

\[ \widehat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{ {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]
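A Python sketch of this mean-response CI, treating \(\mathbf{X}^*\) as a single new vector of predictor values (data simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(21)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
k = X.shape[1]
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

# OLS fit and estimate of sigma
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - k))

# New predictor values (intercept plus two predictors)
x_star = np.array([1.0, 0.3, -0.2])
y_star_hat = x_star @ beta_hat

# CI for the MEAN response at x_star
t_crit = stats.t.ppf(0.975, df=n - k)
sd_mean = sigma_hat * np.sqrt(x_star @ XtX_inv @ x_star)
ci = (y_star_hat - t_crit * sd_mean, y_star_hat + t_crit * sd_mean)
```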

CI for a specific response

What about the uncertainty in a specific prediction?

In that case we need to account for our additional uncertainty owing to the error in our relationship, which is given by

\[ \mathbf{y}^* = \mathbf{X}^* \widehat{\boldsymbol{\beta}} + \mathbf{e} \]

CI for a specific response

The SD of the new prediction is given by

\[ \begin{aligned} \text{Var} \left( \widehat{\mathbf{y}}^* \right) &= {\mathbf{X}^*}^{\top} \text{Var}\left( \widehat{\boldsymbol{\beta}} \right) \mathbf{X}^* + \text{Var} \left( \mathbf{e} \right) \\ &= {\mathbf{X}^*}^{\top} \left[ \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \right] \mathbf{X}^* + \sigma^2\\ &= \sigma^2 \left( {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* + 1 \right) \\ &\Downarrow \\ \text{SD} \left( \widehat{\mathbf{y}}^* \right) &= \sigma \sqrt{1 + {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \end{aligned} \]

CI for a specific response

So our CI on the new prediction is given by

\[ \widehat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{1 + {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]


This is typically referred to as the prediction interval
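A Python sketch of the prediction interval, with simulated data; the only change from the mean-response CI is the extra "+ 1" under the square root, which carries the new observation's own error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(22)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
k = X.shape[1]
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

# OLS fit and estimate of sigma
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - k))

# New predictor values and point prediction
x_star = np.array([1.0, 0.3, -0.2])
y_star_hat = x_star @ beta_hat
t_crit = stats.t.ppf(0.975, df=n - k)

# Mean-response SD vs. prediction SD: the "+ 1" is the new error's variance
sd_mean = sigma_hat * np.sqrt(x_star @ XtX_inv @ x_star)
sd_pred = sigma_hat * np.sqrt(1 + x_star @ XtX_inv @ x_star)

pi = (y_star_hat - t_crit * sd_pred, y_star_hat + t_crit * sd_pred)
```

The prediction interval is always wider than the corresponding CI for the mean response.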