10 April 2026

Goals for today

  • Understand the use of F-tests for hypothesis testing
  • Understand how to estimate confidence intervals

Our general approach

Question \(\rightarrow\) Data \(\rightarrow\) Model \(\rightarrow\) Inference \(\rightarrow\) Prediction

Inferential methods

We’ve seen how to estimate model parameters and their variances; now we might want to draw conclusions from our analysis

We might also want to try several forms of models and ask,

“Which of them is the ‘best’?”

Comparing models

Imagine we had 2 linear models of varying complexity:

  1. a model with one predictor

  2. a model with two predictors

It seems logical to ask whether the additional complexity of (2) is necessary

Hypothesis testing

Null hypothesis (\(H_0\))

  • A statement that the effect of interest does not exist
  • Can be exact (eg, \(H_0: \beta = 0\)) or inexact (eg, \(H_0: \beta \geq 1\))

Hypothesis testing

p-value

  • Probability of obtaining a test statistic as large as (or larger than) the one observed, assuming the null hypothesis is correct
  • Provides evidence against the null hypothesis
  • Not the probability that the null hypothesis is true
  • Not the probability that the results were produced by random chance alone

Hypothesis test to compare models

Recall our partitioning of sums-of-squares, where

\[ SSTO = SSR + SSE \]

We might prefer the more complex model (call it \(C\)) over the simple model (call it \(S\)) if

\[ SSE_{C} < SSE_{S} \]

or, more formally, if

\[ \frac{SSE_{S} - SSE_{C}}{SSE_{C}} > \text{constant} \]

Hypothesis test to compare models

If \(C\) has \(k\) parameters and \(S\) has \(j\), we can scale this ratio to arrive at an \(F\)-statistic that follows an \(F\) distribution

\[ F = \frac{ \left( SSE_{S} - SSE_{C} \right) / (k - j)}{ SSE_{C} / (n - k)} \sim F_{(k - j),(n - k)} \]
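A minimal sketch of this comparison in R with simulated data (the variable names and values here are illustrative, not from the lab):

```r
## compare a simple model S (one predictor) to a complex model C (two)
set.seed(123)
n <- 30
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)
mod_S <- lm(y ~ x1)        # j = 2 parameters
mod_C <- lm(y ~ x1 + x2)   # k = 3 parameters
SSE_S <- sum(residuals(mod_S)^2)
SSE_C <- sum(residuals(mod_C)^2)
j <- 2
k <- 3
## F-statistic as defined above
F_stat <- ((SSE_S - SSE_C) / (k - j)) / (SSE_C / (n - k))
## p-value from the upper tail of the F distribution
p_val <- pf(F_stat, k - j, n - k, lower.tail = FALSE)
## R's built-in nested-model comparison gives the same F and p-value
anova(mod_S, mod_C)
```
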

\(F\)-distribution

The \(F\)-distribution describes the ratio of two independent \(\chi^2\) random variates, each scaled by its degrees of freedom

If \(A \sim \chi_{df_{A}}^{2}\) and \(B \sim \chi_{df_{B}}^{2}\) are independent, then

\[ \frac{\left( \frac{A}{df_{A}} \right) }{ \left( \frac{B}{df_{B}} \right) } \sim F _{df_{A},df_{B}} \]
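A quick simulation check of this result (the degrees of freedom here are arbitrary choices for illustration):

```r
## the ratio of two independent chi-squared variates, each divided by
## its df, is F-distributed
set.seed(42)
df_A <- 3
df_B <- 20
A <- rchisq(1e5, df = df_A)
B <- rchisq(1e5, df = df_B)
ratio <- (A / df_A) / (B / df_B)
## the empirical 95th percentile should be close to the theoretical quantile
quantile(ratio, 0.95)
qf(0.95, df_A, df_B)
```
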

\(F\)-distribution

\(F\)-tests for nested models

  • Test of all predictors: \(H_0: \beta_1 = \cdots = \beta_p = 0\), comparing the full model \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i}\) against the reduced model \(y_i = \beta_0\)

  • Test of one predictor: \(H_0: \beta_i = 0 ~|~ \text{other predictors}\), comparing the full model \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i}\) against the reduced model \(y_i = \beta_0 + \beta_1 x_{1,i}\)

  • Test of a subset of predictors: \(H_0: \beta_i = \beta_j\), comparing the full model \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i}\) against the reduced model \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_1 x_{2,i}\)

\(F\)-tests for nested models

Note that we can only test nested models

For example, we can’t test

\(y_i = \beta_0 + \beta_1 x_{1,i}\) versus \(y_i = \beta_0 + \beta_2 x_{2,i}\)


We will address model selection more fully later in the quarter

Test of all predictors in a model

Suppose we wanted to test whether the collection of predictors in a model were better than simply estimating the data by their mean

\[ C: \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ S: \mathbf{y} = \boldsymbol{\mu} + \mathbf{e} \\ \]

We would write the null hypothesis as

\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \]

and we would reject \(H_0\) if \(F > F^{(\alpha)}_{k - j, n - k}\) (here \(j = 1\) because \(S\) contains only the mean)

Hypothesis test to compare models

Our \(F\) statistic is given by

\[ F = \frac{ \left( SSE_{S} - SSE_{C} \right) / (k - j)}{ SSE_{C} / (n - k)} \sim F_{(k - j),(n - k)} \]


and the sum-of-squares terms are given by

\[ SSE_{S} = \mathbf{e}^{\top} \mathbf{e} = \left( \mathbf{y} - \bar{y} \right)^{\top} \left( \mathbf{y} - \bar{y} \right) \\ ~ \\ SSE_{C} = \mathbf{e}^{\top} \mathbf{e} = \left( \mathbf{y} - \mathbf{X} \widehat{\boldsymbol{\beta}} \right)^{\top} \left( \mathbf{y} - \mathbf{X} \widehat{\boldsymbol{\beta}} \right) \]
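A sketch of these quantities in matrix form with simulated data (names and values illustrative):

```r
## sums of squares computed directly from the design matrix
set.seed(1)
n <- 30
X <- cbind(1, runif(n), runif(n))   # design matrix with intercept column
beta <- c(1, 2, 0.5)
y <- drop(X %*% beta) + rnorm(n)
## least-squares estimate: beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
## SSE for the simple (mean-only) and complex (full) models
SSE_S <- sum((y - mean(y))^2)
SSE_C <- sum((y - drop(X %*% beta_hat))^2)
k <- ncol(X)   # parameters in C
j <- 1         # parameters in S (the mean alone)
## matches the overall F-statistic reported by summary() of the fitted model
F_stat <- ((SSE_S - SSE_C) / (k - j)) / (SSE_C / (n - k))
F_stat
```
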

Predictors of plant diversity

Later in lab we will work with the gala dataset\(\dagger\) in the faraway package, which contains data on the diversity of plant species across 30 Galapagos islands

For now let’s hypothesize that

diversity = \(f\)(area, elevation, distance to nearest island)



\(\dagger\)From Johnson & Raven (1973) Science 179:893-895

Testing one predictor

We might ask whether any one predictor could be dropped from a model

For example, can \(\text{nearest}\) be dropped from our full model?

\[ \text{species}_i = \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \]

Testing one predictor

One option is to fit these two models and compare them via our \(F\)-test


\[ H_0: \beta_3 = 0 \\ ~ \\ \begin{aligned} \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \\ ~ \\ \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \epsilon_i \end{aligned} \]
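In R this comparison is a single call to anova(); a sketch, assuming the faraway package (and its gala dataset) is installed:

```r
## F-test for dropping `nearest` from the full model
data(gala, package = "faraway")
full <- lm(Species ~ Area + Elevation + Nearest, data = gala)
reduced <- lm(Species ~ Area + Elevation, data = gala)
## F-test of H0: beta_3 = 0
anova(reduced, full)
```
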

Testing one predictor

Another option is to estimate a \(t\)-statistic as

\[ t_i = \frac{\widehat{\beta}_i}{\text{SE} \left( \widehat{\beta}_i \right)} \]

and compare it to a \(t\)-distribution with \(n - k\) degrees of freedom
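For a single coefficient the two approaches agree: the squared \(t\)-statistic from summary() equals the \(F\)-statistic from the nested comparison. A sketch with simulated data (names and values illustrative):

```r
## t-test and F-test for one predictor give equivalent results
set.seed(7)
n <- 30
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + 2 * x1 + rnorm(n)
full <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)
t_stat <- summary(full)$coefficients["x2", "t value"]
F_stat <- anova(reduced, full)$F[2]
## t^2 == F for this one-parameter test
c(t_squared = t_stat^2, F = F_stat)
```
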

Testing 2+ predictors

We might want to know whether we can drop 2+ predictors from a model

For example, can we drop both \(\text{elevation}\) and \(\text{nearest}\) from our full model?


\[ H_0 : \beta_2 = \beta_3 = 0 \\ ~ \\ \begin{aligned} \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \\ ~ \\ \text{species}_i &= \alpha + \beta_1 \text{area}_i + \epsilon_i \end{aligned} \]

Testing a subspace

Some tests cannot be expressed in terms of the inclusion or exclusion of predictors

Consider a test of whether the areas of the current and adjacent island could be added together and used in place of the two separate predictors

\[ H_0 : \beta_{\text{area}} = \beta_{\text{adjacent}} \\ ~ \\ \text{species}_i = \alpha + \beta_1 \text{area}_i + \beta_2 \text{adjacent}_i + \dots + \epsilon_i \\ ~ \\ \text{species}_i = \alpha + \beta_1 \text{(area + adjacent)}_i + \dots + \epsilon_i \]
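A sketch of this subspace test in R, assuming the faraway package is installed; I() lets us use the sum of the two areas as a single predictor:

```r
## test H0: beta_area = beta_adjacent by fitting the constrained model
data(gala, package = "faraway")
full <- lm(Species ~ Area + Adjacent + Elevation + Nearest, data = gala)
reduced <- lm(Species ~ I(Area + Adjacent) + Elevation + Nearest, data = gala)
## the constrained model is nested within the full model, so anova() applies
anova(reduced, full)
```
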

Testing a subspace

What if we wanted to test whether a predictor had a specific (non-zero) value?

For example, is there a 1:1 relationship between \(\text{species}\) and \(\text{elevation}\) after controlling for the other predictors?

\[ H_0 : \beta_2 = 1 \\ ~ \\ \text{species}_i = \alpha + \beta_1 \text{area}_i + \underline{1} \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \]

Testing a subspace

We can also modify our \(t\)-test from before and use it for our comparison by including the hypothesized \(\beta_{H_0}\) as an offset

\[ t_i = \frac{\widehat{\beta}_i - \beta_{H_0}}{\text{SE} \left( \widehat{\beta}_i \right)} \]
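A sketch with simulated data (names and values illustrative) of testing \(H_0: \beta_2 = 1\), both via the modified \(t\)-statistic and via R's offset() in an \(F\)-test:

```r
## test whether a slope equals a specific (non-zero) value
set.seed(11)
n <- 30
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + 2 * x1 + 1 * x2 + rnorm(n)
full <- lm(y ~ x1 + x2)
## t-statistic with the hypothesized value subtracted
beta_hat <- coef(summary(full))["x2", "Estimate"]
se_hat <- coef(summary(full))["x2", "Std. Error"]
t_stat <- (beta_hat - 1) / se_hat
## two-sided p-value with n - k degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = n - 3, lower.tail = FALSE)
## equivalently, fix beta_2 = 1 with offset() and compare via the F-test
reduced <- lm(y ~ x1 + offset(x2))
anova(reduced, full)
```
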

Caveats about null hypothesis testing

Null hypothesis testing (NHT) is a slippery slope

  • \(p\)-values are simply the probability of obtaining a test statistic as large as or larger than that observed
  • \(p\)-values are not weights of evidence
  • “Critical” or “threshold” values against which to compare \(p\)-values must be chosen a priori
  • Null hypotheses can be extremely weak (eg, \(H_0: \beta_1 = 0\))
  • Beware of “\(p\) hacking”, where researchers run many tests in search of significance

QUESTIONS?

Confidence intervals for \(\beta\)

We can also use confidence intervals (CI’s) to express uncertainty in \(\widehat{\beta}_i\)

They take the form

\[ 100(1 - \alpha)\% ~ \text{CI}: \widehat{\beta}_{i} \pm t_{n-k}^{(\alpha / 2)} \operatorname{SE}(\widehat{\beta}_{i}) \]

where \(\alpha\) is our predetermined Type-I error rate
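A minimal check in R that the formula above matches the built-in confint() (simulated data; names and values illustrative):

```r
## manual CI for a slope vs. confint()
set.seed(5)
n <- 30
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
mod <- lm(y ~ x)
## manual 95% CI for the slope (k = 2 parameters here, so df = n - 2)
est <- coef(summary(mod))["x", "Estimate"]
se <- coef(summary(mod))["x", "Std. Error"]
alpha <- 0.05
ci_manual <- est + c(-1, 1) * qt(1 - alpha / 2, df = n - 2) * se
## built-in equivalent
confint(mod, "x", level = 0.95)
```
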

\(t\)-distribution

The \(t\)-distribution is symmetrical like the normal distribution, but it has heavier tails, which makes it more conservative at smaller sample sizes

\(t\)-distribution

As \(n\) approaches \(\infty\) the \(t\)-distribution approaches the normal distribution
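A quick illustration of this convergence using R's quantile functions:

```r
## the 97.5th percentile of the t-distribution shrinks toward the normal
## quantile as the degrees of freedom grow
sapply(c(2, 5, 10, 30, 100, 1000), function(df) qt(0.975, df))
qnorm(0.975)
```
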

Bootstrap confidence intervals

The \(F\)- and \(t\)-based CI’s we have described depend on the assumption of normality

The bootstrap\(\dagger\) method provides a way to construct CI’s without this assumption




\(\dagger\)Efron (1979) The Annals of Statistics 7:1–26

Bootstrap procedure

  1. Fit your model to the data

  2. Calculate \(\mathbf{e} = \mathbf{y} - \mathbf{X} \widehat{\boldsymbol{\beta}}\)

  3. Do the following many times:

    • Generate \(\mathbf{e}^*\) by sampling with replacement from \(\mathbf{e}\)

    • Calculate \(\mathbf{y}^*\) = \(\mathbf{X} \widehat{\boldsymbol{\beta}} + \mathbf{e}^*\)

    • Estimate \(\widehat{\boldsymbol{\beta}}^*\) from \(\mathbf{X}\) & \(\mathbf{y}^*\)

  4. Select the \(\tfrac{\alpha}{2}\) and \((1 - \tfrac{\alpha}{2})\) percentiles from the saved \(\widehat{\boldsymbol{\beta}}^*\)

Bootstrap pseudocode

## assumes a predictor x and response y already exist
## number of bootstrap samples
n_boot <- 1000
## empty vector for results
betas <- rep(NA, n_boot)
## fit initial model: y = alpha + beta*x + e
mod_fit <- lm(y ~ x)
## original residuals and parameter estimates (computed once)
e <- residuals(mod_fit)
alpha <- coef(mod_fit)[1]
beta <- coef(mod_fit)[2]
## iterate many times
for(i in 1:n_boot) {
  ## sample residuals with replacement
  e_star <- sample(x = e, size = length(e), replace = TRUE)
  ## calculate new y from the original fit plus resampled errors
  y_star <- alpha + beta * x + e_star
  ## refit model with new y
  boot_fit <- lm(y_star ~ x)
  ## save the bootstrapped slope estimate
  betas[i] <- coef(boot_fit)[2]
}
## pick out the 2.5th and 97.5th percentiles
CI_95 <- quantile(betas, probs = c(0.025, 0.975))

Confidence interval for new predictions

Given a fitted model \(\mathbf{y} = \mathbf{X} \widehat{\boldsymbol{\beta}} + \mathbf{e}\), we might want to know the uncertainty around a new estimate \(\mathbf{y}^*\) given some new predictor \(\mathbf{X}^*\)

CI for the mean response

Suppose we wanted to estimate the uncertainty in the average response given by

\[ \widehat{\mathbf{y}}^* = \mathbf{X}^* \widehat{\boldsymbol{\beta}} \]

Recall that the general formula for a CI on a quantity \(z\) is

\[ 100(1 - \alpha)\% ~ \text{CI}: \text{E}(z) ~ \pm ~ t^{(\alpha / 2)}_{df}\text{SD}(z) \]

So we would have

\[ \widehat{\mathbf{y}}^* ~ \pm ~ t^{(\alpha / 2)}_{df} \sqrt{\text{Var} \left( \widehat{\mathbf{y}}^* \right)} \]

CI for the mean response

We can calculate the SD of our expectation as

\[ \begin{aligned} \text{Var} \left( \widehat{\mathbf{y}}^* \right) &= \text{Var} \left( \mathbf{X}^* \widehat{\boldsymbol{\beta}} \right) \\ &= {\mathbf{X}^*}^{\top} \text{Var}\left( \widehat{\boldsymbol{\beta}} \right) \mathbf{X}^* \\ &= {\mathbf{X}^*}^{\top} \left[ \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \right] \mathbf{X}^* \\ &\Downarrow \\ \text{SD} \left( \widehat{\mathbf{y}}^* \right) &= \sigma \sqrt{ {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \end{aligned} \]

CI for the mean response

So our CI on the mean response is given by

\[ \widehat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{ {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]
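A sketch checking this formula against R's predict() with simulated data (names and values illustrative):

```r
## CI for the mean response at a new predictor value
set.seed(9)
n <- 30
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
mod <- lm(y ~ x)
x_new <- data.frame(x = 0.5)
## built-in CI for the mean response at x = 0.5
predict(mod, newdata = x_new, interval = "confidence", level = 0.95)
## manual version from the formula above
X <- cbind(1, x)
X_star <- c(1, 0.5)
sigma_hat <- summary(mod)$sigma
se_mean <- sigma_hat * sqrt(drop(t(X_star) %*% solve(t(X) %*% X) %*% X_star))
fit <- sum(X_star * coef(mod))
fit + c(-1, 1) * qt(0.975, df = n - 2) * se_mean
```
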

CI for a specific response

What about the uncertainty in a specific prediction?

In that case we need to account for our additional uncertainty owing to the error in our relationship, which is given by

\[ \mathbf{y}^* = \mathbf{X}^* \widehat{\boldsymbol{\beta}} + \mathbf{e} \]

CI for a specific response

The SD of the new prediction is given by

\[ \begin{aligned} \text{Var} \left( \widehat{\mathbf{y}}^* \right) &= {\mathbf{X}^*}^{\top} \text{Var}\left( \widehat{\boldsymbol{\beta}} \right) \mathbf{X}^* + \text{Var} \left( \mathbf{e} \right) \\ &= {\mathbf{X}^*}^{\top} \left[ \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \right] \mathbf{X}^* + \sigma^2\\ &= \sigma^2 \left( {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* + 1 \right) \\ &\Downarrow \\ \text{SD} \left( \widehat{\mathbf{y}}^* \right) &= \sigma \sqrt{1 + {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \end{aligned} \]

CI for a specific response

So our CI on the new prediction is given by

\[ \widehat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{1 + {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]


This is typically referred to as the prediction interval
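The same check as before, now for a prediction interval on a single new observation (simulated data; names and values illustrative):

```r
## prediction interval for a new observation at x = 0.5
set.seed(10)
n <- 30
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
mod <- lm(y ~ x)
x_new <- data.frame(x = 0.5)
## built-in prediction interval (wider than the confidence interval)
predict(mod, newdata = x_new, interval = "prediction", level = 0.95)
## manual standard error includes the extra "+ 1" for observation error
X <- cbind(1, x)
X_star <- c(1, 0.5)
sigma_hat <- summary(mod)$sigma
se_pred <- sigma_hat * sqrt(1 + drop(t(X_star) %*% solve(t(X) %*% X) %*% X_star))
fit <- sum(X_star * coef(mod))
fit + c(-1, 1) * qt(0.975, df = n - 2) * se_pred
```
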