8 April 2026

Goals for today

  • Understand the concept and practice of partitioning sums-of-squares
  • Understand the uses of \(R^2\) and adjusted \(R^2\) for linear models

General approach

Question \(\rightarrow\) Data \(\rightarrow\) Model \(\rightarrow\) Inference \(\rightarrow\) Prediction

Inference

Relationship(s) between predictor(s) and response

  • How big is an effect? Is an effect “important”?
  • How certain are we about the sign and size of the effect?

General approach

Question \(\rightarrow\) Data \(\rightarrow\) Model \(\rightarrow\) Inference \(\rightarrow\) Prediction

Prediction

  • What do we predict for a novel case?
  • How good will our predictions be?

Partitioning variance

In general, we have something like

\[ DATA = MODEL + ERRORS \]

and hence

\[ \text{Var}(DATA) = \text{Var}(MODEL) + \text{Var}(ERRORS) \]

Partitioning total deviations

The total deviations in the data equal the sum of those for the model and errors

\[ \underbrace{y_i - \bar{y}}_{\text{Total}} = \underbrace{\widehat{y}_i - \bar{y}}_{\text{Model}} + \underbrace{y_i - \widehat{y}_i}_{\text{Error}} \]

Partitioning variance

An example

Let’s consider a model for the relationship between

  • soil surface temperature (C) measured across 15 small plots on a hill, and

  • the aspect of each plot (direction in degrees; N to S)

Partitioning variance

An example

[Scatterplot of soil surface temperature vs. aspect for the 15 plots]

Partitioning variance

An example

Let’s fit this model:

\[temperature_i = \beta_0 + \beta_1 \times aspect_i + e_i\]

and find \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{y}}\)


Let’s use matrix algebra like we learned last time!

Partitioning variance

An example

Create our design matrix \(\mathbf{X}\)

## number of plots (15 per the example above)
nn <- 15

## construct our design matrix
XX <- cbind(intercept = rep(1, nn), x = aspect)
head(XX)
##      intercept   x
## [1,]         1  49
## [2,]         1 144
## [3,]         1  66
## [4,]         1 172
## [5,]         1 114
## [6,]         1  25

Partitioning variance

An example

We can calculate our betas using

\(\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}\)

## calculate betas  
beta_hat <- solve(t(XX) %*% XX) %*% t(XX) %*% temp


Recall that t() transposes a matrix and solve() inverts it

Partitioning variance

An example

We can calculate our model estimates (\(\hat{\mathbf{y}}\)) using

\(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\)

## calculate model estimates  
temp_hat <- XX %*% beta_hat
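As a sanity check, the matrix-algebra fit should agree with R's built-in `lm()`. The full `temp` and `aspect` vectors aren't listed on the slides, so this sketch uses simulated stand-in data (the values are made up for illustration).

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, min = 0, max = 180)
temp <- 10 + 0.05 * aspect + rnorm(nn, sd = 1)

## design matrix and matrix-algebra fit, as on the slides
XX <- cbind(intercept = rep(1, nn), x = aspect)
beta_hat <- solve(t(XX) %*% XX) %*% t(XX) %*% temp
temp_hat <- XX %*% beta_hat

## cross-check against R's least squares routine
fit <- lm(temp ~ aspect)
stopifnot(all.equal(as.numeric(beta_hat), unname(coef(fit))))
stopifnot(all.equal(as.numeric(temp_hat), unname(fitted(fit))))
```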

Partitioning variance

An example

Let’s plot our model estimates (i.e., the fitted regression line)

Partitioning total deviations

Partitioning sums-of-squares

The sums-of-squares have the same additive property as the deviations

\[ \underbrace{\sum (y_i - \bar{y})^2}_{SSTO} = \underbrace{\sum (\widehat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum (y_i - \widehat{y}_i)^2}_{SSE} \]

Sum-of-squares: Total

The total sum-of-squares \((SSTO)\) measures the total variation in the data as the differences between the data and their mean

\[ SSTO = \sum \left( y_i - \bar{y} \right)^2 \]

Sum-of-squares: Model

The model (regression) sum-of-squares \((SSR)\) measures the variation between the model fits and the mean of the data

\[ SSR = \sum \left( \widehat{y}_i - \bar{y} \right)^2 \]

Sum-of-squares: Error

The error sum-of-squares \((SSE)\) measures the variation between the data and the model fits

\[ SSE = \sum \left( y_i - \widehat{y}_i \right)^2 \]

Sum-of-squares

An example

Let’s calculate the total sum-of-squares (SSTO) using

\(SSTO = \sum \left( y_i - \bar{y} \right)^2\)

## mean of the response
y_bar <- mean(temp)

## total sum-of-squares
SSTO <- t(temp - y_bar) %*% (temp - y_bar)


Recall that \(\mathbf{x}^{\top} \mathbf{x}\) will give the sum of the squared elements in \(\mathbf{x}\)

Sum-of-squares

An example

Let’s calculate the model sum-of-squares (SSR) using

\(SSR = \sum \left( \widehat{y}_i - \bar{y} \right)^2\)

## model sum-of-squares
SSR <- t(temp_hat - y_bar) %*% (temp_hat - y_bar)

Sum-of-squares

An example

Let’s calculate the error sum-of-squares (SSE) using

\(SSE = \sum \left( y_i - \widehat{y}_i \right)^2\)

## error sum-of-squares
SSE <- t(temp - temp_hat) %*% (temp - temp_hat)
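With all three sums-of-squares in hand, we can verify the additive partition \(SSTO = SSR + SSE\) numerically. This sketch uses simulated stand-in data since the slide data aren't listed.

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, 0, 180)
temp <- 10 + 0.05 * aspect + rnorm(nn)

## fitted values via the normal equations
XX <- cbind(1, aspect)
temp_hat <- XX %*% solve(t(XX) %*% XX) %*% t(XX) %*% temp

## the three sums-of-squares
y_bar <- mean(temp)
SSTO <- sum((temp - y_bar)^2)
SSR  <- sum((temp_hat - y_bar)^2)
SSE  <- sum((temp - temp_hat)^2)

## SSTO = SSR + SSE (up to numerical precision)
stopifnot(all.equal(SSTO, SSR + SSE))
```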

QUESTIONS?

Goodness-of-fit

How about a measure of how well a model fits the data?

  • \(SSTO\) measures the variation in \(y\) without considering \(X\)
  • \(SSE\) measures the reduced variation in \(y\) after considering \(X\)
  • Let’s consider this reduction in variance as a proportion of the total

Goodness-of-fit

A common option is the coefficient of determination, \(R^2\)

\[ R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \\ ~ \\ 0 \leq R^2 \leq 1 \]

Goodness-of-fit

An example

Let’s calculate \(R^2\) for our model

## coefficient of determination
SSR / SSTO
##           [,1]
## [1,] 0.8831171
## or via
1 - SSE / SSTO
##           [,1]
## [1,] 0.8831171
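As a cross-check, `summary.lm()` reports the same quantity as `r.squared`; a sketch with simulated stand-in data (the values are illustrative only):

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, 0, 180)
temp <- 10 + 0.05 * aspect + rnorm(nn)

## fit the model and compute R^2 by hand
fit <- lm(temp ~ aspect)
SSE  <- sum(resid(fit)^2)
SSTO <- sum((temp - mean(temp))^2)
R2 <- 1 - SSE / SSTO

## matches the R^2 reported by summary()
stopifnot(all.equal(R2, summary(fit)$r.squared))
```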

Degrees of freedom

The number of independent elements that are free to vary when estimating quantities of interest

Degrees of freedom

An example

  • Imagine you have 7 hats and you want to wear a different one on each day of the week.
  • On day 1 you can choose any of the 7, on day 2 any of the remaining 6, and so forth
  • When day 7 rolls around, however, you are out of choices: there is only one unworn hat
  • Thus, you had 7 - 1 = 6 days of freedom to choose your hat

Model in geometric space

\(\mathbf{y}\) is \(n\)-dim; \(\widehat{\mathbf{y}}\) is \(k\)-dim; \(\mathbf{e}\) is \((n-k)\)-dim

Degrees of freedom

Linear models

Beginning with \(SSTO\), we have

\[ SSTO = \sum \left( y_i - \bar{y} \right)^2 \]

The data are unconstrained and lie in an \(n\)-dimensional space, but estimating the mean \((\bar{y})\) from the data costs 1 degree of freedom \((df)\), so

\[ df_{SSTO} = n - 1 \]

Degrees of freedom

Linear models

For the \(SSR\) we have

\[ SSR = \sum \left( \widehat{y}_i - \bar{y} \right)^2 \]

We estimate the data \((\widehat{y})\) with a \(k\)-dimensional model, but we lose 1 \(df\) when estimating the mean, so

\[ df_{SSR} = k - 1 \]

Degrees of freedom

Linear models

The \(SSE\) is analogous

\[ SSE = \sum \left( y_i - \widehat{y}_i \right)^2 \]

The data lie in an \(n\)-dimensional space and we represent them in a \(k\)-dimensional subspace, so

\[ df_{SSE} = n - k \]

Mean squares

A mean square is a sum-of-squares divided by its degrees of freedom; its expectation gives an indication of the variance for the model and errors

\[ MS = \frac{SS}{df} \\ \Downarrow \\ MSR = \frac{SSR}{k - 1} ~~~ \& ~~~ MSE = \frac{SSE}{n - k} \]
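In R, the mean squares follow directly from the sums-of-squares and their \(df\) (here \(k = 2\): an intercept and one slope), and they match the entries in R's ANOVA table. A sketch with simulated stand-in data:

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, 0, 180)
temp <- 10 + 0.05 * aspect + rnorm(nn)

## fit the model; k = number of estimated parameters
fit <- lm(temp ~ aspect)
kk <- length(coef(fit))
SSR <- sum((fitted(fit) - mean(temp))^2)
SSE <- sum(resid(fit)^2)

## mean squares = SS / df
MSR <- SSR / (kk - 1)
MSE <- SSE / (nn - kk)

## these match the "Mean Sq" column of R's ANOVA table
stopifnot(all.equal(unname(anova(fit)[["Mean Sq"]]), c(MSR, MSE)))
```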

Variance estimates

We are typically interested in two variance estimates:

  1. The variance of the residuals \(\mathbf{e}\)

  2. The variance of the model parameters \(\boldsymbol{\beta}\)

Variance estimates

Residuals

In a least squares context, we assume that the model errors (residuals) are independent and identically distributed with mean 0 and variance \(\sigma^2\)

The problem is that we don’t know \(\sigma^2\) and therefore we must estimate it

Variance estimates

Residuals

If \(z_i \sim \text{N}(0, 1)\) then

\[ \sum_{i = 1}^{n} z_i^2 = \mathbf{z}^{\top}\mathbf{z} \sim \chi^2_{n} \]

Variance estimates

Residuals

If \(z_i \sim \text{N}(0, 1)\) then

\[ \sum_{i = 1}^{n} z_i^2 = \mathbf{z}^{\top}\mathbf{z} \sim \chi^2_{n} \]

In our linear model, \(e_i \sim \text{N}(0, \sigma^2)\) so

\[ \sum_{i = 1}^{n} e_i^2 = \mathbf{e}^{\top}\mathbf{e} \sim \sigma^2 \cdot \chi^2_{n - k} \]

Variance estimates

Residuals

Thus, given

\[ \mathbf{e}^{\top}\mathbf{e} \sim \sigma^2 \cdot \chi^2_{n - k} \\ \text{E}(\chi^2_{n - k}) = n - k \\ \mathbf{e}^{\top}\mathbf{e} = SSE \]

then

\[ \text{E}(SSE) = \sigma^2 (n - k) ~ \Rightarrow ~ \widehat{\sigma}^2 = \frac{SSE}{n - k} = MSE \]
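This is exactly the residual variance R reports: the `sigma` returned by `summary.lm()` is the square root of the MSE. A sketch with simulated stand-in data:

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, 0, 180)
temp <- 10 + 0.05 * aspect + rnorm(nn)

## estimate sigma^2 as SSE / (n - k)
fit <- lm(temp ~ aspect)
kk <- length(coef(fit))
MSE <- sum(resid(fit)^2) / (nn - kk)

## summary()$sigma is the residual standard error = sqrt(MSE)
stopifnot(all.equal(MSE, summary(fit)$sigma^2))
```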

Variance estimates

Parameters

Recall that our estimate of the model parameters is

\[ \widehat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \]

Variance estimates

Parameters

Estimating the variance of the model parameters \(\boldsymbol{\beta}\) requires some linear algebra

For a scalar \(z\), if \(\text{Var}(z) = \sigma^2\) then \(\text{Var}(az) = a^2 \sigma^2\)

For a vector \(\mathbf{z}\), if \(\text{Var}(\mathbf{z}) = \mathbf{\Sigma}\) then \(\text{Var}(\mathbf{A z}) = \mathbf{A} \mathbf{\Sigma} \mathbf{A}^{\top}\)

Variance estimates

Parameters

The variance of the parameters is therefore

\[ \begin{aligned} \widehat{\boldsymbol{\beta}} &= (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \\ &= \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \mathbf{y} \\ \end{aligned} \\ \Downarrow \\ \text{Var}(\widehat{\boldsymbol{\beta}}) = \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \text{Var}(\mathbf{y}) \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right]^{\top} \]

Variance estimates

Parameters

Recall that we can write our model in matrix form as

\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ \mathbf{e} \sim \text{MVN}(\mathbf{0}, \sigma^2 \mathbf{I}) \]

Variance estimates

Parameters

We can rewrite our model more compactly as

\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ \mathbf{e} \sim \text{MVN}(\mathbf{0}, \sigma^2 \mathbf{I}) \\ \Downarrow \\ \mathbf{y} \sim \text{MVN}(\mathbf{X} \boldsymbol{\beta}, \underbrace{\sigma^2 \mathbf{I}}_{\text{Var}(\mathbf{y} | \mathbf{X} \boldsymbol{\beta})}) \\ \]

Variance estimates

Parameters

Our estimate of \(\text{Var}(\widehat{\boldsymbol{\beta}})\) is then

\[ \begin{aligned} \text{Var}(\widehat{\boldsymbol{\beta}}) &= \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \text{Var}(\mathbf{y}) \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right]^{\top} \\ &= \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \sigma^2 \mathbf{I} \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right]^{\top} \\ &= \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} (\mathbf{X}^{\top} \mathbf{X}) \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \right]^{\top} \\ &= \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \end{aligned} \]
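Plugging the MSE in for \(\sigma^2\), this formula reproduces the variance-covariance matrix that R's `vcov()` returns for a fitted model. A sketch with simulated stand-in data:

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, 0, 180)
temp <- 10 + 0.05 * aspect + rnorm(nn)

## fit the model and estimate sigma^2 via the MSE
XX <- cbind(1, aspect)
fit <- lm(temp ~ aspect)
kk <- length(coef(fit))
sigma2_hat <- sum(resid(fit)^2) / (nn - kk)

## Var(beta_hat) = sigma^2 (X'X)^{-1}, with sigma^2 estimated by MSE
V_beta <- sigma2_hat * solve(t(XX) %*% XX)

## matches R's variance-covariance matrix for the coefficients
stopifnot(all.equal(V_beta, vcov(fit), check.attributes = FALSE))
```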

Partitioning total deviations

Here is a plot of some data \(y\) and a predictor \(x\)

Partitioning total deviations

And let’s consider this model: \(y_i = \alpha + \beta x_i + e_i\)

Partitioning total deviations

Here is our model fit to the data

Variance estimates

Parameters

Let’s think about the variance of \(\widehat{\boldsymbol{\beta}}\)

\[ \text{Var}(\widehat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \]

This suggests that our confidence in our estimate increases with the spread in \(\mathbf{X}\)

Effect of \(\mathbf{X}\) on parameter precision

Consider these two scenarios where the slope of the relationship is identical
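The two scenarios can be sketched directly from \(\text{Var}(\widehat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}\): holding \(\sigma^2\) fixed (at 1 here, for illustration), a wider spread of \(x\) values shrinks the variance of the slope estimate. The two \(x\) ranges below are made up for illustration.

```r
## Var(beta_hat) up to the common factor sigma^2
beta_var <- function(x) {
  X <- cbind(1, x)
  solve(t(X) %*% X)
}

## same number of points; only the spread in x differs
x_narrow <- seq(70, 110, length.out = 15)
x_wide   <- seq(0, 180, length.out = 15)

## the slope's variance (the [2, 2] element) is smaller for the wider spread
stopifnot(beta_var(x_wide)[2, 2] < beta_var(x_narrow)[2, 2])
```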