8 April 2026

Goals for today

  • Understand the concept and practice of partitioning sums-of-squares
  • Understand the uses of \(R^2\) and adjusted \(R^2\) for linear models

General approach

Question \(\rightarrow\) Data \(\rightarrow\) Model \(\rightarrow\) Inference \(\rightarrow\) Prediction

Inference

Relationship(s) between predictor(s) and response

  • How big is an effect? Is an effect “important”?
  • How certain are we about the sign and size of the effect?

General approach

Question \(\rightarrow\) Data \(\rightarrow\) Model \(\rightarrow\) Inference \(\rightarrow\) Prediction

Prediction

  • What do we predict for a novel case?
  • How good will our predictions be?

Partitioning variance

In general, we have something like

\[ DATA = MODEL + ERRORS \]

and hence

\[ \text{Var}(DATA) = \text{Var}(MODEL) + \text{Var}(ERRORS) \]

Partitioning total deviations

The total deviations in the data equal the sum of those for the model and errors

\[ \underbrace{y_i - \bar{y}}_{\text{Total}} = \underbrace{\widehat{y}_i - \bar{y}}_{\text{Model}} + \underbrace{y_i - \widehat{y}_i}_{\text{Error}} \]

Partitioning variance

An example

Let’s consider a model for the relationship between

  • soil surface temperature (C) measured across 15 small plots on a hill, and

  • the aspect of each plot (direction in degrees; N to S)

Partitioning variance

An example

[Scatterplot of soil surface temperature vs. aspect for the 15 plots]

Partitioning variance

An example

Let’s fit this model:

\[temperature_i = \beta_0 + \beta_1 \times aspect_i + e_i\]

and find \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{y}}\)


Let’s use matrix algebra like we learned last time!

Partitioning variance

An example

Create our design matrix \(\mathbf{X}\)

## number of plots (15 per the example above)
nn <- 15

## construct our design matrix
XX <- cbind(intercept = rep(1, nn), x = aspect)
head(XX)
##      intercept   x
## [1,]         1  49
## [2,]         1 144
## [3,]         1  66
## [4,]         1 172
## [5,]         1 114
## [6,]         1  25

Partitioning variance

An example

We can calculate our betas using

\(\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}\)

## calculate betas  
beta_hat <- solve(t(XX) %*% XX) %*% t(XX) %*% temp


Recall that t() transposes a matrix and solve() inverts it

Partitioning variance

An example

We can calculate our model estimates (\(\hat{\mathbf{y}}\)) using

\(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\)

## calculate model estimates  
temp_hat <- XX %*% beta_hat
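As a sanity check, the matrix-algebra fit should agree with R's built-in `lm()`. The full `temp` and `aspect` vectors aren't listed on the slides, so this sketch uses simulated stand-in data (the values are made up for illustration).

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, min = 0, max = 180)
temp <- 10 + 0.05 * aspect + rnorm(nn, sd = 1)

## design matrix and matrix-algebra fit, as on the slides
XX <- cbind(intercept = rep(1, nn), x = aspect)
beta_hat <- solve(t(XX) %*% XX) %*% t(XX) %*% temp
temp_hat <- XX %*% beta_hat

## cross-check against R's least squares routine
fit <- lm(temp ~ aspect)
stopifnot(all.equal(as.numeric(beta_hat), unname(coef(fit))))
stopifnot(all.equal(as.numeric(temp_hat), unname(fitted(fit))))
```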

Partitioning variance

An example

Let’s plot our model estimates (i.e., the fitted regression line)

Partitioning total deviations

Partitioning sums-of-squares

The sums-of-squares have the same additive property as the deviations

\[ \underbrace{\sum (y_i - \bar{y})^2}_{SSTO} = \underbrace{\sum (\widehat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum (y_i - \widehat{y}_i)^2}_{SSE} \]

Sum-of-squares: Total

The total sum-of-squares \((SSTO)\) measures the total variation in the data as the differences between the data and their mean

\[ SSTO = \sum \left( y_i - \bar{y} \right)^2 \]

Sum-of-squares: Model

The model (regression) sum-of-squares \((SSR)\) measures the variation between the model fits and the mean of the data

\[ SSR = \sum \left( \widehat{y}_i - \bar{y} \right)^2 \]

Sum-of-squares: Error

The error sum-of-squares \((SSE)\) measures the variation between the data and the model fits

\[ SSE = \sum \left( y_i - \widehat{y}_i \right)^2 \]

Sum-of-squares

An example

Let’s calculate the total sum-of-squares (SSTO) using

\(SSTO = \sum \left( y_i - \bar{y} \right)^2\)

## mean of the response
y_bar <- mean(temp)

## total sum-of-squares
SSTO <- t(temp - y_bar) %*% (temp - y_bar)


Recall that \(\mathbf{x}^{\top} \mathbf{x}\) will give the sum of the squared elements in \(\mathbf{x}\)

Sum-of-squares

An example

Let’s calculate the model sum-of-squares (SSR) using

\(SSR = \sum \left( \widehat{y}_i - \bar{y} \right)^2\)

## model sum-of-squares
SSR <- t(temp_hat - y_bar) %*% (temp_hat - y_bar)

Sum-of-squares

An example

Let’s calculate the error sum-of-squares (SSE) using

\(SSE = \sum \left( y_i - \widehat{y}_i \right)^2\)

## error sum-of-squares
SSE <- t(temp - temp_hat) %*% (temp - temp_hat)
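With all three sums-of-squares in hand, we can verify the additive partition \(SSTO = SSR + SSE\) numerically. This sketch uses simulated stand-in data since the slide data aren't listed.

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, 0, 180)
temp <- 10 + 0.05 * aspect + rnorm(nn)

## fitted values via the normal equations
XX <- cbind(1, aspect)
temp_hat <- XX %*% solve(t(XX) %*% XX) %*% t(XX) %*% temp

## the three sums-of-squares
y_bar <- mean(temp)
SSTO <- sum((temp - y_bar)^2)
SSR  <- sum((temp_hat - y_bar)^2)
SSE  <- sum((temp - temp_hat)^2)

## SSTO = SSR + SSE (up to numerical precision)
stopifnot(all.equal(SSTO, SSR + SSE))
```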

QUESTIONS?

Goodness-of-fit

How about a measure of how well a model fits the data?

  • \(SSTO\) measures the variation in \(y\) without considering \(X\)
  • \(SSE\) measures the reduced variation in \(y\) after considering \(X\)
  • Let’s consider this reduction in variance as a proportion of the total

Goodness-of-fit

A common option is the coefficient of determination, \(R^2\)

\[ R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \\ ~ \\ 0 \leq R^2 \leq 1 \]

Goodness-of-fit

An example

Let’s calculate \(R^2\) for our model

## coefficient of determination
SSR / SSTO
##           [,1]
## [1,] 0.8831171
## or via
1 - SSE / SSTO
##           [,1]
## [1,] 0.8831171
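As a cross-check, `summary.lm()` reports the same quantity as `r.squared`; a sketch with simulated stand-in data (the values are illustrative only):

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, 0, 180)
temp <- 10 + 0.05 * aspect + rnorm(nn)

## fit the model and compute R^2 by hand
fit <- lm(temp ~ aspect)
SSE  <- sum(resid(fit)^2)
SSTO <- sum((temp - mean(temp))^2)
R2 <- 1 - SSE / SSTO

## matches the R^2 reported by summary()
stopifnot(all.equal(R2, summary(fit)$r.squared))
```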

Degrees of freedom

The number of independent elements that are free to vary when estimating quantities of interest

Degrees of freedom

An example

  • Imagine you have 7 hats and you want to wear a different one on each day of the week.
  • On day 1 you can choose any of the 7, on day 2 any of the remaining 6, and so forth
  • When day 7 rolls around, however, you are out of choices: there is only one unworn hat
  • Thus, you had 7 - 1 = 6 days of freedom to choose your hat

Model in geometric space

\(\mathbf{y}\) is \(n\)-dim; \(\widehat{\mathbf{y}}\) is \(k\)-dim; \(\mathbf{e}\) is \((n-k)\)-dim

Degrees of freedom

Linear models

Beginning with \(SSTO\), we have

\[ SSTO = \sum \left( y_i - \bar{y} \right)^2 \]

The data are unconstrained and lie in an \(n\)-dimensional space, but estimating the mean \((\bar{y})\) from the data costs 1 degree of freedom \((df)\), so

\[ df_{SSTO} = n - 1 \]

Degrees of freedom

Linear models

For the \(SSR\) we have

\[ SSR = \sum \left( \widehat{y}_i - \bar{y} \right)^2 \]

We estimate the data \((\widehat{y})\) with a \(k\)-dimensional model, but we lose 1 \(df\) when estimating the mean, so

\[ df_{SSR} = k - 1 \]

Degrees of freedom

Linear models

The \(SSE\) is analogous

\[ SSE = \sum \left( y_i - \widehat{y}_i \right)^2 \]

The data lie in an \(n\)-dimensional space and we represent them in a \(k\)-dimensional subspace, so

\[ df_{SSE} = n - k \]

Mean squares

A mean square is a sum-of-squares divided by its degrees of freedom; its expectation gives an indication of the variance for the model and errors

\[ MS = \frac{SS}{df} \\ \Downarrow \\ MSR = \frac{SSR}{k - 1} ~~~ \& ~~~ MSE = \frac{SSE}{n - k} \]
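In R, the mean squares follow directly from the sums-of-squares and their \(df\) (here \(k = 2\): an intercept and one slope), and they match the entries in R's ANOVA table. A sketch with simulated stand-in data:

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, 0, 180)
temp <- 10 + 0.05 * aspect + rnorm(nn)

## fit the model; k = number of estimated parameters
fit <- lm(temp ~ aspect)
kk <- length(coef(fit))
SSR <- sum((fitted(fit) - mean(temp))^2)
SSE <- sum(resid(fit)^2)

## mean squares = SS / df
MSR <- SSR / (kk - 1)
MSE <- SSE / (nn - kk)

## these match the "Mean Sq" column of R's ANOVA table
stopifnot(all.equal(unname(anova(fit)[["Mean Sq"]]), c(MSR, MSE)))
```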

Variance estimates

We are typically interested in two variance estimates:

  1. The variance of the residuals \(\mathbf{e}\)

  2. The variance of the model parameters \(\boldsymbol{\beta}\)

Variance estimates

Residuals

In a least squares context, we assume that the model errors (residuals) are independent and identically distributed with mean 0 and variance \(\sigma^2\)

The problem is that we don’t know \(\sigma^2\) and therefore we must estimate it

Variance estimates

Residuals

If \(z_i \sim \text{N}(0, 1)\) then

\[ \sum_{i = 1}^{n} z_i^2 = \mathbf{z}^{\top}\mathbf{z} \sim \chi^2_{n} \]

Variance estimates

Residuals

If \(z_i \sim \text{N}(0, 1)\) then

\[ \sum_{i = 1}^{n} z_i^2 = \mathbf{z}^{\top}\mathbf{z} \sim \chi^2_{n} \]

In our linear model, \(e_i \sim \text{N}(0, \sigma^2)\) so

\[ \sum_{i = 1}^{n} e_i^2 = \mathbf{e}^{\top}\mathbf{e} \sim \sigma^2 \cdot \chi^2_{n - k} \]

Variance estimates

Residuals

Thus, given

\[ \mathbf{e}^{\top}\mathbf{e} \sim \sigma^2 \cdot \chi^2_{n - k} \\ \text{E}(\chi^2_{n - k}) = n - k \\ \mathbf{e}^{\top}\mathbf{e} = SSE \]

then

\[ \text{E}(SSE) = \sigma^2 (n - k) ~ \Rightarrow ~ \widehat{\sigma}^2 = \frac{SSE}{n - k} = MSE \]
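This is exactly the residual variance R reports: the `sigma` returned by `summary.lm()` is the square root of the MSE. A sketch with simulated stand-in data:

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, 0, 180)
temp <- 10 + 0.05 * aspect + rnorm(nn)

## estimate sigma^2 as SSE / (n - k)
fit <- lm(temp ~ aspect)
kk <- length(coef(fit))
MSE <- sum(resid(fit)^2) / (nn - kk)

## summary()$sigma is the residual standard error = sqrt(MSE)
stopifnot(all.equal(MSE, summary(fit)$sigma^2))
```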

Variance estimates

Parameters

Recall that our estimate of the model parameters is

\[ \widehat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \]

Variance estimates

Parameters

Estimating the variance of the model parameters \(\boldsymbol{\beta}\) requires some linear algebra

For a scalar \(z\), if \(\text{Var}(z) = \sigma^2\) then \(\text{Var}(az) = a^2 \sigma^2\)

For a vector \(\mathbf{z}\), if \(\text{Var}(\mathbf{z}) = \mathbf{\Sigma}\) then \(\text{Var}(\mathbf{A z}) = \mathbf{A} \mathbf{\Sigma} \mathbf{A}^{\top}\)

Variance estimates

Parameters

The variance of the parameters is therefore

\[ \begin{aligned} \widehat{\boldsymbol{\beta}} &= (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \\ &= \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \mathbf{y} \\ \end{aligned} \\ \Downarrow \\ \text{Var}(\widehat{\boldsymbol{\beta}}) = \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \text{Var}(\mathbf{y}) \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right]^{\top} \]

Variance estimates

Parameters

Recall that we can write our model in matrix form as

\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ \mathbf{e} \sim \text{MVN}(\mathbf{0}, \sigma^2 \mathbf{I}) \]

Variance estimates

Parameters

We can rewrite our model more compactly as

\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ \mathbf{e} \sim \text{MVN}(\mathbf{0}, \sigma^2 \mathbf{I}) \\ \Downarrow \\ \mathbf{y} \sim \text{MVN}(\mathbf{X} \boldsymbol{\beta}, \underbrace{\sigma^2 \mathbf{I}}_{\text{Var}(\mathbf{y} | \mathbf{X} \boldsymbol{\beta})}) \\ \]

Variance estimates

Parameters

Our estimate of \(\text{Var}(\widehat{\boldsymbol{\beta}})\) is then

\[ \begin{aligned} \text{Var}(\widehat{\boldsymbol{\beta}}) &= \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \text{Var}(\mathbf{y}) \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right]^{\top} \\ &= \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \sigma^2 \mathbf{I} \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right]^{\top} \\ &= \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} (\mathbf{X}^{\top} \mathbf{X}) \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \right]^{\top} \\ &= \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \end{aligned} \]
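Plugging the MSE in for \(\sigma^2\), this formula reproduces the variance-covariance matrix that R's `vcov()` returns for a fitted model. A sketch with simulated stand-in data:

```r
## simulated stand-in data (values are illustrative only)
set.seed(42)
nn <- 15
aspect <- runif(nn, 0, 180)
temp <- 10 + 0.05 * aspect + rnorm(nn)

## fit the model and estimate sigma^2 via the MSE
XX <- cbind(1, aspect)
fit <- lm(temp ~ aspect)
kk <- length(coef(fit))
sigma2_hat <- sum(resid(fit)^2) / (nn - kk)

## Var(beta_hat) = sigma^2 (X'X)^{-1}, with sigma^2 estimated by MSE
V_beta <- sigma2_hat * solve(t(XX) %*% XX)

## matches R's variance-covariance matrix for the coefficients
stopifnot(all.equal(V_beta, vcov(fit), check.attributes = FALSE))
```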

Partitioning total deviations

Here is a plot of some data \(y\) and a predictor \(x\)

Partitioning total deviations

And let’s consider this model: \(y_i = \alpha + \beta x_i + e_i\)

Partitioning total deviations

Here is our model fit to the data

Variance estimates

Parameters

Let’s think about the variance of \(\widehat{\boldsymbol{\beta}}\)

\[ \text{Var}(\widehat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \]

This suggests that our confidence in our estimate increases with the spread in \(\mathbf{X}\)

Effect of \(\mathbf{X}\) on parameter precision

Consider these two scenarios where the slope of the relationship is identical
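The two scenarios can be sketched directly from \(\text{Var}(\widehat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}\): holding \(\sigma^2\) fixed (at 1 here, for illustration), a wider spread of \(x\) values shrinks the variance of the slope estimate. The two \(x\) ranges below are made up for illustration.

```r
## Var(beta_hat) up to the common factor sigma^2
beta_var <- function(x) {
  X <- cbind(1, x)
  solve(t(X) %*% X)
}

## same number of points; only the spread in x differs
x_narrow <- seq(70, 110, length.out = 15)
x_wide   <- seq(0, 180, length.out = 15)

## the slope's variance (the [2, 2] element) is smaller for the wider spread
stopifnot(beta_var(x_wide)[2, 2] < beta_var(x_narrow)[2, 2])
```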