- Understand how to represent a linear model with matrix notation
- Understand the concept, assumptions & practice of least squares estimation for linear models
- Understand the concept of identifiability
6 April 2026
\[ y = \text{intercept} + \text{slope} \times x + \text{error} \]
\[ y_i = \alpha + \beta x_{i} + e_i \\ \]
The \(i\) subscript indexes one of \(N\) total observations
Let’s switch the notation for our intercept and slope
\[ y_i = \alpha + \beta x_{i} + e_i \\ \Downarrow \\ y_i = \beta_0 + \beta_1 x_{i} + e_i \]
\(^*\)The reason for this notation switch will become clear later
Let’s make this general statement more specific
\[ y_i = \beta_0 + \beta_1 x_{i} + e_i \\ \Downarrow \\ y_1 = \beta_0 + \beta_1 x_{1} + e_1 \\ y_2 = \beta_0 + \beta_1 x_{2} + e_2 \\ \vdots \\ y_N = \beta_0 + \beta_1 x_{N} + e_N \\ \]
Let’s now make the implicit “1” multiplier on \(\beta_0\) explicit
\[ y_1 = \beta_0 \underline{1} + \beta_1 x_{1} + e_1 \\ y_2 = \beta_0 \underline{1} + \beta_1 x_{2} + e_2 \\ \vdots \\ y_N = \beta_0 \underline{1} + \beta_1 x_{N} + e_N \\ \]
Let’s next gather the common terms into column vectors
\[ y_1 = \beta_0 1 + \beta_1 x_{1} + e_1 \\ y_2 = \beta_0 1 + \beta_1 x_{2} + e_2 \\ \vdots \\ y_N = \beta_0 1 + \beta_1 x_{N} + e_N \\ \]
Maybe something like this?
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} \beta_0 \\ \beta_0 \\ \vdots \\ \beta_0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \begin{bmatrix} \beta_1 \\ \beta_1 \\ \vdots \\ \beta_1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix} \]
We refer to the dimensions of matrices in a row-by-column manner
\([rows \times columns]\)
When adding matrices, the dimensions must match
\([m \times n] + [m \times n]\) ✓
\([m \times n] + [m \times p]\) ✗
When multiplying 2 matrices, the inner dimensions must match
\([m \times \underline{n}] [\underline{n} \times p]\) ✓
\([m \times \underline{n}] [\underline{p} \times q]\) ✗
When multiplying 2 matrices, the dimensions of the product are
[rows-of-first \(\times\) columns-of-second]
\([\underline{m} \times n] [n \times \underline{p}] = [m \times p]\)
Let’s check the dimensions
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} \beta_0 \\ \beta_0 \\ \vdots \\ \beta_0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \begin{bmatrix} \beta_1 \\ \beta_1 \\ \vdots \\ \beta_1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix} \]
\[ [N \times 1] = \underbrace{[N \times 1] [N \times 1]}_{\text{OOPS!}} + \underbrace{[N \times 1] [N \times 1]}_{\text{OOPS!}} + [N \times 1] \]
When multiplying a scalar times a vector/matrix, it’s just element-wise
\[ a \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} aX \\ aY \\ aZ \end{bmatrix} \]
So this looks better
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \beta_0 \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \beta_1 \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix} \]
\[ [N \times 1] = [N \times 1] + [N \times 1] + [N \times 1] \]
This is nice, but can we make \(\beta_0\) and \(\beta_1\) more compact?
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \beta_0 \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \beta_1 \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix} \]
What if we move \(\beta_0\) & \(\beta_1\) to the other side of the predictors…
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \beta_0 + \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \beta_1 + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix} \]
…and group the predictors and parameters into matrices
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \begin{bmatrix} \beta_0 & \beta_1 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix} \]
Let’s check the dimensions
\[ [N \times 1] = \underbrace{[N \times 2] [1 \times 2]}_{\text{OOPS!}} + [N \times 1] \]
Matrix multiplication works in a row-by-column manner
\[ \begin{aligned} \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} X \\ Y \end{bmatrix} &= \begin{bmatrix} aX + bY \\ cX + dY \end{bmatrix} \\ ~ \\ [2 \times 2] [2 \times 1] &= [2 \times 1] \end{aligned} \]
Let’s transpose the parameter vector \([\beta_0 ~~ \beta_1]^{\top}\)
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix} \]
and check the dimensions
\[ \begin{aligned} [N \times 1] &= [N \times 2] [2 \times 1] + [N \times 1] \\ &= [N \times 1] + [N \times 1] \\ &= [N \times 1] \end{aligned} \]
Lastly, we can write the model in a more compact notation
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix} \\ \Downarrow \\ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \]
The matrix form is generalizable to multiple predictors
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & x_{2,1} \\ 1 & x_{1,2} & x_{2,2} \\ \vdots & \vdots & \vdots \\ 1 & x_{1,N} & x_{2,N} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix} \\ \Downarrow \\ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \]
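To make the matrix version concrete, here is a minimal sketch in plain Python (made-up numbers and hypothetical predictor values, not from the lecture) that builds the design matrix \(\mathbf{X}\) and computes \(\mathbf{X} \boldsymbol{\beta}\) row by row:

```python
# Sketch: design matrix with an intercept column plus two hypothetical
# predictors; fitted values come from X * beta, one row at a time.
x1 = [1.0, 2.0, 3.0, 4.0]                  # first predictor (made-up values)
x2 = [0.5, 1.5, 2.5, 3.5]                  # second predictor (made-up values)
X = [[1.0, a, b] for a, b in zip(x1, x2)]  # N x 3 design matrix

beta = [2.0, 0.5, -1.0]                    # [beta_0, beta_1, beta_2]

# Each fitted value is the dot product of one row of X with beta
y_hat = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]
print(y_hat)                               # -> [2.0, 1.5, 1.0, 0.5]
```

Each fitted value is the dot product of one row of \(\mathbf{X}\) with \(\boldsymbol{\beta}\), which is exactly the row-by-column rule above.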
In general, we have something like
\[ DATA = MODEL + ERRORS \]
Ideally we have something like
\[ DATA \approx MODEL \]
and hence
\[ ERRORS \approx 0 \]
From this it follows that
\[ \text{Var}(DATA) = \text{Var}(MODEL) + \text{Var}(ERRORS) \]
Our hope is that
\[ \text{Var}(DATA) \approx \text{Var}(MODEL) \]
and hence
\[ \text{Var}(ERRORS) \approx 0 \]
Let’s say our model for the data is
\[ y_i = f(\text{predictors}_i) + e_i \]
and our estimate of \(y_i\) is given by
\[ \hat{y}_i = f(\text{predictors}_i) \]
The errors (residuals) are then given by
\[ y_i = \hat{y}_i + e_i \\ \Downarrow \\ e_i = y_i - \hat{y}_i \\ \]
Given our expression for the errors (residuals)
\[ e_i = y_i - \hat{y}_i \\ \]
we want to minimize each of the \(e_i\)
Specifically, we want to minimize the sum of their squares
\[ \min \sum_{i = 1}^{N}e_i^2 ~ \Rightarrow ~ \min \sum_{i = 1}^{N}(y_i - \hat{y}_i)^2 \]
For our linear regression model, we have
\[ \min \sum_{i = 1}^{N}(y_i - \hat{y}_i)^2 \\ \Downarrow \\ \min \sum_{i = 1}^{N} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2 \]
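As a supplementary numerical sketch (made-up data, not from the lecture), the SSE for candidate values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) is easy to compute directly:

```python
# Sketch: sum of squared errors for a candidate (beta_0, beta_1) on
# made-up data; a smaller SSE means a better-fitting line.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

def sse(b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(sse(0.0, 2.0))   # candidate line y = 2x fits well (SSE ~ 0.10)
print(sse(0.0, 1.0))   # a much worse candidate (SSE ~ 29.5)
```

Least squares picks the \((\hat{\beta}_0, \hat{\beta}_1)\) pair that makes this quantity as small as possible.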
Recall that matrix multiplication works in a row-by-column manner
\[ \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} aX + bY \\ cX + dY \end{bmatrix} \]
If \(\mathbf{v}\) is an \([n \times 1]\) column vector & \(\mathbf{v}^{\top}\) is its \([1 \times n]\) transpose,
multiplying \(\mathbf{v}^{\top} \mathbf{v}\) gives the sum of the squared values in \(\mathbf{v}\)
\[ \begin{aligned} \begin{bmatrix} a & b & c \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} &= \begin{bmatrix} a^2 + b^2 + c^2 \end{bmatrix} \\ ~ \\ [1 \times n] [n \times 1] &= [1 \times 1] \end{aligned} \]
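A one-line check of this fact in plain Python (arbitrary made-up vector):

```python
# Sketch: for a column vector v, the product v'v is the sum of squares
v = [3.0, 4.0, 12.0]
vTv = sum(vi * vi for vi in v)   # [1 x n][n x 1] -> a scalar
print(vTv)                       # -> 169.0, i.e. 3^2 + 4^2 + 12^2
```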
Writing our linear regression model in matrix form, we have
\[ \mathbf{y} = \mathbf{X} \hat{\boldsymbol{\beta}} + \mathbf{e} \\ \Downarrow \\ \mathbf{e} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}} \]
so the sum of squared errors (SSE) is
\[ \mathbf{e}^{\top} \mathbf{e} = (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^{\top} (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}) \]
For example, at what value of \(x\) does this parabola reach its minimum?
\[ y = 2 x^2 - 3 x + 1 \]
Recall from calculus that we can
differentiate \(y\) with respect to \(x\): \(~~~ \frac{dy}{dx} = 4 x - 3\)
set the result to 0: \(~~~ 4 x - 3 = 0\)
solve for \(x\): \(~~~ x = \tfrac{3}{4}\)
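We can sanity-check this result numerically (a supplementary sketch, not from the slides):

```python
# Sketch: verify that y = 2x^2 - 3x + 1 is smallest near x = 3/4
def y(x):
    return 2 * x**2 - 3 * x + 1

x_star = 3 / 4
# y at the candidate minimum is no larger than at nearby points
assert y(x_star) <= y(x_star - 0.01)
assert y(x_star) <= y(x_star + 0.01)
print(y(x_star))   # -> -0.125
```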
We want to minimize the sum of squared errors (SSE)
\[ \begin{aligned} \mathbf{e}^{\top} \mathbf{e} &= (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^{\top} (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}) \\ &= \mathbf{y}^{\top} \mathbf{y} - \mathbf{y}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}^{\top} \mathbf{X}^{\top} \mathbf{y} + \hat{\boldsymbol{\beta}}^{\top} \mathbf{X}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}} \end{aligned} \]
which implies we want the derivative of \(\mathbf{e}^{\top} \mathbf{e}\), but with respect to what?
The data (\(\mathbf{y}, \mathbf{X}\)) are fixed, but we can change \(\hat{\boldsymbol{\beta}}\)
We want the partial derivative of \(\mathbf{e}^{\top} \mathbf{e}\) with respect to \(\hat{\boldsymbol{\beta}}\)
\[ \frac{\partial}{\partial \hat{\boldsymbol{\beta}}} \left( \mathbf{y}^{\top} \mathbf{y} - \mathbf{y}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}^{\top} \mathbf{X}^{\top} \mathbf{y} + \hat{\boldsymbol{\beta}}^{\top} \mathbf{X}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}} \right) \]
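As a supplementary note (not from the original slides), expanding this derivative uses two standard matrix-calculus identities:

\[ \frac{\partial (\mathbf{a}^{\top} \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \mathbf{a} \qquad \text{and} \qquad \frac{\partial (\boldsymbol{\beta}^{\top} \mathbf{A} \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = 2 \mathbf{A} \boldsymbol{\beta} ~~ \text{for symmetric} ~ \mathbf{A} \]

Here \(\mathbf{a} = \mathbf{X}^{\top} \mathbf{y}\) and \(\mathbf{A} = \mathbf{X}^{\top} \mathbf{X}\); note also that \(\mathbf{y}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}^{\top} \mathbf{X}^{\top} \mathbf{y}\) because each is a \([1 \times 1]\) scalar.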
\(1.\) differentiate \(\mathbf{e}^{\top} \mathbf{e}\) with respect to \(\hat{\boldsymbol{\beta}}\) (via several steps)
\[\frac{\partial (\mathbf{e}^{\top} \mathbf{e})}{\partial \hat{\boldsymbol{\beta}}} = -2 \mathbf{X}^{\top} \mathbf{y} + 2 \mathbf{X}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}}\]
\(2.\) set the result to 0
\[-2 \mathbf{X}^{\top} \mathbf{y} + 2 \mathbf{X}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}} = 0\]
\(3.\) solve for \(\hat{\boldsymbol{\beta}}\)
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}\]
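As a supplementary numerical sketch (made-up data, plain Python so every matrix operation is explicit), we can solve the normal equations for a single predictor by inverting the \(2 \times 2\) matrix \(\mathbf{X}^{\top} \mathbf{X}\) directly:

```python
# Sketch: least squares via beta-hat = (X'X)^{-1} X'y for one predictor.
# The data are made up so that the line y = 1 + 2x is recovered exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]           # exactly 1 + 2x, so all errors are zero

N = len(x)
# For a design matrix [1, x], X'X = [[N, sum x], [sum x, sum x^2]]
sx, sxx = sum(x), sum(xi * xi for xi in x)
# and X'y = [sum y, sum x*y]
sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))

# Invert the 2x2 matrix X'X and multiply by X'y
det = N * sxx - sx * sx
b0 = (sxx * sy - sx * sxy) / det
b1 = (N * sxy - sx * sy) / det
print(b0, b1)                      # -> 1.0 2.0
```

With more predictors the same formula applies, but we would let a linear-algebra routine do the inversion rather than writing it out by hand.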
Returning to our estimate of the data, we have
\[ \begin{aligned} \hat{\mathbf{y}} &= \mathbf{X} \hat{\boldsymbol{\beta}} \\ &= \mathbf{X} \left( (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \right) \\ &= \underbrace{\mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}}_{\mathbf{H}} \mathbf{y} \\ &= \mathbf{H} \mathbf{y} \end{aligned} \]
\(\mathbf{H}\) is called the “hat matrix” because it maps \(\mathbf{y}\) onto \(\hat{\mathbf{y}}\) (“y-hat”)
Consider for a moment what it means if
\[ \hat{\mathbf{y}} = \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \]
We can estimate the data without any model parameters!
\(^*\)parameters are not multiplied or divided by other parameters, nor do they appear as an exponent
What can we say about \(\hat{\boldsymbol{\beta}}\) when estimated this way?
It’s the maximum likelihood estimate (MLE)
It’s the best linear unbiased estimate (BLUE)
NOTE: these properties only hold if the errors (\(e_i\)) are independent and identically distributed (IID)
How do we know if our errors are IID?
We will discuss this more in later lectures
Recall the solution for \(\hat{\boldsymbol{\beta}}\) where
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \]
If the quantity \(\mathbf{X}^{\top} \mathbf{X}\) is not invertible, then \(\hat{\boldsymbol{\beta}}\) is partially unidentifiable.
This occurs when the columns of \(\mathbf{X}\) are not linearly independent
(i.e., \(\mathbf{X}\) is not of “full rank”)
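A quick numerical illustration (hypothetical data): if one column of \(\mathbf{X}\) is a multiple of another, \(\mathbf{X}^{\top} \mathbf{X}\) has determinant zero, so its inverse does not exist and \(\hat{\boldsymbol{\beta}}\) cannot be computed uniquely:

```python
# Sketch: when one predictor is a multiple of another, det(X'X) = 0,
# so (X'X)^{-1} does not exist and beta-hat is not identifiable.
x1 = [1.0, 2.0, 3.0]
x2 = [2.0, 4.0, 6.0]                    # exactly 2 * x1 -> columns dependent

# Entries of X'X for the two predictor columns (intercept omitted for brevity)
a = sum(v * v for v in x1)              # x1'x1
b = sum(u * v for u, v in zip(x1, x2))  # x1'x2
d = sum(v * v for v in x2)              # x2'x2

det = a * d - b * b                     # determinant of the 2x2 X'X
print(det)                              # -> 0.0, so X'X is singular
```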