- Understand the concept and practice of partitioning sums-of-squares
- Understand the uses of \(R^2\) and adjusted \(R^2\) for linear models
- Understand the use of F-tests for hypothesis testing
- Understand how to estimate confidence intervals
10 April 2020
In general, we have something like
\[ DATA = MODEL + ERRORS \]
and hence
\[ \text{Var}(DATA) = \text{Var}(MODEL) + \text{Var}(ERRORS) \]
The total deviations in the data equal the sum of those for the model and errors
\[ \underbrace{y_i - \bar{y}}_{\text{Total}} = \underbrace{\hat{y}_i - \bar{y}}_{\text{Model}} + \underbrace{y_i - \hat{y}_i}_{\text{Error}} \]
Here is a plot of some data \(y\) and a predictor \(x\)
And let’s consider this model: \(y_i = \alpha + \beta x_i + e_i\)
The total sum-of-squares \((SSTO)\) measures the total variation in the data as the differences between the data and their mean
\[ SSTO = \sum \left( y_i - \bar{y} \right)^2 \]
The model (regression) sum-of-squares \((SSR)\) measures the variation between the model fits and the mean of the data
\[ SSR = \sum \left( \hat{y}_i - \bar{y} \right)^2 \]
The error sum-of-squares \((SSE)\) measures the variation between the data and the model fits
\[ SSE = \sum \left( y_i - \hat{y}_i \right)^2 \]
The sums-of-squares have the same additive property as the deviations
\[ \underbrace{\sum (y_i - \bar{y})^2}_{SSTO} = \underbrace{\sum (\hat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum (y_i - \hat{y}_i)^2}_{SSE} \]
How about a measure of how well a model fits the data?
A common option is the coefficient of determination or \((R^2)\)
\[ R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \\ ~ \\ 0 \leq R^2 \leq 1 \]
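Here is a minimal R sketch with simulated data (the sample size, predictor values, and coefficients are arbitrary) showing the partitioning of the sums-of-squares and \(R^2\)

```r
## simulate some data for the model y_i = alpha + beta * x_i + e_i
set.seed(123)
n <- 30
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n)

## fit the model and get the fitted values
fit <- lm(y ~ x)
y_hat <- fitted(fit)

## sums-of-squares
SSTO <- sum((y - mean(y))^2)       # total
SSR  <- sum((y_hat - mean(y))^2)   # model (regression)
SSE  <- sum((y - y_hat)^2)         # error

## the partition SSTO = SSR + SSE holds (up to rounding)
all.equal(SSTO, SSR + SSE)

## R^2 computed both ways matches summary(fit)$r.squared
SSR / SSTO
1 - SSE / SSTO
summary(fit)$r.squared
```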
Degrees of freedom are the number of independent elements that are free to vary when estimating quantities of interest
Beginning with \(SSTO\), we have
\[ SSTO = \sum \left( y_i - \bar{y} \right)^2 \]
The data are unconstrained and lie in an \(n\)-dimensional space, but estimating the mean \((\bar{y})\) from the data costs 1 degree of freedom \((df)\), so
\[ df_{SSTO} = n - 1 \]
For the \(SSR\) we have
\[ SSR = \sum \left( \hat{y}_i - \bar{y} \right)^2 \]
We estimate the data \((\hat{y})\) with a \(k\)-dimensional model, but we lose 1 \(df\) when estimating the mean, so
\[ df_{SSR} = k - 1 \]
The \(SSE\) is analogous
\[ SSE = \sum \left( y_i - \hat{y}_i \right)^2 \]
The data lie in an \(n\)-dimensional space and we represent them in a \(k\)-dimensional subspace, so
\[ df_{SSE} = n - k \]
The scaled sums-of-squares, or “mean squares”, give an indication of the variance for the model and errors
A mean square is a sum-of-squares divided by its degrees of freedom
\[ MS = \frac{SS}{df} \\ \Downarrow \\ MSR = \frac{SSR}{k - 1} ~~~ \& ~~~ MSE = \frac{SSE}{n - k} \]
We are typically interested in two variance estimates:
- The variance of the residuals \(\mathbf{e}\)
- The variance of the model parameters \(\hat{\boldsymbol{\beta}}\)
In a least squares context, we assume that the model errors (residuals) are independent and identically distributed with mean 0 and variance \(\sigma^2\)
The problem is that we don’t know \(\sigma^2\) and therefore we must estimate it
If \(z_i \sim \text{N}(0, 1)\) then
\[ \sum_{i = 1}^{n} z_i^2 = \mathbf{z}^{\top}\mathbf{z} \sim \chi^2_{n} \]
In our linear model, \(e_i \sim \text{N}(0, \sigma^2)\) so
\[ \sum_{i = 1}^{n} e_i^2 = \mathbf{e}^{\top}\mathbf{e} \sim \sigma^2 \cdot \chi^2_{n - k} \]
Thus, given
\[ \mathbf{e}^{\top}\mathbf{e} \sim \sigma^2 \cdot \chi^2_{n - k} \\ \text{E}(\chi^2_{n - k}) = n - k \\ \mathbf{e}^{\top}\mathbf{e} = SSE \]
then
\[ \text{E}(SSE) = \sigma^2 (n - k) ~ \Rightarrow ~ \hat{\sigma}^2 = \frac{SSE}{n - k} = MSE \]
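Continuing with the simulated sketch above, we can check that the square root of \(MSE\) matches the residual standard error reported by R

```r
## estimate sigma^2 as MSE, using `fit`, `n` from the sketch above
k   <- length(coef(fit))          # number of estimated parameters (here 2)
SSE <- sum(residuals(fit)^2)
MSE <- SSE / (n - k)

sqrt(MSE)                         # residual standard error by hand
sigma(fit)                        # should match R's built-in estimate
```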
Recall that our estimate of the model parameters is
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \]
Estimating the variance of the model parameters \(\boldsymbol{\beta}\) requires some linear algebra
For a scalar \(z\), if \(\text{Var}(z) = \sigma^2\) then \(\text{Var}(az) = a^2 \sigma^2\)
For a vector \(\mathbf{z}\), if \(\text{Var}(\mathbf{z}) = \mathbf{\Sigma}\) then \(\text{Var}(\mathbf{A z}) = \mathbf{A} \mathbf{\Sigma} \mathbf{A}^{\top}\)
The variance of the parameters is therefore
\[ \begin{aligned} \hat{\boldsymbol{\beta}} &= (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \\ &= \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \mathbf{y} \\ \end{aligned} \\ \Downarrow \\ \text{Var}(\hat{\boldsymbol{\beta}}) = \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \text{Var}(\mathbf{y}) \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right]^{\top} \]
Recall that we can write our model in matrix form as
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ \mathbf{e} \sim \text{MVN}(\mathbf{0}, \sigma^2 \mathbf{I}) \]
We can rewrite our model more compactly as
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ \mathbf{e} \sim \text{MVN}(\mathbf{0}, \sigma^2 \mathbf{I}) \\ \Downarrow \\ \mathbf{y} \sim \text{MVN}(\mathbf{X} \boldsymbol{\beta}, \underbrace{\sigma^2 \mathbf{I}}_{\text{Var}(\mathbf{y} | \mathbf{X} \boldsymbol{\beta})}) \\ \]
Our estimate of \(\text{Var}(\hat{\boldsymbol{\beta}})\) is then
\[ \begin{aligned} \text{Var}(\hat{\boldsymbol{\beta}}) &= \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \text{Var}(\mathbf{y}) \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right]^{\top} \\ &= \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \sigma^2 \mathbf{I} \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right]^{\top} \\ &= \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} (\mathbf{X}^{\top} \mathbf{X}) \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \right]^{\top} \\ &= \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \end{aligned} \]
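Again continuing the simulated sketch, we can verify \(\sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}\) against the covariance matrix returned by vcov()

```r
## parameter covariance matrix by hand, using `fit`, `n`, and `k` from above
X <- model.matrix(fit)                       # design matrix [1, x]
sigma2_hat <- sum(residuals(fit)^2) / (n - k)

V_beta <- sigma2_hat * solve(t(X) %*% X)     # sigma^2 (X'X)^-1

V_beta
vcov(fit)                                    # should match
```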
Let’s think about the variance of \(\hat{\boldsymbol{\beta}}\)
\[ \text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \]
This suggests that our confidence in our estimate increases with the spread in \(\mathbf{X}\)
Consider these two scenarios where the slope of the relationship is identical, but the spread of the predictor \(x\) differs
Once we’ve estimated the model parameters and their variance, we might want to draw conclusions from our analysis
Imagine we had two linear models of varying complexity:
1. a model with one predictor
2. a model with five predictors
It would seem logical to ask whether the complexity of model (2) is necessary
Recall our partitioning of sums-of-squares, where
\[ SSTO = SSR + SSE \]
We might prefer the more complex model (call it \(\Theta\)) over the simple model (call it \(\theta\)) if
\[ SSE_{\Theta} < SSE_{\theta} \]
or, more formally, if
\[ \frac{SSE_{\theta} - SSE_{\Theta}}{SSE_{\Theta}} > \text{a constant} \]
If \(\Theta\) has \(k_{\Theta}\) parameters and \(\theta\) has \(k_{\theta}\), we can scale this ratio to arrive at an \(F\)-statistic that follows an \(F\) distribution
\[ F = \frac{ \left( SSE_{\theta} - SSE_{\Theta} \right) / (k_{\Theta} - k_{\theta})}{ SSE_{\Theta} / (n - k_{\Theta})} \sim F_{k_{\Theta} - k_{\theta}, n - k_{\Theta}} \]
The \(F\)-distribution arises as the ratio of two independent \(\chi^2\) random variates, each scaled by its degrees of freedom
If \(A \sim \chi_{df_{A}}^{2}\) and \(B \sim \chi_{df_{B}}^{2}\) are independent, then
\[ \frac{\left( \frac{A}{df_{A}} \right) }{ \left( \frac{B}{df_{B}} \right) } \sim F _{df_{A},df_{B}} \]
Suppose we wanted to test whether the collection of predictors in a model were better than simply estimating the data by their mean.
\[ \Theta: \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \\ \theta: \mathbf{y} = \boldsymbol{\mu} + \mathbf{e} \\ \]
We write the null hypothesis as
\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \]
and we would reject \(H_0\) if \(F > F^{(\alpha)}_{k_{\Theta} - k_{\theta}, n - k_{\Theta}}\)
\[ SSE_{\Theta} = \left( \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}} \right)^{\top} \left( \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}} \right) = \mathbf{e}^{\top} \mathbf{e} = SSE \\ SSE_{\theta} = \left( \mathbf{y} - \bar{y} \right)^{\top} \left( \mathbf{y} - \bar{y} \right) = SSTO \\ \Downarrow \\ F = \frac{ \left( SSTO - SSE \right) / (k - 1) } { SSE / (n - k)} \]
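As a check, here is this overall \(F\)-test computed by hand for the simulated sketch above; it matches the \(F\)-statistic reported by summary()

```r
## overall F-test of H0: all slopes = 0, using `y`, `fit`, `n`, `k` from above
SSTO <- sum((y - mean(y))^2)
SSE  <- sum(residuals(fit)^2)

F_stat <- ((SSTO - SSE) / (k - 1)) / (SSE / (n - k))
p_val  <- pf(F_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

c(F = F_stat, p = p_val)
summary(fit)$fstatistic           # same F with its degrees of freedom
```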
Later in lab we will work with the gala dataset\(\dagger\) in the faraway package, which contains data on the diversity of plant species across 30 Galapagos islands
For now, let’s hypothesize that the number of species on an island is a function of its area, elevation, and distance to the nearest island
\(\dagger\)From Johnson & Raven (1973) Science 179:893-895
We might ask whether any one predictor could be dropped from a model
For example, can \(\text{nearest}\) be dropped from our full model?
\[ \text{species}_i = \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \]
One option is to fit these two models and compare them via our \(F\)-test with \(H_0: \beta_3 = 0\)
\[ \begin{aligned} \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \\ ~ \\ \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \epsilon_i \end{aligned} \]
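A sketch of this comparison in R, assuming the faraway package is installed (its column names are capitalized, e.g., Species, Area)

```r
## F-test for dropping `Nearest` from the full model
library(faraway)
data(gala)

full    <- lm(Species ~ Area + Elevation + Nearest, data = gala)
reduced <- lm(Species ~ Area + Elevation, data = gala)

## anova() computes F = [(SSE_reduced - SSE_full) / 1] / [SSE_full / (n - k)]
anova(reduced, full)
```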
Another option is to estimate a \(t\)-statistic as
\[ t_i = \frac{\hat{\beta}_i}{\text{SE} \left( \hat{\beta}_i \right)} \]
and compare it to a \(t\)-distribution with \(n - k\) degrees of freedom
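For example, the \(t\)-statistic for \(\text{nearest}\) can be pulled from the fitted full model in the sketch above; note that \(t^2\) equals the \(F\)-statistic from the model comparison

```r
## t-test of H0: beta_nearest = 0, using `full` from the sketch above
b <- coef(summary(full))                   # estimates, SEs, t values, p values
t_stat <- b["Nearest", "Estimate"] / b["Nearest", "Std. Error"]

t_stat                                     # matches the "t value" column
2 * pt(abs(t_stat), df = df.residual(full), lower.tail = FALSE)
```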
Sometimes we might want to know whether we can drop 2+ predictors from a model
For example, can we drop both \(\text{elevation}\) and \(\text{nearest}\) from our full model?
\[ \begin{aligned} \text{species}_i &= \alpha + \beta_1 \text{area}_i + \beta_2 \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \\ ~ \\ \text{species}_i &= \alpha + \beta_1 \text{area}_i + \epsilon_i \end{aligned} \]
\(H_0 : \beta_2 = \beta_3 = 0\)
Some tests cannot be expressed in terms of the inclusion or exclusion of predictors
Consider a test of whether the areas of the current and adjacent island could be added together and used in place of the two separate predictors
\[ \text{species}_i = \alpha + \beta_1 \text{area}_i + \beta_2 \text{adjacent}_i + \dots + \epsilon_i \\ ~ \\ \text{species}_i = \alpha + \beta_1 \text{(area + adjacent)}_i + \dots + \epsilon_i \]
\(H_0 : \beta_{\text{area}} = \beta_{\text{adjacent}}\)
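One way to sketch this test in R is to fit the model with the two areas summed and compare it to the unconstrained model (the other predictors shown here are only illustrative)

```r
## test H0: beta_area = beta_adjacent with the gala data from above
full_2   <- lm(Species ~ Area + Adjacent + Elevation + Nearest, data = gala)
combined <- lm(Species ~ I(Area + Adjacent) + Elevation + Nearest, data = gala)

## the combined model is the full model with the equality constraint imposed
anova(combined, full_2)
```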
What if we wanted to test whether a predictor had a specific (non-zero) value?
For example, is there a 1:1 relationship between \(\text{species}\) and \(\text{elevation}\) after controlling for the other predictors?
\[ \text{species}_i = \alpha + \beta_1 \text{area}_i + \underline{1} \text{elevation}_i + \beta_3 \text{nearest}_i + \epsilon_i \]
\(H_0 : \beta_2 = 1\)
We can also modify our \(t\)-test from before and use it for our comparison by including the hypothesized \(\beta_{H_0}\) as an offset
\[ t_i = \frac{\hat{\beta}_i - \beta_{H_0}}{\text{SE} \left( \hat{\beta}_i \right)} \]
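Here is a sketch of both approaches for \(H_0: \beta_2 = 1\) with the gala data from above; the \(F\) from the model comparison equals the square of the \(t\)-statistic

```r
## test H0: beta_elevation = 1
full <- lm(Species ~ Area + Elevation + Nearest, data = gala)

## option 1: t-statistic with the hypothesized value subtracted
b <- coef(summary(full))
t_stat <- (b["Elevation", "Estimate"] - 1) / b["Elevation", "Std. Error"]
2 * pt(abs(t_stat), df = df.residual(full), lower.tail = FALSE)

## option 2: fix the coefficient at 1 via offset() and compare models
constrained <- lm(Species ~ Area + Nearest + offset(1 * Elevation), data = gala)
anova(constrained, full)
```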
Null hypothesis testing (NHT) is a slippery slope
We can also use confidence intervals (CI’s) to express uncertainty in \(\hat{\beta}_i\)
They take the form
\[ 100(1 - \alpha)\% ~ \text{CI}: \hat{\beta}_{i} \pm t_{n-k}^{(\alpha / 2)} \text{SE}(\hat{\beta}_i) \]
where here \(\alpha\) is our predetermined Type-I error rate
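In R, the interval can be computed by hand or with confint(); a sketch with the gala model from above

```r
## 95% CIs for the coefficients of the full gala model
full <- lm(Species ~ Area + Elevation + Nearest, data = gala)
b <- coef(summary(full))

t_crit <- qt(1 - 0.05 / 2, df = df.residual(full))
cbind(lower = b[, "Estimate"] - t_crit * b[, "Std. Error"],
      upper = b[, "Estimate"] + t_crit * b[, "Std. Error"])

confint(full, level = 0.95)        # should match
```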
The \(F\)- and \(t\)-based CI’s we have described depend on the assumption of normality
The bootstrap\(\dagger\) method provides a way to construct CI’s without this assumption
\(\dagger\)Efron (1979) The Annals of Statistics 7:1–26
1. Fit your model to the data
2. Calculate the residuals \(\mathbf{e} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}\)
3. Resample the residuals with replacement to get \(\mathbf{e}^*\)
4. Create bootstrapped data via \(\mathbf{y}^* = \mathbf{X} \hat{\boldsymbol{\beta}} + \mathbf{e}^*\)
5. Refit the model to \(\mathbf{y}^*\) and save the estimates \(\hat{\boldsymbol{\beta}}^*\)
6. Repeat steps 3–5 many times
7. Select the \(\tfrac{\alpha}{2}\) and \((1 - \tfrac{\alpha}{2})\) percentiles from the saved \(\hat{\boldsymbol{\beta}}^*\)
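A minimal sketch of these steps in R with the gala model from above (the number of bootstrap samples and the seed are arbitrary)

```r
## residual bootstrap CIs for the gala model coefficients
full  <- lm(Species ~ Area + Elevation + Nearest, data = gala)
X     <- model.matrix(full)
y_hat <- fitted(full)
e     <- residuals(full)

set.seed(42)
n_boot <- 1000
boot_beta <- matrix(NA, nrow = n_boot, ncol = length(coef(full)))
colnames(boot_beta) <- names(coef(full))

for (i in seq_len(n_boot)) {
  e_star <- sample(e, replace = TRUE)           # resample the residuals
  y_star <- y_hat + e_star                      # bootstrapped data
  boot_beta[i, ] <- coef(lm(y_star ~ X - 1))    # refit and save the estimates
}

## 2.5% and 97.5% percentiles give a 95% CI for each coefficient
apply(boot_beta, 2, quantile, probs = c(0.025, 0.975))
```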
Given a fitted model \(\mathbf{y} = \mathbf{X} \hat{\boldsymbol{\beta}} + \mathbf{e}\), we might want to know the uncertainty around a new estimate \(\mathbf{y}^*\) given some new predictor \(\mathbf{X}^*\)
Suppose we wanted to estimate the uncertainty in the average response given by
\[ \hat{\mathbf{y}}^* = \mathbf{X}^* \hat{\boldsymbol{\beta}} \]
Recall that the general formula for a CI on a quantity \(z\) is
\[ 100(1 - \alpha)\% ~ \text{CI}: \text{E}(z) ~ \pm ~ t^{(\alpha / 2)}_{df}\text{SD}(z) \]
So we would have
\[ \hat{\mathbf{y}}^* ~ \pm ~ t^{(\alpha / 2)}_{df} \sqrt{\text{Var} \left( \hat{\mathbf{y}}^* \right)} \]
We can calculate the SD of our expectation as
\[ \begin{aligned} \text{Var} \left( \hat{\mathbf{y}}^* \right) &= \text{Var} \left( \mathbf{X}^* \hat{\boldsymbol{\beta}} \right) \\ &= {\mathbf{X}^*}^{\top} \text{Var}\left( \hat{\boldsymbol{\beta}} \right) \mathbf{X}^* \\ &= {\mathbf{X}^*}^{\top} \left[ \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \right] \mathbf{X}^* \\ &\Downarrow \\ \text{SD} \left( \hat{\mathbf{y}}^* \right) &= \sigma \sqrt{ {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \end{aligned} \]
So our CI on the mean response is given by
\[ \hat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{ {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]
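For example, with the gala model from above and a made-up new island (the predictor values in x_star are hypothetical), the hand calculation matches predict() with interval = "confidence"

```r
## CI on the mean response at a new X*
full   <- lm(Species ~ Area + Elevation + Nearest, data = gala)
X      <- model.matrix(full)
x_new  <- data.frame(Area = 50, Elevation = 300, Nearest = 5)   # hypothetical
x_star <- c(1, 50, 300, 5)                                      # with intercept

y_star <- sum(x_star * coef(full))
se_fit <- sigma(full) * sqrt(drop(t(x_star) %*% solve(t(X) %*% X) %*% x_star))
y_star + c(-1, 1) * qt(0.975, df.residual(full)) * se_fit

predict(full, newdata = x_new, interval = "confidence", level = 0.95)
```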
What about the uncertainty in a specific prediction?
In that case we need to account for our additional uncertainty owing to the error in our relationship, which is given by
\[ \hat{\mathbf{y}}^* = \mathbf{X}^* \hat{\boldsymbol{\beta}} + \mathbf{e} \]
The SD of the new prediction is given by
\[ \begin{aligned} \text{Var} \left( \hat{\mathbf{y}}^* \right) &= {\mathbf{X}^*}^{\top} \text{Var}\left( \hat{\boldsymbol{\beta}} \right) \mathbf{X}^* + \text{Var} \left( \mathbf{e} \right) \\ &= {\mathbf{X}^*}^{\top} \left[ \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \right] \mathbf{X}^* + \sigma^2\\ &= \sigma^2 \left( {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* + 1 \right) \\ &\Downarrow \\ \text{SD} \left( \hat{\mathbf{y}}^* \right) &= \sigma \sqrt{1 + {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \end{aligned} \]
So our CI on the new prediction is given by
\[ \hat{\mathbf{y}}^* \pm ~ t^{(\alpha / 2)}_{df} \sigma \sqrt{1 + {\mathbf{X}^*}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^* } \]
This is typically referred to as the prediction interval
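Continuing the sketch above, the prediction interval only adds the extra \(\sigma^2\) inside the square root; predict() with interval = "prediction" gives the same result

```r
## prediction interval at the same hypothetical X* (objects from the sketch above)
se_pred <- sigma(full) * sqrt(1 + drop(t(x_star) %*% solve(t(X) %*% X) %*% x_star))
y_star + c(-1, 1) * qt(0.975, df.residual(full)) * se_pred

predict(full, newdata = x_new, interval = "prediction", level = 0.95)
```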