- Recognize that diagnostic checks are necessary for any model
- Learn how to check for constant variance, normally distributed errors, and autocorrelation
- Learn how to check for outlying or influential observations
13 April 2026
Our focus here is on linear models, and we saw previously that we can use linear models to approximate nonlinear functions
The specific form of the model should reflect our understanding of the system and any particular hypotheses we’d like to test
We have seen how to fit models, estimate parameters with uncertainty, and conduct hypothesis tests
All of these rely on a number of assumptions about the model errors
So far our models have assumed the errors to be independent and identically distributed (IID)
What exactly does this mean?
Let’s begin with the notion of “identically distributed”, which suggests no change in the variance across the model space
For example, if our errors are assumed to be normally distributed, such that
\[ e_i \sim \text{N}(0, \sigma^2) ~ \Rightarrow ~ \boldsymbol{e} \sim \text{MVN}(\mathbf{0}, \sigma^2 \mathbf{I}) \]
then we expect no difference in \(\sigma^2\) among any of the \(e_i\)
To check this assumption, we can plot our estimates \(\hat{e}_i = y_i - \hat{y}_i\) against our fitted values \(\hat{y}_i\) and look for any patterns
For a finer resolution, we can also plot \(\sqrt{| e_i |}\) against our fitted values \(\hat{y}_i\) and look for any patterns
The distribution of \(| e_i |\) is a skewed half-normal on the positive interval; the square-root transformation makes them less skewed
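As a quick sketch of both diagnostic plots (using simulated data, not the lecture's dataset), we can do the following in R:

```r
## simulate a simple linear relationship (hypothetical data)
set.seed(42)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, 0, 1)
fit <- lm(y ~ x)
## residuals vs fitted values: look for changing spread
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
## sqrt(|e_i|) vs fitted values: finer resolution on the variance
plot(fitted(fit), sqrt(abs(resid(fit))),
     xlab = "Fitted values", ylab = expression(sqrt(abs(e))))
```

A flat, patternless scatter in both plots is consistent with constant variance.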
We can formally test the assumption of homogeneous variance via Levene’s Test, which compares the absolute values of the residuals among \(j\) groups of data
\[ Z_{ij} = \left| y_{ij} - \hat{y}_j \right| \]
Levene’s test is a one-way ANOVA of the residuals with
\[ H_0: ~ \sigma^2_1 = \sigma^2_2 = \dots = \sigma^2_k \]
The statistic for Levene’s Test is
\[ W = \frac{(n-k)}{(k-1)} \cdot \frac{\sum_{j=1}^{k} n_{j}\left(\bar{Z}_{j} - \bar{Z} \right)^{2}}{\sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( Z_{ij} - \bar{Z}_{j} \right)^{2}} \]

where \(\bar{Z}_{j}\) is the mean of the \(Z_{ij}\) in group \(j\) and \(\bar{Z}\) is the overall mean
The test statistic \(W\) is approximately \(F\)-distributed with \(k-1\) and \(n-k\) degrees of freedom
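As a sketch of the statistic itself (with simulated two-group data; the object names here are ours, not the lecture's), we can compute \(W\) directly in base R:

```r
## hypothetical data: two groups with different spreads
set.seed(1)
y <- c(rnorm(20, 0, 1), rnorm(20, 0, 1.5))
grp <- rep(1:2, each = 20)
## Z_ij = |y_ij - ybar_j| (absolute deviations from group means)
Z <- abs(y - ave(y, grp))
Zbar_j <- tapply(Z, grp, mean)   # group means of Z
Zbar <- mean(Z)                  # grand mean of Z
n <- length(y); k <- 2; n_j <- tabulate(grp)
## Levene's W
W <- ((n - k) / (k - 1)) *
  sum(n_j * (Zbar_j - Zbar)^2) / sum((Z - ave(Z, grp))^2)
## p-value from the F(k-1, n-k) reference distribution
pf(W, k - 1, n - k, lower.tail = FALSE)
```

Because \(W\) is just the one-way ANOVA \(F\) statistic applied to the \(Z_{ij}\), the same value comes from `anova(lm(Z ~ factor(grp)))`.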
Levene’s Test is easy to compute in R
```r
## split residuals (ee) into 2 groups: less than and greater than the median
g1 <- ee[ee < median(ee)]
g2 <- ee[ee > median(ee)]
## Levene's Test
var.test(g1, g2)
```
```
## 
## 	F test to compare two variances
## 
## data:  g1 and g2
## F = 0.90486, num df = 14, denom df = 14, p-value = 0.8543
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3037877 2.6951999
## sample estimates:
## ratio of variances 
##          0.9048584
```
What can we do if we find evidence of heteroscedasticity?
Try a transformation or weighted least squares, which we will see later this week
We can also plot the residuals against any potential predictors that were not included in the model
If we see a (linear) pattern, then consider including that predictor in a new model
We seek a method for assessing whether our residuals are indeed normally distributed
The easiest way is via a so-called \(Q\)-\(Q\) plot (for quantile-quantile)
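A minimal sketch of a \(Q\)-\(Q\) plot for model residuals (again with simulated data):

```r
## hypothetical fit to simulated data
set.seed(7)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)
## sample quantiles of the residuals vs theoretical normal quantiles
qqnorm(resid(fit))
## reference line; normally distributed residuals should fall near it
qqline(resid(fit))
```

Systematic curvature away from the line suggests skewed or heavy-tailed errors.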
qqnorm(x) in R
One component of IID errors is “independent”
This means we expect no correlation among any of the errors
We might expect to find correlated errors when working with
Temporal data
Spatial data
Blocked data
Consider a model for tree growth as a function of temperature
Closer examination of the residuals reveals a problem: correlated errors
We can estimate the autocorrelation function in R with acf()
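As a sketch of what this looks like (simulating autocorrelated errors rather than using the tree-growth data):

```r
## hypothetical data: a linear trend plus AR(1) errors
set.seed(123)
n <- 100
x <- seq_len(n)
e <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))
y <- 0.1 * x + e
fit <- lm(y ~ x)
## autocorrelation function of the residuals;
## spikes outside the dashed bounds suggest correlated errors
acf(resid(fit))
```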
It is often the case that one or more data points do not fit our model well
We refer to these as outliers
Some outliers affect the fit of the model
We refer to these as influential observations
Leverage points are extreme in the predictor \((X)\) space
They may or may not affect model fit
We said previously that
\(\text{var}(e_i) = \sigma^2\)
In a regression context it becomes
\(\text{var}(e_i) = \sigma^2(1 - h_i)\)
where \(h_i\) is the leverage of observation (datum) \(i\)
Remember the “hat matrix” \((\mathbf{H})\)?
The values along the diagonal \(\color{blue}{h_{i,i}}\) are the leverages
\[ \mathbf{H} = \begin{bmatrix} \color{blue}{ h_{1,1}} & h_{1,2} & \dots & h_{1,n} \\ h_{2,1} & \color{blue}{h_{2,2}} & \dots & h_{2,n} \\ \vdots & \vdots & \color{blue}{\ddots} & \vdots \\ h_{n,1} & h_{n,2} & \dots & \color{blue}{h_{n,n}} \end{bmatrix} \]
with \(\color{blue}{h_{i,i}} \leq 1\); we’ll simplify \(h_{i,i}\) to \(h_i\)
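As a sketch (with toy data), we can build \(\mathbf{H}\) explicitly and confirm that its diagonal matches R's built-in leverages:

```r
## hypothetical simple regression
set.seed(2)
x <- rnorm(10)
y <- 1 + x + rnorm(10)
## design matrix with an intercept column
X <- cbind(1, x)
## hat matrix H = X (X'X)^{-1} X'
H <- X %*% solve(t(X) %*% X) %*% t(X)
fit <- lm(y ~ x)
## the diagonal elements of H are the leverages h_i
all.equal(unname(diag(H)), unname(hatvalues(fit)))
```

Note also that `sum(diag(H))` equals \(k = 2\), the number of parameters, which is the trace result used below.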
So given that
\(\text{Var}(e_i) = \sigma^2 (1 - h_i)\) and \({h_i} \leq 1\)
What happens as \(h_i \rightarrow 1\)?
Large \(h_i\) lead to small variances of \(e_i\) & hence \(\hat{y}_i \approx y_i\)
\(\mathbf{H}\) has dimensions \(n \times n\) and
\[ \text{trace}(\mathbf{H}) = \sum_{i = 1}^n h_i = k \]

where \(k\) is the number of parameters; thus, on average we should expect that \(\bar{h} = \frac{k}{n}\)
Any \(h_i > \frac{2 k}{n}\) deserve closer inspection
We can easily compute the \(h_i\) in R via the function hatvalues()
```r
## leverages of points in middle plot on slide 37
hat_values <- hatvalues(m2)
## threshold value for h_i ~= 0.36
threshold <- 2 * (2 / length(hat_values))
## are any h_ii > threshold?
hat_values > threshold
```
```
##     1     2     3     4     5     6     7     8     9    10    11 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
```
*see .Rmd file for R code to do this
We can use the leverages to scale the residuals so their variance is 1
\[ r_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - h_i} } \]
Doing so allows for easy examination via \(Q\)-\(Q\) plots because the \(r_i\) should lie on the 1:1 line
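A sketch of this scaling by hand (on toy data), checked against R's built-in `rstandard()`:

```r
## hypothetical fit
set.seed(3)
x <- rnorm(20)
y <- 2 - x + rnorm(20)
fit <- lm(y ~ x)
## standardized residuals: r_i = e_i / (sigma_hat * sqrt(1 - h_i))
sigma_hat <- summary(fit)$sigma
h <- hatvalues(fit)
r <- resid(fit) / (sigma_hat * sqrt(1 - h))
## should agree with R's built-in version
all.equal(unname(r), unname(rstandard(fit)))
```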
Standardized residuals from the high leverage example
One way to detect outliers is to estimate \(n\) different models where we exclude one data point from each model
More formally we have
\[ \hat{\mathbf{y}}_{(i)} = \mathbf{X}_{(i)} \hat{\boldsymbol{\beta}}_{(i)} \]
where \((i)\) indicates that the \(i\)th datum has been omitted
If \(y_{i} - \hat{y}_{(i)}\) is large, then observation \(i\) is an outlier
To evaluate the size of a particular outlier, we need to scale the residuals
This is similar to scaling a parameter estimate by its standard deviation to test model hypotheses, with
\[ t_i = \frac{\hat{\beta}_i}{\text{SE} \left( \hat{\beta}_i \right)} \]
and we compare it to a \(t\)-distribution with \(n - k\) degrees of freedom
It turns out that the variance of the difference \(y_{i} - \hat{y}_{(i)}\) is just like that for a prediction interval
\[ \widehat{\text{Var}} \left( y_{i}-\hat{y}_{(i)} \right) = \hat{\sigma}_{(i)}^{2} \left( 1 + \mathbf{X}_{i}^{\top} \left( \mathbf{X}_{(i)}^{\top} \mathbf{X}_{(i)} \right)^{-1} \mathbf{X}_{i} \right) \]
We can now compute the “studentized” (scaled) residuals as
\[ t_i = \frac{y_{i} - \hat{y}_{(i)}}{ \hat{\sigma}_{(i)} \sqrt{ 1 + \mathbf{X}_{i}^{\top} \left( \mathbf{X}_{(i)}^{\top} \mathbf{X}_{(i)} \right)^{-1} \mathbf{X}_{i} } } \]
which are distributed as a \(t\) distribution with \(n - k - 1\) df
There is an easier way to do this without fitting \(n\) different models
Instead, we can define
\[ t_i = \frac{e_{i}}{ \hat{\sigma}_{(i)} \sqrt{ 1 - h_i } } = r_i \sqrt{ \frac{n - k - 1}{n - k - r_i^2} } \]

where \(e_i\) and \(r_i\) are the ordinary and standardized residuals based on a model that includes all of the data
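As a sketch (with toy data), the shortcut identity can be verified against R's built-in `rstudent()`, which computes leave-one-out studentized residuals without refitting:

```r
## hypothetical fit
set.seed(4)
x <- rnorm(25)
y <- x + rnorm(25)
fit <- lm(y ~ x)
## standardized residuals from the full-data model
r <- rstandard(fit)
n <- length(y)
k <- 2  # number of parameters (intercept + slope)
## studentized residuals via the shortcut identity
t_i <- r * sqrt((n - k - 1) / (n - k - r^2))
## should agree with R's leave-one-out version
all.equal(unname(t_i), unname(rstudent(fit)))
```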
Some points to consider
What can be done about outliers?
Influential observations might not be outliers nor have high leverage, but we want to identify them
Cook’s Distance \((D)\) is a popular choice, where
\[ D_i = r_i^2 \frac{1}{k} \left( \frac{h_i}{1 - h_i} \right) \]

\(D_i\) combines the size of the standardized residual with the leverage, scaled by the number of parameters \(k\)
We can evaluate Cook’s \(D\) with a half-normal plot
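A sketch in R (with toy data containing one deliberately influential point; the half-normal quantile formula here is one common construction):

```r
## hypothetical data with one high-leverage outlier at the end
set.seed(5)
x <- c(rnorm(19), 5)
y <- x + c(rnorm(19), 4)
fit <- lm(y ~ x)
## Cook's distances from the built-in function
D <- cooks.distance(fit)
## half-normal plot: sorted D against half-normal quantiles
n <- length(D)
q <- qnorm((n + seq_len(n)) / (2 * n + 1))
plot(q, sort(D), xlab = "Half-normal quantiles", ylab = "Sorted Cook's D")
```

Points that stand far above the rest in the upper right of the plot are candidate influential observations.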
When fitting linear models via least squares we make several assumptions about our model
The importance of our assumptions can be ranked as

1. Systematic form of the model: if we get this wrong, explanations & predictions will be off
2. Independence of errors: dependence (correlation) among errors means there is less info in the data than the sample size suggests
3. Non-constant variance: this may affect inference and confidence/prediction intervals
4. Normality of errors: this is less of a concern as sample size increases