- Recognize that diagnostic checks are necessary for any model
- Learn how to check for constant variance, normally distributed errors, and autocorrelation
- Learn how to check for outlying or influential observations
13 April 2020
We have seen how to fit models, estimate parameters with uncertainty, and conduct hypothesis tests
All of these rely on a number of assumptions about the form of our model and the distribution of its errors
Our focus here is on linear models, and we saw previously that we can use linear models to approximate nonlinear functions
The specific form of the model should reflect our understanding of the system and any particular hypotheses we’d like to test
So far our models have assumed the errors to be independent and identically distributed (IID)
What exactly does this mean?
Let’s begin with the notion of “identically distributed”, which suggests no change in the variance across the model space
For example, if our errors are assumed to be normally distributed, such that
\[ \epsilon_i \sim \text{N}(0, \sigma^2) ~ \Rightarrow ~ \boldsymbol{\epsilon} \sim \text{MVN}(\mathbf{0}, \sigma^2 \mathbf{I}) \]
then we expect no difference in \(\sigma^2\) among any of the \(\epsilon_i\).
To check this assumption, we can plot our estimates \(\hat{\epsilon}_i = e_i = y_i - \hat{y}_i\) against our fitted values \(\hat{y}_i\) and look for any patterns
For a finer resolution, we can also plot \(\sqrt{| \hat{\epsilon}_i |}\) against our fitted values \(\hat{y}_i\) and look for any patterns
The distribution of \(| \hat{\epsilon}_i |\) is a skewed half-normal on the positive interval; the square-root transformation makes them less skewed
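As a minimal sketch of both plots (using simulated data, since no specific dataset is shown here; the model mm and residuals ee are hypothetical names reused in the examples below):

```r
## simulate some data & fit a simple linear regression
set.seed(42)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)
mm <- lm(y ~ x)

## residuals & fitted values
ee <- residuals(mm)
yhat <- fitted(mm)

## residuals vs fitted values
plot(yhat, ee, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = "dashed")

## sqrt(|residuals|) vs fitted values
plot(yhat, sqrt(abs(ee)), xlab = "Fitted values",
     ylab = expression(sqrt(abs(residuals))))
```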
We can formally test the assumption of homogeneous variance via Levene’s Test, which compares the absolute values of the residuals among \(j\) groups of data
\[ Z_{ij} = \left| y_{ij} - \hat{y}_j \right| \]
Levene’s test is then a one-way ANOVA of these absolute residuals \(Z_{ij}\)
The statistic for Levene’s Test is
\[ W = \frac{(N-k)}{(k-1)} \cdot \frac{\sum_{j=1}^{k} n_{j} \left( \bar{Z}_{j} - \bar{Z} \right)^{2}}{\sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( Z_{ij} - \bar{Z}_{j} \right)^{2}} \]

where \(\bar{Z}_{j}\) is the mean of the \(Z_{ij}\) within group \(j\) and \(\bar{Z}\) is the overall mean of the \(Z_{ij}\)
The test statistic \(W\) is approximately \(F\)-distributed with \(k-1\) and \(N-k\) degrees of freedom
A simple two-group version of Levene’s Test is easy to compute in R
```r
## split residuals (ee) into 2 groups
g1 <- ee[ee < median(ee)]
g2 <- ee[ee > median(ee)]
## compare the group variances with an F test
var.test(g1, g2)
```
```
## 
##  F test to compare two variances
## 
## data:  g1 and g2
## F = 0.90486, num df = 14, denom df = 14, p-value = 0.8543
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3037877 2.6951999
## sample estimates:
## ratio of variances 
##          0.9048584
```
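Note that var.test() is an \(F\)-test of the ratio of two variances rather than Levene’s Test proper; if the car package is available, its leveneTest() function implements Levene’s Test directly (a sketch, reusing the residuals ee from above):

```r
## Levene's Test via the car package
library(car)
## group indicator: below vs at/above the median residual
grp <- factor(ee >= median(ee))
leveneTest(ee, grp)
```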
What can we do if we find evidence of heteroscedasticity?
Try a transformation or weighted least squares, which we will see later this week
We can also plot the residuals against any potential predictors that were not included in the model
If we see a (linear) pattern, then consider including that predictor in a new model
We seek a method for assessing whether our residuals are indeed normally distributed
The easiest way is via a so-called \(Q\)-\(Q\) plot (for quantile-quantile)
which we can create in R with qqnorm()
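For example, with the residuals from our hypothetical model mm:

```r
## Q-Q plot of the residuals
qqnorm(residuals(mm))
## add a reference line through the first & third quartiles
qqline(residuals(mm))
```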
One component of IID errors is “independent”
This means we expect no correlation among any of the errors
We might expect to find correlated errors when working with

- Temporal data
- Spatial data
- Blocked data
Consider a model for tree growth as a function of temperature
Closer examination of the residuals reveals a problem
We can estimate the autocorrelation function in R with acf()
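For example, applied to the residuals of a fitted model (again using the hypothetical mm):

```r
## autocorrelation function of the residuals;
## spikes outside the dashed bounds suggest autocorrelation
acf(residuals(mm))
```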
It is often the case that one or more data points do not fit our model well
We refer to these as outliers
Some outliers affect the fit of the model
We refer to these as influential observations
Leverage points are extreme in the predictor \((X)\) space
They may or may not affect model fit
Remember the “hat matrix”, \(\mathbf{H} = \mathbf{X} \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top}\)?
The values along the diagonal \(h_i = \mathbf{H}_{ii}\) are the leverages
Also recall that
\[ \text{Var}(\hat{\epsilon}_i) = \sigma^2 (1 - h_i) \]
Large \(h_i\) lead to small \(\text{Var}(\hat{\epsilon}_i)\) & hence \(\hat{y}_i\) tends toward \(y_i\)
\(\mathbf{H}\) has dimensions \(n \times n\) and \(\text{trace}(\mathbf{H}) = \sum_{i = 1}^n h_i = k\)
Thus, on average we should expect that \(\bar{h} = \frac{k}{n}\)

Any \(h_i > 2 \frac{k}{n}\) deserves closer inspection
We can easily compute the \(h_i\) in R via the function hatvalues()
```r
## leverages of points in middle plot on slide 30
hv <- hatvalues(m2)
## threshold value for h_i ~= 0.36
th <- 2 * (2 / length(hv))
## are any h_i > th?
hv > th
```
```
##     1     2     3     4     5     6     7     8     9    10    11 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
```
We can also identify high leverage via a half-normal plot in R
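One option is halfnorm() from the faraway package (assuming it is installed), applied to the leverages:

```r
## half-normal plot of the leverages; label the 2 largest
library(faraway)
halfnorm(hv, nlab = 2, ylab = "Leverages")
```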
We can use the leverages to scale the residuals so their variance is 1
\[ r_i = \frac{\hat{\epsilon}_i}{\hat{\sigma} \sqrt{1 - h_i} } \]
Doing so allows for easy examination via \(Q\)-\(Q\) plots as values should lie on the 1:1 line
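R returns these standardized residuals via rstandard(); a sketch with the hypothetical model mm:

```r
## standardized residuals
rs <- rstandard(mm)
## Q-Q plot with a 1:1 reference line
qqnorm(rs)
abline(a = 0, b = 1, lty = "dashed")
```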
Standardized residuals from the high leverage example
One way to detect outliers is to estimate \(n\) different models where we exclude one data point from each model
More formally we have
\[ \hat{y}_{(i)} = \mathbf{X}_{i}^{\top} \hat{\boldsymbol{\beta}}_{(i)} \]

where \((i)\) indicates that the \(i\)th datum was omitted when estimating \(\boldsymbol{\beta}\), and \(\mathbf{X}_{i}\) is the \(i\)th row of \(\mathbf{X}\)
If \(y_{i} - \hat{y}_{(i)}\) is large, then observation \(i\) is an outlier
To evaluate the size of a particular outlier we need to scale the residuals
This is similar to scaling a parameter estimate by its standard deviation to test model hypotheses, with
\[ t_i = \frac{\hat{\beta}_i}{\text{SE} \left( \hat{\beta}_i \right)} \]
and we compare it to a \(t\)-distribution with \(n - k\) degrees of freedom
It turns out that the variance of the difference \(y_{i} - \hat{y}_{(i)}\) is just like that for a prediction interval
\[ \widehat{\text{Var}} \left( y_{i}-\hat{y}_{(i)} \right) = \hat{\sigma}_{(i)}^{2} \left( 1 + \mathbf{X}_{i}^{\top} \left( \mathbf{X}_{(i)}^{\top} \mathbf{X}_{(i)} \right)^{-1} \mathbf{X}_{i} \right) \]
We can now compute the “studentized” (scaled) residuals as
\[ t_i = \frac{y_{i} - \hat{y}_{(i)}}{ \hat{\sigma}_{(i)} \sqrt{ 1 + \mathbf{X}_{i}^{\top} \left( \mathbf{X}_{(i)}^{\top} \mathbf{X}_{(i)} \right)^{-1} \mathbf{X}_{i} } } \]
which are distributed as a \(t\) distribution with \(n - k - 1\) df
There is an easier way to do this without fitting \(n\) different models, where

\[ t_i = \frac{\hat{\epsilon}_i}{ \hat{\sigma}_{(i)} \sqrt{ 1 - h_i } } = r_i \sqrt{ \frac{n - k - 1}{n - k - r_i^2} } \]

and \(r_i\) is the standardized residual for the \(i\)th case based on a model that includes all of the data
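R returns these studentized residuals via rstudent() without refitting any models; a sketch continuing with the hypothetical mm:

```r
## studentized residuals from the full-data model
tres <- rstudent(mm)
## two-sided p-value for the largest |t_i| against a
## t distribution with n - k - 1 degrees of freedom
n <- length(tres)
k <- length(coef(mm))
2 * pt(max(abs(tres)), df = n - k - 1, lower.tail = FALSE)
```

Because this implicitly screens all \(n\) residuals, a Bonferroni correction (multiplying the p-value by \(n\)) is advisable before declaring an outlier.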
Some points to consider
What can be done about outliers?
Influential observations might be neither outliers nor high-leverage points, but we still want to identify them
Cook’s Distance \((D)\) is a popular choice, where
\[ D_i = \frac{r_i^2}{k} \left( \frac{h_i}{1 - h_i} \right) \]

\(D_i\) scales the squared standardized residuals \(r_i^2\) by the number of parameters \(k\) and the leverage \(h_i\)
We can evaluate Cook’s \(D\) with a half-normal plot
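In R, cooks.distance() returns the \(D_i\), which we can again pass to halfnorm() (assuming the faraway package, as above):

```r
## Cook's distances for the hypothetical model mm
library(faraway)
cd <- cooks.distance(mm)
## half-normal plot; label the 3 largest values
halfnorm(cd, nlab = 3, ylab = "Cook's distance")
```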
When fitting linear models via least squares we make several assumptions about our model
The importance of our assumptions can be ranked as

1. Systematic form of the model: if we get this wrong, explanations & predictions will be off
2. Independence of errors: dependence (correlation) among errors means there is less info in the data than the sample size suggests
3. Non-constant variance: this may affect inference and confidence/prediction intervals
4. Normality of errors: this is less of a concern as sample size increases