The models we have been discussing in class assume that the errors are independent and identically distributed (IID). In this lab we will explore some options for identifying problems with our errors, and taking corrective actions after we identify them.
Let’s begin with the notion of “identically distributed”, which suggests no change in the variance across the model space. For example, if our errors are assumed to be normally distributed, such that
\[ \epsilon_i \sim \text{N}(0, \sigma^2) ~ \Rightarrow ~ \boldsymbol{\epsilon} \sim \text{MVN}(\mathbf{0}, \sigma^2 \mathbf{I}) \]
then we expect no difference in \(\sigma^2\) among any of the \(\epsilon_i\).
To check this assumption, we can plot our estimates of the residuals \(e_i = y_i - \hat{y}_i\) against our fitted values \(\hat{y}_i\) and look for any patterns. Here are three examples of simple linear regression models fit to three simulated data sets. The first is linear, but the other two exhibit some classical problems.
set.seed(514)
## sample size
nn <- 40
## set x & e
xx <- runif(nn, 0, 10)
ee <- rnorm(nn)
## linear model
y1 <- 1 + 0.5 * xx + ee
m1 <- lm(y1 ~ xx)
e1 <- resid(m1)
## log-linear model
y2 <- exp(0.1 + 0.3 * xx + ee)
m2 <- lm(y2 ~ xx)
e2 <- resid(m2)
## quadratic model
y3 <- 0.2 * (xx - 5)^2 - 1 + ee
m3 <- lm(y3 ~ xx)
e3 <- resid(m3)
Now we can plot the residuals against the fitted values and check for problems.
## set plot area
par(mfrow = c(1,3),
mai = c(0.9,0.5,0.5,0.2),
omi = c(0.5, 0.4, 0.5, 0),
cex = 0.9)
## plot errors
## well behaved errors (left)
plot(fitted(m1), e1, pch = 16, las = 1, xpd = NA,
cex.lab = 1.5, xlab = "Fitted values", ylab = "Residuals",
main = "No problem")
abline(h = 0, lty ="dashed")
## heteroscedastic errors (middle)
plot(fitted(m2), e2, pch = 16, las = 1,
cex.lab = 1.5, xlab = "Fitted values", ylab = "",
main = "Heteroscedastic")
abline(h = 0, lty ="dashed")
## nonlinear errors (right)
plot(fitted(m3), e3, pch = 16, las = 1,
cex.lab = 1.5, xlab = "Fitted values", ylab = "",
main = "Nonlinear")
abline(h = 0, lty ="dashed")
We can formally test the assumption of homogeneous variance via Levene’s Test, which is based on the absolute deviations of the observations in each of \(k\) groups from their group mean, such that
\[ Z_{ij} = \left| y_{ij} - \bar{y}_j \right|. \]
Levene’s test is then a one-way ANOVA of these absolute deviations. The statistic for Levene’s Test is
\[ W=\frac{(N-k)}{(k-1)} \cdot \frac{\sum_{j=1}^{k} n_{j} \left( \bar{Z}_{j} - \bar{Z} \right)^{2} }{ \sum_{j=1}^{k} \sum_{i=1}^{n_{j} } \left( Z_{i j} - \bar{Z}_{j} \right)^{2} } \] where \(\bar{Z}_j\) is the mean of the \(Z_{ij}\) in group \(j\), \(\bar{Z}\) is the overall mean of all the \(Z_{ij}\), \(n_j\) is the number of observations in group \(j\), and \(N = \sum_{j=1}^{k} n_j\) is the total sample size.
The test statistic \(W\) is approximately \(F\)-distributed with \(k-1\) and \(N-k\) degrees of freedom. A simple two-group version of this idea is easy to compute in R: split the residuals into two groups at the median fitted value and compare the group variances with var.test(). Here is that comparison for the “well behaved” residuals from model m1 above.
## split residuals (e1) into 2 groups
yh <- fitted(m1)
g1 <- e1[yh <= median(yh)]
g2 <- e1[yh > median(yh)]
## compare the variances of the two groups
var.test(g1, g2)
##
## F test to compare two variances
##
## data: g1 and g2
## F = 2.1135, num df = 19, denom df = 19, p-value = 0.1115
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.8365508 5.3396652
## sample estimates:
## ratio of variances
## 2.113504
You can see there is no justification for rejecting \(H_0: \text{Var}(\text{group}_1) = \text{Var}(\text{group}_2)\) based on the residuals from the well-behaved model.
Now let’s repeat the test for a set of residuals with clear heteroscedasticity, as in the e2 set above.
## split residuals (e2) into 2 groups
g1 <- e2[yh <= median(yh)]
g2 <- e2[yh > median(yh)]
## compare the variances of the two groups
var.test(g1, g2)
##
## F test to compare two variances
##
## data: g1 and g2
## F = 1.656, num df = 19, denom df = 19, p-value = 0.2804
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6554471 4.1836891
## sample estimates:
## ratio of variances
## 1.655955
Here, too, the p-value is greater than 0.05, so we would fail to reject \(H_0\) even though the residual plot shows obvious heteroscedasticity; this simple two-group variance comparison has limited power to detect a variance that grows smoothly with the fitted values.
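Note that var.test() is a two-group variance-ratio (\(F\)) test rather than Levene’s test as defined above. As a sketch, here is the Levene statistic computed directly as a one-way ANOVA on the absolute deviations \(Z_{ij}\), splitting the residuals at the median fitted value; the leveneTest() function in the car package (assuming it is installed) performs the same calculation.
## Levene's test "by hand": one-way ANOVA on |e_ij - mean(e_j)|
grp <- factor(ifelse(fitted(m2) <= median(fitted(m2)), "lower", "upper"))
Z <- abs(e2 - ave(e2, grp))
anova(lm(Z ~ grp))
## equivalently, with the car package:
## car::leveneTest(e2, grp, center = mean)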
We also need a method for assessing whether our residuals are indeed normally distributed. The easiest way is via a so-called \(Q\)-\(Q\) plot (for quantile-quantile). As we saw in lecture, here are some quantiles for a “standard” normal distribution (i.e., the mean is 0 and the variance is 1).
## set plot area
par(mai = c(1,1,0.1,0.1),
omi = c(0.5, 0, 0.5, 0),
cex = 1.2)
## plot Gaussian pdf
curve(dnorm, -4, 4, las = 1, bty = "n", lwd = 2,
ylab = "Density", xlab = expression(epsilon))
abline(v = qnorm(c(0.5)), lty = "dashed")
abline(v = qnorm(c(0.25, 0.75)), lty = "dashed", col = "purple")
abline(v = qnorm(c(0.1, 0.9)), lty = "dashed", col = "blue")
abline(v = qnorm(c(0.025, 0.975)), lty = "dashed", col = "red")
In contrast, here are the quantiles for a heavy-tailed (leptokurtic) distribution.
## set plot area
par(mai = c(1,1,0.1,0.1),
omi = c(0.5, 0, 0.5, 0),
cex = 1.2)
## plot Gaussian pdf
curve(dnorm, -4, 4, las = 1, bty = "n", lwd = 2, col = "gray",
ylab = "Density", xlab = expression(epsilon))
## plot Cauchy
curve(dcauchy(x, 0, 0.8), -4, 4, las = 1, bty = "n", lwd = 2, add = TRUE,
ylab = "Density", xlab = expression(epsilon))
abline(v = qcauchy(c(0.5)), lty = "dashed")
abline(v = qcauchy(c(0.25, 0.75)), lty = "dashed", col = "purple")
abline(v = qcauchy(c(0.1, 0.9)), lty = "dashed", col = "blue")
abline(v = qcauchy(c(0.025, 0.975)), lty = "dashed", col = "red")
And here they are for a short-tailed (platykurtic) distribution. Notice that I made up a pdf based on a Butterworth function.
## set plot area
par(mai = c(1,1,0.1,0.1),
omi = c(0.5, 0, 0.5, 0),
cex = 1.2)
## Butterworth fx
butter <- function(x, c = 1, n = 4) {
0.4 / (1 + (x / c)^n)
}
ii <- seq(-40,40)/10
ww <- round(butter(ii, 1, 6)*1e4, 0)
vv <- NULL
for(i in 1:length(ww)) {
tmp <- rep(ii[i], ww[i])
vv <- c(vv, tmp)
}
qb <- quantile(vv, c(2.5, 10, 25, 50, 75, 90, 97.5)/100)
## plot Gaussian pdf
curve(dnorm, -4, 4, las = 1, bty = "n", lwd = 2, col = "gray",
ylab = "Density", xlab = expression(epsilon))
## plot Butterworth
curve(butter(x, 1, 4), -4, 4, las = 1, bty = "n", lwd = 2, add = TRUE,
ylab = "Density", xlab = expression(epsilon))
abline(v = qb[4], lty = "dashed")
abline(v = qb[c(3,5)], lty = "dashed", col = "purple")
abline(v = qb[c(2,6)], lty = "dashed", col = "blue")
abline(v = qb[c(1,7)], lty = "dashed", col = "red")
Here is a comparison of the \(Q\)-\(Q\) plots for these three examples via qqnorm().
## set plot area
par(mfrow = c(1,3),
mai = c(0.9,0.5,0.5,0.2),
omi = c(0, 0.4, 0, 0),
cex = 1)
## Q-Q plots
## normal
z1 <- rnorm(nn)
qqnorm(z1, pch =16, main = "Normal", xpd = NA)
qqline(z1)
## leptokurtic
z2 <- rcauchy(nn)
qqnorm(z2, pch =16, main = "Heavy-tailed")
qqline(z2)
## platykurtic
ii <- sample(seq(-40,40)/10, nn)
z3 <- butter(ii, nn) * ii
qqnorm(z3, pch =16, main = "Light-tailed")
qqline(z3)
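As a complement to these visual checks, one could also apply a formal test of normality such as the Shapiro–Wilk test via shapiro.test() in base R; a quick sketch using the samples generated above (the exact p-values will vary with the random draws) is
## Shapiro-Wilk tests of normality
shapiro.test(z1)   ## normal sample: typically a large p-value
shapiro.test(z2)   ## Cauchy sample: typically a very small p-value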
It is often the case that one or more data points do not fit our model well; we refer to these as outliers. Some outliers affect the fit of the model and we refer to these as influential observations. Leverage points are extreme in the predictor \((X)\) space and may or may not affect model fit.
Here are some examples of unusual observations.
## create data
xr <- round(1:10 + rnorm(10, 0, 0.2), 1)
simdata <- data.frame(x = xr,
y = xr + rnorm(10))
mm <- lm(y ~ x, simdata)
## no leverage or influence
p1 <- c(5.5,12)
m1 <- lm(y ~ x, rbind(simdata, p1))
## high leverage but no influence
p2 <- c(17,17)
m2 <- lm(y ~ x, rbind(simdata, p2))
## high leverage and influence
p3 <- c(17,5.1)
m3 <- lm(y ~ x, rbind(simdata, p3))
## set plot area
par(mfrow = c(1,3),
mai = c(0.9,0.4,0.5,0.1),
omi = c(0.5, 0.5, 0.5, 0),
cex = 1)
## plot examples
## no leverage or influence (left)
plot(y ~ x, rbind(simdata, p1), pch = 16, las = 1, xpd = NA,
cex.lab = 1.5, xlab = expression(italic(x)), ylab = expression(italic(y)),
main = "No leverage or influence", cex.main = 1)
points(5.5, 12, pch = 16, cex = 1.5, col ="red")
abline(mm)
abline(m1, lty=2, col ="red")
## high leverage but no influence (middle)
plot(y ~ x, rbind(simdata, p2), pch = 16, las = 1,
cex.lab = 1.5, xlab = expression(italic(x)), ylab = expression(italic(y)),
main = "Leverage but no influence", cex.main = 1)
points(17, 17, pch = 16, cex = 1.5, col ="red")
abline(mm)
abline(m2, lty=2, col ="red")
## high leverage and influence (right)
plot(y ~ x, rbind(simdata, p3), pch = 16, las = 1,
cex.lab = 1.5, xlab = expression(italic(x)), ylab = expression(italic(y)),
main = "Leverage and influence", cex.main = 1)
points(17, 5.1, pch = 16, cex = 1.5, col ="red")
abline(mm)
abline(m3, lty=2, col ="red")
Recall that we can use the “hat matrix” \((\mathbf{H})\) to project the \(n\)-dimensional data \(\mathbf{y}\) onto the \(k\)-dimensional model. As a reminder, for our linear model
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \\ \boldsymbol{\epsilon} \sim \text{MVN}(\mathbf{0}, \sigma^2 \mathbf{I}) \]
we have
\[ \begin{align} \hat{\mathbf{y}} &= \mathbf{X} \hat{\boldsymbol{\beta}} \\ &= \mathbf{X} \left( (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \right) \\ &= \underbrace{\mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}}_{\mathbf{H}} \mathbf{y} \\ &= \mathbf{H} \mathbf{y} \end{align} \]
The values along the diagonal of \(\mathbf{H}\) are known as the leverages (where \(\mathbf{H}_{ii} = h_i\)). These leverage values give an indication of the influence of a particular \(x_{ij}\) on the model fit. The residuals are thus
\[ \begin{aligned} \hat{\boldsymbol{\epsilon}} &= \mathbf{e} \\ &= \mathbf{y} - \hat{\mathbf{y}} \\ &= \mathbf{y} - \mathbf{H} \mathbf{y} \\ &= (\mathbf{I} - \mathbf{H}) \mathbf{y}. \end{aligned} \]
From this it follows that
\[ \begin{aligned} \text{Var}(\hat{\boldsymbol{\epsilon}}) &= \text{Var}(\mathbf{e}) \\ &= \text{Var} \left( (\mathbf{I} - \mathbf{H}) \mathbf{y} \right) \\ &= (\mathbf{I} - \mathbf{H}) \text{Var}(\mathbf{y}) (\mathbf{I} - \mathbf{H})^{\top} \\ &= \sigma^2 (\mathbf{I} - \mathbf{H})^2 \\ &= \sigma^2 (\mathbf{I} - \mathbf{H}), \end{aligned} \]
where the last two steps use the fact that \(\mathbf{I} - \mathbf{H}\) is symmetric and idempotent,
and therefore the variance of an individual residual is
\[ \text{Var}(\hat{\epsilon}_i) = \sigma^2 (1 - h_i). \]
From this relationship we can see that large values of \(h_i\) lead to small variances of \(\hat{\epsilon}_i\), and hence \(\hat{y}_i\) tends toward \(y_i\). The hat matrix \(\mathbf{H}\) has dimensions \(n \times n\) and
\[ \text{trace}(\mathbf{H}) = \sum_{i = 1}^n h_i = k \]
Thus, the average leverage value is \(\bar{h} = \frac{k}{n}\), and any \(h_i > 2 \frac{k}{n}\) deserves closer inspection.
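As a quick numerical check of these relationships, here is a sketch that builds \(\mathbf{H}\) directly from the model matrix of m2 (the “leverage but no influence” model fit above) and verifies that its trace equals the number of parameters and that its diagonal matches the leverages returned by hatvalues().
## build the hat matrix directly for model m2
X <- model.matrix(m2)
H <- X %*% solve(t(X) %*% X) %*% t(X)
## trace(H) equals the number of parameters (k = 2 here)
sum(diag(H))
## the diagonal elements are the leverages
all.equal(unname(diag(H)), unname(hatvalues(m2)))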
We can easily compute the \(h_i\) in R via the hatvalues() function, as shown here.
## leverages of points in middle plot above
hv <- hatvalues(m2)
## trace(H) = number of parameters in the model
k <- sum(hv)
## threshold of 2(k/n) for the leverages (~0.36)
Eh <- 2 * (k / length(hv))
## are any h_i > Eh?
hv > Eh
## 1 2 3 4 5 6 7 8 9 10 11
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
We can also plot the individual leverage values to see whether any of them exceed this threshold, or identify high-leverage points via a half-normal plot.
## revised `halfnorm()` from Faraway
## `nlab` gives the number of points to label in the plot
## `labels` can be a character vector the same length as `x`
## that can be used to label the points in the plot
## `ylab` can be used to label the y-axis
halfnorm <- function(x, nlab = 1, labels = NULL, ylab = "Sorted data") {
x <- abs(x)
labord <- order(x)
x <- sort(x)
i <- order(x)
n <- length(x)
ui <- qnorm((n + 1:n)/(2 * n + 1))
if(is.null(labels)) {
labels <- as.character(1:length(x))
}
plot(ui, x[i], pch = 16, las = 1,
xlab = "Half-normal quantiles", ylab = ylab,
ylim = c(0, max(x)), type = "n")
if(nlab < n) {
points(ui[1:(n - nlab)], x[i][1:(n - nlab)], pch = 16)
}
text(x = ui[(n - nlab + 1):n], y = x[i][(n - nlab + 1):n],
labels = labels[labord][(n - nlab + 1):n])
}
## set plot area
par(mfrow = c(1,2),
mai = c(0.9,0.9,0.1,0.3),
omi = c(0.5, 0.4, 0.5, 0),
cex = 1)
## plot of individual leverages (left)
plot(model.matrix(m2)[,2], hv, pch = 16, las = 1,
ylab = "Leverage", xlab = expression(italic(x)))
mtext(expression(italic(h)^{"*"}), 4, line = 0.3, cex = 1.1, at = Eh, las = 1)
abline(h = Eh, lty = "dashed")
## half-normal plot of the leverages (right)
halfnorm(hv)
We saw in lecture that we can standardize the model residuals to help identify outliers. Specifically, we can use the leverages to do so, such that the studentized residual \(t_i\) is given by
\[ t_i = \frac{e_i}{ \hat{\sigma}_{(i)} \sqrt{ 1 - h_i } } = r_i \sqrt{ \frac{n - k - 1}{n - k - r_i^2} } \]
where \(e_i\) is the residual for the \(i\)th case based on a model that includes all of the data, \(\hat{\sigma}_{(i)}\) is the residual standard error from a model fit with case \(i\) excluded, and \(r_i = e_i / \left( \hat{\sigma} \sqrt{1 - h_i} \right)\) is the standardized residual. This \(t_i\) statistic is distributed as a \(t\)-distribution with \(n - k - 1\) degrees of freedom.
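As a sanity check, this identity can be verified numerically; for example, using model m1 from the unusual-observation examples above, the output of rstudent() should match the standardized residuals from rstandard() transformed as shown.
## verify t_i = r_i * sqrt((n - k - 1) / (n - k - r_i^2))
r_i <- rstandard(m1)
n <- nrow(model.matrix(m1))
k <- length(coef(m1))
all.equal(unname(rstudent(m1)),
          unname(r_i * sqrt((n - k - 1) / (n - k - r_i^2))))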
Note, however, that this requires us to undertake \(n\) different null hypothesis tests. Thus, if we choose a Type-I error rate of \(\alpha\) = 0.05, we should expect about 1 in 20 of these tests to be significant by chance alone. To account for all of these tests, we can employ a Bonferroni correction, which instead sets the threshold at \(\alpha_B = \alpha / n\). This correction comes up elsewhere in statistical testing and is known to be conservative; it tends to flag fewer outliers than the nominal Type-I error rate would suggest.
Computing the studentized residuals in R is easy with the rstudent() function. We can then compare them to the critical \(t\) value from qt(). Let’s do so for the first example above (left-hand plot).
## get sample size
n_m1 <- nrow(model.matrix(m1))
## get studentized e
t_stud <- rstudent(m1)
## Bonferroni alpha
alpha <- 0.05 / n_m1
## critical t value
df <- n_m1 - length(coef(m1)) - 1
t_crit <- qt(1 - alpha/2, df)
## compare |t_stud| to t_crit (two-sided test)
abs(t_stud) > t_crit
## 1 2 3 4 5 6 7 8 9 10 11
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
We can see that the 11th datum is an outlier (as we suspected from the above plot).
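As an aside, the outlierTest() function in the car package (assuming it is installed) wraps this same Bonferroni-adjusted calculation into a single call and should flag the same observation.
## Bonferroni-adjusted test of the largest |studentized residual|
car::outlierTest(m1)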
Influential observations may or may not be outliers or have high leverage, but we want to identify them anyway. We saw in lecture that Cook’s Distance \((D)\) is a popular choice, where
\[ D_i = r_i^2 \frac{1}{k} \left( \frac{h_i}{1 - h_i} \right), \] where \(r_i\) is the standardized residual defined above.
The idea here is that \(D_i\) combines the magnitude of a residual with its leverage. It’s easy to calculate the \(D_i\) with the cooks.distance() function. We can then visually inspect \(D\) with a half-normal plot (we’ll label the two largest values by setting nlab = 2).
## Cook's D
cook <- cooks.distance(m2)
## half-normal plot
par(mai = c(0.9,0.9,0.1,0.1),
omi = c(0, 0, 0, 0),
cex = 1)
halfnorm(cook, nlab = 2, ylab = "Cook’s Distance")
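Base R also provides some built-in influence diagnostics that one could use alongside the half-normal plot; for example:
## index plot of Cook's distances
plot(m2, which = 4)
## or a summary table of several influence measures at once
summary(influence.measures(m2))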
The models we have been using assume that the errors are independent and identically distributed (IID). Let’s explore some of the options for dealing with situations where those assumptions may be violated.
In cases where the errors are independent, but not identically distributed, we can use weighted least squares. Let’s consider a data set from the famous statistician Francis Galton, which contains the sizes of pea seeds in parent plants and in their offspring plants, along with the frequency of each of the paired measurements. The data are contained in the accompanying file galton_peas.csv.
We begin by reading in the data and plotting them.
## read data file
peas <- read.csv("galton_peas.csv")
## set plot area
par(mai = c(1,1,0.1,0.1),
omi = c(0.5, 0, 0.5, 0),
cex = 1)
## plot the data
plot(peas$parent, peas$offspring, pch = 16, las = 1,
xlab = "Size of parent seed (mm)", ylab = "Size of offspring seed (mm)")
Each of the \(y_i\) here is actually a weighted mean of the offspring pea size in each of the parent size groups, so we should use a weighted regression with \(w_i = 1 / n_i\). This is easy to do with lm() by passing an additional weights argument. We’ll also fit a regular unweighted model for comparison.
## fit weighted regression
mpw <- lm(offspring ~ parent, peas, weights = 1/n)
faraway::sumary(mpw)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.481595 2.817401 3.7203 0.004768
## parent 0.852906 0.040052 21.2951 5.217e-09
##
## n = 11, p = 2, Residual SE = 0.10594, R-Squared = 0.98
## compare to unweighted regression
mp <- lm(offspring ~ parent, peas)
faraway::sumary(mp)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.042402 3.822194 4.1972 0.002316
## parent 0.765672 0.055744 13.7354 2.418e-07
##
## n = 11, p = 2, Residual SE = 0.55883, R-Squared = 0.95
Notice that our weighted regression has a much smaller residual standard error (and hence \(MSE = \hat{\sigma}^2\)), and that the parameter estimates are different as well.
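To see what the weights argument is doing, here is a sketch of the weighted least squares estimator computed by hand, \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{W} \mathbf{y}\), with \(\mathbf{W}\) a diagonal matrix of the same weights passed to lm() above; it should reproduce coef(mpw).
## WLS estimator by hand with W = diag(1/n)
X <- model.matrix(mpw)
W <- diag(1 / peas$n)
solve(t(X) %*% W %*% X, t(X) %*% W %*% peas$offspring)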
We saw in lecture that we can use robust regression in cases where the errors follow a non-normal distribution. Recall that the idea is to replace the squared function in our estimate of \(SSE\) with some other function.
One of the possibilities is Huber’s method, where
\[ SSE = \sum^n_{i = 1} f(e_i) \\ ~ \\ f(z) = \left\{ \begin{matrix} \frac{z^2}{2} & \text{if} ~ \left| z \right| \leq c \\ c \left| z \right| - \frac{c^2}{2} & \text{otherwise} \end{matrix} \right. \]
and the tuning constant \(c\) is proportional to a robust estimate of the error scale, \(\hat{\sigma} \propto \text{Median}(\left| \hat{\epsilon}_i \right|)\).
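For illustration, here is a minimal sketch of the Huber loss \(f(z)\) written as an R function; 1.345 is a commonly used value for \(c\) (and the default tuning constant for Huber weighting in MASS::rlm(), which we use below).
## Huber loss: quadratic near zero, linear in the tails
huber_loss <- function(z, c = 1.345) {
  ifelse(abs(z) <= c, z^2 / 2, c * abs(z) - c^2 / 2)
}
## compare to the squared-error loss for a large residual
huber_loss(5)
5^2 / 2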
As an example, let’s return to the plant data from the Galapagos Archipelago that we used last week in lab. To begin, we’ll fit a normal regression model and plot the residuals against the fitted values.
## get the data
library(faraway)
head(gala)
## Species Endemics Area Elevation Nearest Scruz Adjacent
## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84
## fit normal regr model
gm <- lm(Species ~ Area, gala)
## examine fit
sumary(gm)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.782861 17.524416 3.6397 0.0010943
## Area 0.081963 0.019713 4.1578 0.0002748
##
## n = 30, p = 2, Residual SE = 91.73159, R-Squared = 0.38
## set plot area
par(mai = c(1,1,0.1,0.1),
omi = c(0.5, 0, 0.5, 0),
cex = 1)
## plot residuals vs y_hat
plot(fitted(gm), resid(gm), pch = 16, las = 1,
ylab = expression(italic(e)), xlab = expression(hat(italic(y))))
This plot reveals some obvious problems with our model assumptions, so now let’s fit our robust model. To do so, we can use the rlm() function in the MASS package, for which the default is Huber’s method.
## fit robust regr model
rgm <- MASS::rlm(Species ~ Area, gala)
## examine fit
summary(rgm)
##
## Call: rlm(formula = Species ~ Area, data = gala)
## Residuals:
## Min 1Q Median 3Q Max
## -42.92 -32.87 -10.07 22.40 334.29
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 44.9056 8.8575 5.0698
## Area 0.0717 0.0100 7.1965
##
## Residual standard error: 48.67 on 28 degrees of freedom
Here we see that the parameter estimates have changed somewhat and the estimate of the residual standard error has decreased substantially. Let’s go a step further and find out which of the islands had high influence and hence low weighting.
## extract weights from model fit
wts <- round(rgm$w, 2)
names(wts) <- row.names(gala)
## sort and inspect them
sort(wts)
## SantaCruz SantaMaria SanCristobal SanSalvador Baltra
## 0.20 0.29 0.33 0.43 1.00
## Bartolome Caldwell Champion Coamano Daphne.Major
## 1.00 1.00 1.00 1.00 1.00
## Daphne.Minor Darwin Eden Enderby Espanola
## 1.00 1.00 1.00 1.00 1.00
## Fernandina Gardner1 Gardner2 Genovesa Isabela
## 1.00 1.00 1.00 1.00 1.00
## Marchena Onslow Pinta Pinzon Las.Plazas
## 1.00 1.00 1.00 1.00 1.00
## Rabida SantaFe Seymour Tortuga Wolf
## 1.00 1.00 1.00 1.00 1.00
Four of the islands have been discounted in the estimation procedure.
M-estimation will fail if the large errors are numerous and extreme in value. If this is the case, then we can use least trimmed squares (LTS) as a resistant regression method that deals well with this situation. LTS minimizes the sum of squares of the \(q\) smallest residuals, such that
\[ SSE = \sum^n_{i = 1} e^2_{i} ~ \rightarrow ~ SSE_q = \sum^q_{i = 1} e^2_{(i)} \]
and \((i)\) indicates the residuals have been sorted in ascending order. The default is \(q = \lfloor n/2 \rfloor + \lfloor (k + 1) / 2 \rfloor\), where \(\lfloor \cdot \rfloor\) is the floor function, which rounds a number down to the nearest integer (type ?floor for help on this function in R).
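As a quick check of that formula for the Galapagos model we fit below (an intercept plus one predictor, so \(k = 2\)):
## default number of residuals retained by LTS for the gala data
n <- nrow(gala)                      ## 30 islands
k <- 2                               ## intercept + Area
floor(n / 2) + floor((k + 1) / 2)    ## q = 16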
We can fit an LTS model with the ltsreg() function in the MASS package. Let’s do so for the Galapagos plant data and examine the estimated parameters.
## fit LTS model
ltm <- MASS::ltsreg(Species ~ Area, gala)
## examine fit
coef(ltm)
## (Intercept) Area
## 8.400631 1.616162
This method does not automatically report the uncertainty in the parameter estimates, but we can use our bootstrapping method to estimate a confidence interval around them.
## number of bootstrap samples
nb <- 1000
## empty matrix for beta estimates
beta_est <- matrix(NA, nb, 2)
## residuals from our model
resids <- residuals(ltm)
## sample many times
for(i in 1:nb) {
## sample w/ replacement from e
e_star <- sample(resids, replace = TRUE)
## calculate y_star
y_star <- predict(ltm) + e_star
## re-estimate model
beta_star <- MASS::ltsreg(y_star ~ Area, gala)
## save estimated betas
beta_est[i,] <- coef(beta_star)
}
## extract 2.5% and 97.5% values (with the median)
CI95 <- apply(beta_est, 2, quantile, c(0.025, 0.5, 0.975))
colnames(CI95) <- c("Intercept", "Area")
t(round(CI95, 3))
## 2.5% 50% 97.5%
## Intercept 4.189 8.306 19.143
## Area 1.597 1.617 1.700
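One could also inspect the full bootstrap distribution of a coefficient; for example, here is a sketch of a histogram of the bootstrapped Area coefficients with the lower and upper 95% limits overlaid.
## histogram of the bootstrapped estimates for the effect of Area
hist(beta_est[, 2], breaks = 30, las = 1, main = "",
     xlab = "Bootstrapped estimate of the Area coefficient")
## add the 2.5% and 97.5% limits
abline(v = CI95[c(1, 3), 2], lty = "dashed")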