---
title: "Fitting Generalized Additive Models"
subtitle: "Analysis of Ecological and Environmental Data<br>QERM 514"
author: "Mark Scheuerell"
date: "3 June 2026"
output:
  ioslides_presentation:
    wide: true
    css: ../lecture_slides.css
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

library(mgcv)
library(ggplot2)
library(gratia)
```


## Acknowledgments 

Gavin Simpson and colleagues have produced a lot of useful teaching material on GAMs that I've made use of here. For example: 

[Gavin's blog](https://fromthebottomoftheheap.net)

[Webinar 1](https://www.youtube.com/live/sgw4cu8hrZM?si=pDEYp2OdKB7guJ8T&t=285)

[Webinar 2](https://youtu.be/Ukfvd8akfco?si=g17GQv2ExA1KA0oO&t=8)


## Goals for today

* Understand some diagnostics

* Understand more about model selection techniques with GAMs

* Become familiar with some of the flexibility of GAMs


## Last time 

A generalized additive model is an extension of the GLM with a predictor that is a sum of smoothed (and possibly unsmoothed) terms: 

$$
y_i = \beta_0 + \sum_j s_j(x_{j,i}) + \sum_k \beta_k x_{k,i} + \epsilon_i
$$

<br>

And with splines, our smoothed terms are a sum of weighted basis functions: 

$$
s(x)=\sum_k \beta_k b_k(x)
$$


## Choosing a model 

Ask yourself:

* data/error distribution? 

* predictors? 

* type of smoother(s)?

<br>

Check candidate models via diagnostics 

* Effect of choice of $k$ (number of knots)?


## An example 

Chicago air quality and death rate ([Peng and Welty (2004)](https://journal.r-project.org/articles/RN-2004-011/RN-2004-011.pdf); `{NMMAPSdata}`)

`death`: total deaths per day

`pm25median`: median particles < 2.5 mg per cubic m

`o3median`: ozone in parts per billion

`tmpd`: temperature in Fahrenheit

```{r chicago_air, echo = TRUE}
## get data
data(chicago, package = "gamair")

## fit gam
fit_gam_air <- mgcv::gam(death ~ s(pm25median) + 
                           s(o3median) + 
                           s(tmpd), 
                         family = "poisson", 
                         data = chicago, method = "REML")
```


## An example 

Smoothed Terms 

```{r summary_chicago_air, echo = FALSE}
summary(fit_gam_air)$s.table 
``` 


## GAM interpretation  

* Effective degrees of freedom (`edf`) - degree of non-linearity, but with upper bound set by $k$ 

* Reference degrees of freedom (`Ref.df`) - used for calculating test statistic 

* `p-value` - null hypothesis is a linear effect (this can include a flat linear effect, i.e., no effect) 

  * Biased low because they don't account for need to estimate $\lambda$ 

  * Works well asymptotically 

  * Doesn't work for unpenalized smooths (random effects)


## Model checking | Effect of $k$ on edf (for median particles <2.5 mg per $m^3$)

```{r plot_k_edf, echo = FALSE, fig.align = "center", fig.width = 8, fig.height = 4.5}
## empty matrix for results
record <- matrix(NA, nrow = 20, ncol = 3)

## lop over values of k
for(k in 1:20){
  fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = k + 3) + 
                             s(o3median) + 
                             s(tmpd), 
                           family = "poisson", 
                           data = chicago, method = "REML")
  
record[k,1] <- k + 3
record[k,2] <- summary(fit_gam_air)$s.table[1, 1]
}

par(mai = c(0.9, 0.9, 0.1, 0.1),
    omi = c(0, 0, 0, 0),
    cex.lab = 1.4, cex.axis = 1.2)

plot(record[,1], record[,2],
     las = 1, pch = 16, col = "dodgerblue", cex = 1.5,
     xlab = "k for pm2.5", ylab = "Effective degrees of freedom")
```


## Model checking 

* Check whether $k$ is adequate - can set it very large to start, but this increases computation time 

* Do some iterative checking with larger values of $k$ for multiple parameters, effective degrees of freedom should stabilize   

```{r ex_gam_check, echo = TRUE, eval = FALSE}
## check model fit
gam.check(fit_gam_air) 
``` 


## Model checking | $k$ set to default for each variable {.smaller} 

```{r check_default_k, echo = FALSE, results = TRUE, fig.show = "hide"}
## fit with default k's
fit_gam_air <- mgcv::gam(death ~ s(pm25median) + 
                      s(o3median) + 
                      s(tmpd), 
                    family = "poisson", 
                    data = chicago, method = "REML")

## check p-values
gam.check(fit_gam_air) 
``` 


## Model checking | Again but with $k = 20$ for all $x$

```{r check_k_20, echo = TRUE, results = FALSE, fig.show = "hide"}
## fit with k = 20 for all
fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = 20) + 
                           s(o3median, k = 20) + 
                           s(tmpd, k = 20), 
                         family = "poisson", 
                         data = chicago, method = "REML")

## check p-values
gam.check(fit_gam_air) 
```


## Model checking | $p$-value above $\alpha$ and edf is not close to $k$ {.smaller} 

```{r, echo = FALSE, results = TRUE, fig.show = "hide"}
## fit with k = 20 for all
fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = 20) + 
                      s(o3median, k = 20) + 
                      s(tmpd, k = 20), 
                    family = "poisson", 
                    data = chicago, method = "REML")

## check p-values
gam.check(fit_gam_air, rep=500)

```


## Other model diagnostics | `gam.check()` provides some basic diagnostic plots 

```{r plot_diagnostics, echo = FALSE, results = TRUE, fig.show = "show", fig.align = "center", fig.width = 8, fig.height = 4.5}
par(omi = c(0, 0, 0, 0))

## diagnostic plots
gam.check(fit_gam_air) 
``` 


## Model selection 

* When we select $\lambda$ we are only penalizing wiggliness

> * However, this only penalizes functions of $X$ with a second derivative > 0

> * Thus, linear relationships can't be removed from the model this way  


## Model selection 

Use `select = TRUE` to apply penalties to the linear terms as well

```{r ex_select_TRUE, echo = TRUE, eval = FALSE}
## apply penalties to linear terms 
fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = 20) + 
                           s(o3median, k = 20) + 
                           s(tmpd, k = 20), 
                         ### <b>
                         select = TRUE,
                         ### </b>  
                         family = "poisson",
                         data = chicago, method = "REML")
```


## Model selection 

Check to see if edf < 1 

```{r check_edf, echo = TRUE}
## check edf values
summary(fit_gam_air)$s.table
```


## Model selection | Use `gratia::draw()` to create partial effect plots for the smooths

```{r plot_partial_effects, echo = FALSE, fig.align = "center", fig.width = 8, fig.height = 4.5}
## partial effect plots
gratia::draw(fit_gam_air, scales = "fixed")
```


## Flexibility of GAMs 

We have _lots_ of options when fitting GAMs 

so we have to give very careful thought to our goal(s) 


## Flexibility of GAMs 

We have _lots_ of options when fitting GAMs 

so we have to give very careful thought to our goal(s) 

<br>

Just because we _can_ fit a GAM doesn't mean it will be interpretable or useful!


## Another example | Atmospheric CO<sub>2</sub> concentrations at the South Pole (1957 to 1999)

```{r ex_data_CO2, echo = FALSE, fig.align = "center", fig.width = 8, fig.height = 4.5}
## get CO2 data
data(co2s, package ="gamair")

par(mai = c(0.9, 0.9, 0.1, 0.1),
    omi = c(0, 0, 0, 0),
    cex.lab = 1.4, cex.axis = 1.2)

plot(co2s$c.month, co2s$co2,
     type = "l", lwd = 3, col = "dodgerblue",
     ylab = "CO2 concentration", xlab = "Time")
```


## Candidate model 

Cubic regression basis because have a lot of wiggliness in the data

Let's make $k$ ***really*** large then

```{r fit_co2_cubic, echo = TRUE, results = FALSE}
## fit GAM with cubic regression basis
fit_co2_cubic <- gam(co2 ~ s(c.month, k = 300, bs = "cr"),
                     method = 'REML',
                     data = co2s)
``` 


## Model summary {.smaller}  

```{r summary_co2_cubic}
summary(fit_co2_cubic)
```


## Predicting the future | Not a great model

```{r plot_prediction, fig.align = "center", fig.width = 8, fig.height = 4.5}

pd <- with(co2s, data.frame(c.month = 1:(nrow(co2s)+36)))
pd <- cbind(pd, predict(fit_co2_cubic, pd, se = TRUE))
pd <- transform(pd, upr = fit + (2*se.fit), lwr = fit - (2 * se.fit))

par(cex.lab = 1.4, cex.axis = 1.2)

p1 <- ggplot(pd, aes(x = c.month, y = fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) +
  geom_line(data = co2s, aes(c.month, co2),
            linewidth = 1.5, col = 'dodgerblue') +
  geom_line(alpha = 0.5) +
  labs(x = "\nTime", y = "CO2 concentration\n")
p1 + theme_classic(base_size = 18)
```

## Another candidate model  

Let's now distinguish within-year variation from the long-term trend 

* cyclic spline on `month` (1:12) 

* cubic regression spline on `c.month` (1:507)

```{r fit_cyclic_cubic, echo = TRUE, results = TRUE}
## fit GAM with 
b2 <- gam(co2 ~ s(month, bs = "cc") + 
            s(c.month, bs = "cr", k = 300), 
          ### <b>
          knots = list(month = c(0.5, 12.5)),
          ### </b>
          method = 'REML',
          data = co2s)
```


## Cyclic & cubic splines {.smaller}  

```{r co2_summary}
summary(b2)
```


## Cyclic & cubic splines | Much more reasonable predictions

```{r plot_both_splines, fig.align = "center", fig.width = 8, fig.height = 4.5} 

nr <- nrow(co2s)
pd2 <- with(co2s, data.frame(c.month = 1:(nr+36),
                             month = rep(1:12, length.out = nr + 36)))
pd2 <- cbind(pd2, predict(b2, pd2, se = TRUE))
pd2 <- transform(pd2, upr = fit + (2*se.fit), lwr = fit - (2 * se.fit))

par(cex.lab = 1.4, cex.axis = 1.2)

p1 <- ggplot(pd2, aes(x = c.month, y = fit)) +
    geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) +
    geom_line(data = co2s, aes(c.month, co2),
              linewidth = 1.5, col = 'dodgerblue') +
    geom_line(alpha = 0.5) +
  labs(x = "\nTime", y = "CO2 concentration\n")
p1 + theme_classic(base_size = 18)
``` 


## A cautionary note  

Building GAMs requires more choices than building GLMs

<br>

It requires thinking hard about the system and the question 


## Options for interactions 

**main effects & interactions** `te(x1, x2)`

* separate smoothness parameter and separate marginal basis


## Options for interactions 

**main effects & interactions** `te(x1, x2)`

* separate smoothness parameter and separate marginal basis

**interactions only** `ti(x1, x2)`  

* when the main effect are specified elsewhere 


## Options for interactions 

**main effects & interactions** `te(x1, x2)`

* separate smoothness parameter and separate marginal basis

**interactions only** `ti(x1, x2)`  

* when the main effect are specified elsewhere 

**bivariate thin plate splines** `s(x1, x2, bs = 'tp')`

* single smoothness parameter, only use when units are the same (eg, over space)


## Options for interactions 

**main effects & interactions** `te(x1, x2)`

* separate smoothness parameter and separate marginal basis

**interactions only** `ti(x1, x2)`  

* when the main effect are specified elsewhere 

**bivariate thin plate splines** `s(x1, x2, bs = 'tp')`

* single smoothness parameter, only use when units are the same (eg, over space)

**smooth by different levels of a factor** `f + s(x1, by = f)`

* be sure to include the factor as a fixed effect

* akin to random regression slopes


## Mixed models | GAMMs

We can also fit random effects

* `mgcv::gam(..., `bs = "re")`

* `mgcv::gamm()` 

* `gamm4::gamm4()`

Thus, these models are equivalent: 

* `lme(y ~ 1 + (1|f), method = "REML")`

* `gam(y ~ s(f, bs = "re"), method = "REML")`


## Summary 

* GAMs using splines offer tremendous flexibility 

> * There is some tradeoff between flexibility and ease of understanding 

> * Many of our relationships in ecology aren't linear 

> * What are some examples? 

