--- title: "Fitting Generalized Additive Models" subtitle: "Analysis of Ecological and Environmental Data
QERM 514" author: "Mark Scheuerell" date: "3 June 2026" output: ioslides_presentation: wide: true css: ../lecture_slides.css --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE) library(mgcv) library(ggplot2) library(gratia) ``` ## Acknowledgments Gavin Simpson and colleagues have produced a lot of useful teaching material on GAMs that I've made use of here. For example: [Gavin's blog](https://fromthebottomoftheheap.net) [Webinar 1](https://www.youtube.com/live/sgw4cu8hrZM?si=pDEYp2OdKB7guJ8T&t=285) [Webinar 2](https://youtu.be/Ukfvd8akfco?si=g17GQv2ExA1KA0oO&t=8) ## Goals for today * Understand some diagnostics * Understand more about model selection techniques with GAMs * Become familiar with some of the flexibility of GAMs ## Last time A generalized additive model is an extension of the GLM with a predictor that is a sum of smoothed (and possibly unsmoothed) terms: $$ y_i = \beta_0 + \sum_j s_j(x_{j,i}) + \sum_k \beta_k x_{k,i} + \epsilon_i $$
And with splines, our smoothed terms are a sum of weighted basis functions: $$ s(x)=\sum_k \beta_k b_k(x) $$ ## Choosing a model Ask yourself: * data/error distribution? * predictors? * type of smoother(s)?
Check candidate models via diagnostics * Effect of choice of $k$ (number of knots)? ## An example Chicago air quality and death rate ([Peng and Welty (2004)](https://journal.r-project.org/articles/RN-2004-011/RN-2004-011.pdf); `{NMMAPSdata}`) `death`: total deaths per day `pm25median`: median particles < 2.5 mg per cubic m `o3median`: ozone in parts per billion `tmpd`: temperature in Fahrenheit ```{r chicago_air, echo = TRUE} ## get data data(chicago, package = "gamair") ## fit gam fit_gam_air <- mgcv::gam(death ~ s(pm25median) + s(o3median) + s(tmpd), family = "poisson", data = chicago, method = "REML") ``` ## An example Smoothed Terms ```{r summary_chicago_air, echo = FALSE} summary(fit_gam_air)$s.table ``` ## GAM interpretation * Effective degrees of freedom (`edf`) - degree of non-linearity, but with upper bound set by $k$ * Reference degrees of freedom (`Ref.df`) - used for calculating test statistic * `p-value` - null hypothesis is a linear effect (this can include a flat linear effect, i.e., no effect) * Biased low because they don't account for need to estimate $\lambda$ * Works well asymptotically * Doesn't work for unpenalized smooths (random effects) ## Model checking | Effect of $k$ on edf (for median particles <2.5 mg per $m^3$) ```{r plot_k_edf, echo = FALSE, fig.align = "center", fig.width = 8, fig.height = 4.5} ## empty matrix for results record <- matrix(NA, nrow = 20, ncol = 3) ## lop over values of k for(k in 1:20){ fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = k + 3) + s(o3median) + s(tmpd), family = "poisson", data = chicago, method = "REML") record[k,1] <- k + 3 record[k,2] <- summary(fit_gam_air)$s.table[1, 1] } par(mai = c(0.9, 0.9, 0.1, 0.1), omi = c(0, 0, 0, 0), cex.lab = 1.4, cex.axis = 1.2) plot(record[,1], record[,2], las = 1, pch = 16, col = "dodgerblue", cex = 1.5, xlab = "k for pm2.5", ylab = "Effective degrees of freedom") ``` ## Model checking * Check whether $k$ is adequate - can set it very large to start, but this increases computation time * Do some iterative checking with larger values of $k$ for multiple parameters, effective degrees of freedom should stabilize ```{r ex_gam_check, echo = TRUE, eval = FALSE} ## check model fit gam.check(fit_gam_air) ``` ## Model checking | $k$ set to default for each variable {.smaller} ```{r check_default_k, echo = FALSE, results = TRUE, fig.show = "hide"} ## fit with default k's fit_gam_air <- mgcv::gam(death ~ s(pm25median) + s(o3median) + s(tmpd), family = "poisson", data = chicago, method = "REML") ## check p-values gam.check(fit_gam_air) ``` ## Model checking | Again but with $k = 20$ for all $x$ ```{r check_k_20, echo = TRUE, results = FALSE, fig.show = "hide"} ## fit with k = 20 for all fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = 20) + s(o3median, k = 20) + s(tmpd, k = 20), family = "poisson", data = chicago, method = "REML") ## check p-values gam.check(fit_gam_air) ``` ## Model checking | $p$-value above $\alpha$ and edf is not close to $k$ {.smaller} ```{r, echo = FALSE, results = TRUE, fig.show = "hide"} ## fit with k = 20 for all fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = 20) + s(o3median, k = 20) + s(tmpd, k = 20), family = "poisson", data = chicago, method = "REML") ## check p-values gam.check(fit_gam_air, rep=500) ``` ## Other model diagnostics | `gam.check()` provides some basic diagnostic plots ```{r plot_diagnostics, echo = FALSE, results = TRUE, fig.show = "show", fig.align = "center", fig.width = 8, fig.height = 4.5} par(omi = c(0, 0, 0, 0)) ## diagnostic plots gam.check(fit_gam_air) ``` ## Model selection * When we select $\lambda$ we are only penalizing wiggliness > * However, this only penalizes functions of $X$ with a second derivative > 0 > * Thus, linear relationships can't be removed from the model this way ## Model selection Use `select = TRUE` to apply penalties to the linear terms as well ```{r ex_select_TRUE, echo = TRUE, eval = FALSE} ## apply penalties to linear terms fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = 20) + s(o3median, k = 20) + s(tmpd, k = 20), ### select = TRUE, ### family = "poisson", data = chicago, method = "REML") ``` ## Model selection Check to see if edf < 1 ```{r check_edf, echo = TRUE} ## check edf values summary(fit_gam_air)$s.table ``` ## Model selection | Use `gratia::draw()` to create partial effect plots for the smooths ```{r plot_partial_effects, echo = FALSE, fig.align = "center", fig.width = 8, fig.height = 4.5} ## partial effect plots gratia::draw(fit_gam_air, scales = "fixed") ``` ## Flexibility of GAMs We have _lots_ of options when fitting GAMs so we have to give very careful thought to our goal(s) ## Flexibility of GAMs We have _lots_ of options when fitting GAMs so we have to give very careful thought to our goal(s)
Just because we _can_ fit a GAM doesn't mean it will be interpretable or useful! ## Another example | Atmospheric CO₂ concentrations at the South Pole (1957 to 1999) ```{r ex_data_CO2, echo = FALSE, fig.align = "center", fig.width = 8, fig.height = 4.5} ## get CO2 data data(co2s, package ="gamair") par(mai = c(0.9, 0.9, 0.1, 0.1), omi = c(0, 0, 0, 0), cex.lab = 1.4, cex.axis = 1.2) plot(co2s$c.month, co2s$co2, type = "l", lwd = 3, col = "dodgerblue", ylab = "CO2 concentration", xlab = "Time") ``` ## Candidate model Cubic regression basis because have a lot of wiggliness in the data Let's make $k$ ***really*** large then ```{r fit_co2_cubic, echo = TRUE, results = FALSE} ## fit GAM with cubic regression basis fit_co2_cubic <- gam(co2 ~ s(c.month, k = 300, bs = "cr"), method = 'REML', data = co2s) ``` ## Model summary {.smaller} ```{r summary_co2_cubic} summary(fit_co2_cubic) ``` ## Predicting the future | Not a great model ```{r plot_prediction, fig.align = "center", fig.width = 8, fig.height = 4.5} pd <- with(co2s, data.frame(c.month = 1:(nrow(co2s)+36))) pd <- cbind(pd, predict(fit_co2_cubic, pd, se = TRUE)) pd <- transform(pd, upr = fit + (2*se.fit), lwr = fit - (2 * se.fit)) par(cex.lab = 1.4, cex.axis = 1.2) p1 <- ggplot(pd, aes(x = c.month, y = fit)) + geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) + geom_line(data = co2s, aes(c.month, co2), linewidth = 1.5, col = 'dodgerblue') + geom_line(alpha = 0.5) + labs(x = "\nTime", y = "CO2 concentration\n") p1 + theme_classic(base_size = 18) ``` ## Another candidate model Let's now distinguish within-year variation from the long-term trend * cyclic spline on `month` (1:12) * cubic regression spline on `c.month` (1:507) ```{r fit_cyclic_cubic, echo = TRUE, results = TRUE} ## fit GAM with b2 <- gam(co2 ~ s(month, bs = "cc") + s(c.month, bs = "cr", k = 300), ### knots = list(month = c(0.5, 12.5)), ### method = 'REML', data = co2s) ``` ## Cyclic & cubic splines {.smaller} ```{r co2_summary} summary(b2) ``` ## Cyclic & cubic splines | Much more reasonable predictions ```{r plot_both_splines, fig.align = "center", fig.width = 8, fig.height = 4.5} nr <- nrow(co2s) pd2 <- with(co2s, data.frame(c.month = 1:(nr+36), month = rep(1:12, length.out = nr + 36))) pd2 <- cbind(pd2, predict(b2, pd2, se = TRUE)) pd2 <- transform(pd2, upr = fit + (2*se.fit), lwr = fit - (2 * se.fit)) par(cex.lab = 1.4, cex.axis = 1.2) p1 <- ggplot(pd2, aes(x = c.month, y = fit)) + geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) + geom_line(data = co2s, aes(c.month, co2), linewidth = 1.5, col = 'dodgerblue') + geom_line(alpha = 0.5) + labs(x = "\nTime", y = "CO2 concentration\n") p1 + theme_classic(base_size = 18) ``` ## A cautionary note Building GAMs requires more choices than building GLMs
It requires thinking hard about the system and the question ## Options for interactions **main effects & interactions** `te(x1, x2)` * separate smoothness parameter and separate marginal basis ## Options for interactions **main effects & interactions** `te(x1, x2)` * separate smoothness parameter and separate marginal basis **interactions only** `ti(x1, x2)` * when the main effect are specified elsewhere ## Options for interactions **main effects & interactions** `te(x1, x2)` * separate smoothness parameter and separate marginal basis **interactions only** `ti(x1, x2)` * when the main effect are specified elsewhere **bivariate thin plate splines** `s(x1, x2, bs = 'tp')` * single smoothness parameter, only use when units are the same (eg, over space) ## Options for interactions **main effects & interactions** `te(x1, x2)` * separate smoothness parameter and separate marginal basis **interactions only** `ti(x1, x2)` * when the main effect are specified elsewhere **bivariate thin plate splines** `s(x1, x2, bs = 'tp')` * single smoothness parameter, only use when units are the same (eg, over space) **smooth by different levels of a factor** `f + s(x1, by = f)` * be sure to include the factor as a fixed effect * akin to random regression slopes ## Mixed models | GAMMs We can also fit random effects * `mgcv::gam(..., `bs = "re")` * `mgcv::gamm()` * `gamm4::gamm4()` Thus, these models are equivalent: * `lme(y ~ 1 + (1|f), method = "REML")` * `gam(y ~ s(f, bs = "re"), method = "REML")` ## Summary * GAMs using splines offer tremendous flexibility > * There is some tradeoff between flexibility and ease of understanding > * Many of our relationships in ecology aren't linear > * What are some examples?