3 June 2026

Acknowledgments

Goals for today

  • Understand some diagnostics

  • Understand more about model selection techniques with GAMs

  • Become familiar with some of the flexibility of GAMs

Last time

A generalized additive model is an extension of the GLM with a predictor that is a sum of smoothed (and possibly unsmoothed) terms:

\[ y_i = \beta_0 + \sum_j s_j(x_{j,i}) + \sum_k \beta_k x_{k,i} + \epsilon_i \]


And with splines, our smoothed terms are a sum of weighted basis functions:

\[ s(x)=\sum_k \beta_k b_k(x) \]

Choosing a model

Ask yourself:

  • data/error distribution?

  • predictors?

  • type of smoother(s)?


Check candidate models via diagnostics

  • Effect of choice of \(k\) (number of knots)?

An example

Chicago air quality and death rate (Peng and Welty (2004); {NMMAPSdata})

death: total deaths per day

pm25median: median particles < 2.5 mg per cubic m

o3median: ozone in parts per billion

tmpd: temperature in Fahrenheit

## get data
data(chicago, package = "gamair")

## fit gam
fit_gam_air <- mgcv::gam(death ~ s(pm25median) + 
                           s(o3median) + 
                           s(tmpd), 
                         family = "poisson", 
                         data = chicago, method = "REML")

An example

Smoothed Terms

##                    edf   Ref.df     Chi.sq      p-value
## s(pm25median) 3.518573 4.440907  20.860574 0.0006254809
## s(o3median)   3.430259 4.305665   6.373377 0.2230510318
## s(tmpd)       4.801666 5.878771 244.698414 0.0000000000

GAM interpretation

  • Effective degrees of freedom (edf) - degree of non-linearity, but with upper bound set by \(k\)

  • Reference degrees of freedom (Ref.df) - used for calculating test statistic

  • p-value - null hypothesis is a linear effect (this can include a flat linear effect, i.e., no effect)

    • Biased low because they don’t account for need to estimate \(\lambda\)

    • Works well asymptotically

    • Doesn’t work for unpenalized smooths (random effects)

Model checking

Effect of \(k\) on edf (for median particles <2.5 mg per \(m^3\))

Model checking

  • Check whether \(k\) is adequate - can set it very large to start, but this increases computation time

  • Do some iterative checking with larger values of \(k\) for multiple parameters, effective degrees of freedom should stabilize

## check model fit
gam.check(fit_gam_air) 

Model checking

\(k\) set to default for each variable

## 
## Method: REML   Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [-7.763319e-07,1.839545e-08]
## (score 2867.802 & scale 1).
## Hessian positive definite, eigenvalue range [0.2056908,1.115369].
## Model rank =  28 / 28 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                 k'  edf k-index p-value  
## s(pm25median) 9.00 3.52    0.99    0.38  
## s(o3median)   9.00 3.43    0.97    0.19  
## s(tmpd)       9.00 4.80    0.94    0.06 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model checking

Again but with \(k = 20\) for all \(x\)

## fit with k = 20 for all
fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = 20) + 
                           s(o3median, k = 20) + 
                           s(tmpd, k = 20), 
                         family = "poisson", 
                         data = chicago, method = "REML")

## check p-values
gam.check(fit_gam_air) 

Model checking

\(p\)-value above \(\alpha\) and edf is not close to \(k\)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 3 iterations.
## Gradient range [-1.827205e-06,5.041726e-07]
## (score 2867.819 & scale 1).
## Hessian positive definite, eigenvalue range [0.1852984,1.155305].
## Model rank =  58 / 58 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                  k'   edf k-index p-value  
## s(pm25median) 19.00  3.57    0.99   0.350  
## s(o3median)   19.00  3.59    0.97   0.215  
## s(tmpd)       19.00  5.01    0.94   0.075 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Other model diagnostics

gam.check() provides some basic diagnostic plots

## 
## Method: REML   Optimizer: outer newton
## full convergence after 3 iterations.
## Gradient range [-1.827205e-06,5.041726e-07]
## (score 2867.819 & scale 1).
## Hessian positive definite, eigenvalue range [0.1852984,1.155305].
## Model rank =  58 / 58 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                  k'   edf k-index p-value  
## s(pm25median) 19.00  3.57    0.99    0.32  
## s(o3median)   19.00  3.59    0.97    0.24  
## s(tmpd)       19.00  5.01    0.94    0.06 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model selection

  • When we select \(\lambda\) we are only penalizing wiggliness
  • However, this only penalizes functions of \(X\) with a second derivative > 0
  • Thus, linear relationships can’t be removed from the model this way

Model selection

Use select = TRUE to apply penalties to the linear terms as well

## apply penalties to linear terms 
fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = 20) + 
                           s(o3median, k = 20) + 
                           s(tmpd, k = 20), 
                         select = TRUE,
                         family = "poisson",
                         data = chicago, method = "REML")

Model selection

Check to see if edf < 1

## check edf values
summary(fit_gam_air)$s.table
##                    edf   Ref.df     Chi.sq     p-value
## s(pm25median) 3.567278 4.543784  20.830454 0.000704636
## s(o3median)   3.591463 4.540236   7.006517 0.201616686
## s(tmpd)       5.006420 6.297518 243.866273 0.000000000

Model selection

Use gratia::draw() to create partial effect plots for the smooths

Flexibility of GAMs

We have lots of options when fitting GAMs

so we have to give very careful thought to our goal(s)

Flexibility of GAMs

We have lots of options when fitting GAMs

so we have to give very careful thought to our goal(s)


Just because we can fit a GAM doesn’t mean it will be interpretable or useful!

Another example

Atmospheric CO2 concentrations at the South Pole (1957 to 1999)

Candidate model

Cubic regression basis because have a lot of wiggliness in the data

Let’s make \(k\) really large then

## fit GAM with cubic regression basis
fit_co2_cubic <- gam(co2 ~ s(c.month, k = 300, bs = "cr"),
                     method = 'REML',
                     data = co2s)

Model summary

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## co2 ~ s(c.month, k = 300, bs = "cr")
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.382e+02  4.444e-03   76111   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df     F p-value    
## s(c.month) 205.3  241.4 44646  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =      1   Deviance explained =  100%
## -REML = 36.492  Scale est. = 0.0084334  n = 427

Predicting the future

Not a great model

Another candidate model

Let’s now distinguish within-year variation from the long-term trend

  • cyclic spline on month (1:12)

  • cubic regression spline on c.month (1:507)

## fit GAM with 
b2 <- gam(co2 ~ s(month, bs = "cc") + 
            s(c.month, bs = "cr", k = 300), 
          knots = list(month = c(0.5, 12.5)),
          method = 'REML',
          data = co2s)

Cyclic & cubic splines

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## co2 ~ s(month, bs = "cc") + s(c.month, bs = "cr", k = 300)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.382e+02  5.331e-03   63447   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df       F p-value    
## s(month)     7.266    8.0   165.1  <2e-16 ***
## s(c.month) 107.309  131.8 56450.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =      1   Deviance explained =  100%
## -REML = -103.74  Scale est. = 0.012136  n = 427

Cyclic & cubic splines

Much more reasonable predictions

A cautionary note

Building GAMs requires more choices than building GLMs


It requires thinking hard about the system and the question

Options for interactions

main effects & interactions te(x1, x2)

  • separate smoothness parameter and separate marginal basis

Options for interactions

main effects & interactions te(x1, x2)

  • separate smoothness parameter and separate marginal basis

interactions only ti(x1, x2)

  • when the main effect are specified elsewhere

Options for interactions

main effects & interactions te(x1, x2)

  • separate smoothness parameter and separate marginal basis

interactions only ti(x1, x2)

  • when the main effect are specified elsewhere

bivariate thin plate splines s(x1, x2, bs = 'tp')

  • single smoothness parameter, only use when units are the same (eg, over space)

Options for interactions

main effects & interactions te(x1, x2)

  • separate smoothness parameter and separate marginal basis

interactions only ti(x1, x2)

  • when the main effect are specified elsewhere

bivariate thin plate splines s(x1, x2, bs = 'tp')

  • single smoothness parameter, only use when units are the same (eg, over space)

smooth by different levels of a factor f + s(x1, by = f)

  • be sure to include the factor as a fixed effect

  • akin to random regression slopes

Mixed models

GAMMs

We can also fit random effects

  • mgcv::gam(...,bs = “re”)`

  • mgcv::gamm()

  • gamm4::gamm4()

Thus, these models are equivalent:

  • lme(y ~ 1 + (1|f), method = "REML")

  • gam(y ~ s(f, bs = "re"), method = "REML")

Summary

  • GAMs using splines offer tremendous flexibility
  • There is some tradeoff between flexibility and ease of understanding
  • Many of our relationships in ecology aren’t linear
  • What are some examples?