Gavin Simpson and colleagues have produced a lot of useful teaching material on GAMs that I’ve made use of here. For example:
3 June 2026
Gavin Simpson and colleagues have produced a lot of useful teaching material on GAMs that I’ve made use of here. For example:
Understand some diagnostics
Understand more about model selection techniques with GAMs
Become familiar with some of the flexibility of GAMs
A generalized additive model is an extension of the GLM with a predictor that is a sum of smoothed (and possibly unsmoothed) terms:
\[ y_i = \beta_0 + \sum_j s_j(x_{j,i}) + \sum_k \beta_k x_{k,i} + \epsilon_i \]
And with splines, our smoothed terms are a sum of weighted basis functions:
\[ s(x)=\sum_k \beta_k b_k(x) \]
Ask yourself:
data/error distribution?
predictors?
type of smoother(s)?
Check candidate models via diagnostics
Chicago air quality and death rate (Peng and Welty (2004); {NMMAPSdata})
death: total deaths per day
pm25median: median particles < 2.5 mg per cubic m
o3median: ozone in parts per billion
tmpd: temperature in Fahrenheit
## get data
data(chicago, package = "gamair")
## fit gam
fit_gam_air <- mgcv::gam(death ~ s(pm25median) +
s(o3median) +
s(tmpd),
family = "poisson",
data = chicago, method = "REML")
Smoothed Terms
## edf Ref.df Chi.sq p-value ## s(pm25median) 3.518573 4.440907 20.860574 0.0006254809 ## s(o3median) 3.430259 4.305665 6.373377 0.2230510318 ## s(tmpd) 4.801666 5.878771 244.698414 0.0000000000
Effective degrees of freedom (edf) - degree of non-linearity, but with upper bound set by \(k\)
Reference degrees of freedom (Ref.df) - used for calculating test statistic
p-value - null hypothesis is a linear effect (this can include a flat linear effect, i.e., no effect)
Biased low because they don’t account for need to estimate \(\lambda\)
Works well asymptotically
Doesn’t work for unpenalized smooths (random effects)
Check whether \(k\) is adequate - can set it very large to start, but this increases computation time
Do some iterative checking with larger values of \(k\) for multiple parameters, effective degrees of freedom should stabilize
## check model fit gam.check(fit_gam_air)
## ## Method: REML Optimizer: outer newton ## full convergence after 4 iterations. ## Gradient range [-7.763319e-07,1.839545e-08] ## (score 2867.802 & scale 1). ## Hessian positive definite, eigenvalue range [0.2056908,1.115369]. ## Model rank = 28 / 28 ## ## Basis dimension (k) checking results. Low p-value (k-index<1) may ## indicate that k is too low, especially if edf is close to k'. ## ## k' edf k-index p-value ## s(pm25median) 9.00 3.52 0.99 0.38 ## s(o3median) 9.00 3.43 0.97 0.19 ## s(tmpd) 9.00 4.80 0.94 0.06 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## fit with k = 20 for all
fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = 20) +
s(o3median, k = 20) +
s(tmpd, k = 20),
family = "poisson",
data = chicago, method = "REML")
## check p-values
gam.check(fit_gam_air)
## ## Method: REML Optimizer: outer newton ## full convergence after 3 iterations. ## Gradient range [-1.827205e-06,5.041726e-07] ## (score 2867.819 & scale 1). ## Hessian positive definite, eigenvalue range [0.1852984,1.155305]. ## Model rank = 58 / 58 ## ## Basis dimension (k) checking results. Low p-value (k-index<1) may ## indicate that k is too low, especially if edf is close to k'. ## ## k' edf k-index p-value ## s(pm25median) 19.00 3.57 0.99 0.350 ## s(o3median) 19.00 3.59 0.97 0.215 ## s(tmpd) 19.00 5.01 0.94 0.075 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gam.check() provides some basic diagnostic plots## ## Method: REML Optimizer: outer newton ## full convergence after 3 iterations. ## Gradient range [-1.827205e-06,5.041726e-07] ## (score 2867.819 & scale 1). ## Hessian positive definite, eigenvalue range [0.1852984,1.155305]. ## Model rank = 58 / 58 ## ## Basis dimension (k) checking results. Low p-value (k-index<1) may ## indicate that k is too low, especially if edf is close to k'. ## ## k' edf k-index p-value ## s(pm25median) 19.00 3.57 0.99 0.32 ## s(o3median) 19.00 3.59 0.97 0.24 ## s(tmpd) 19.00 5.01 0.94 0.06 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Use select = TRUE to apply penalties to the linear terms as well
## apply penalties to linear terms
fit_gam_air <- mgcv::gam(death ~ s(pm25median, k = 20) +
s(o3median, k = 20) +
s(tmpd, k = 20),
select = TRUE,
family = "poisson",
data = chicago, method = "REML")
Check to see if edf < 1
## check edf values summary(fit_gam_air)$s.table
## edf Ref.df Chi.sq p-value ## s(pm25median) 3.567278 4.543784 20.830454 0.000704636 ## s(o3median) 3.591463 4.540236 7.006517 0.201616686 ## s(tmpd) 5.006420 6.297518 243.866273 0.000000000
gratia::draw() to create partial effect plots for the smoothsWe have lots of options when fitting GAMs
so we have to give very careful thought to our goal(s)
We have lots of options when fitting GAMs
so we have to give very careful thought to our goal(s)
Just because we can fit a GAM doesn’t mean it will be interpretable or useful!
Cubic regression basis because have a lot of wiggliness in the data
Let’s make \(k\) really large then
## fit GAM with cubic regression basis
fit_co2_cubic <- gam(co2 ~ s(c.month, k = 300, bs = "cr"),
method = 'REML',
data = co2s)
## ## Family: gaussian ## Link function: identity ## ## Formula: ## co2 ~ s(c.month, k = 300, bs = "cr") ## ## Parametric coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.382e+02 4.444e-03 76111 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df F p-value ## s(c.month) 205.3 241.4 44646 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 1 Deviance explained = 100% ## -REML = 36.492 Scale est. = 0.0084334 n = 427
Let’s now distinguish within-year variation from the long-term trend
cyclic spline on month (1:12)
cubic regression spline on c.month (1:507)
## fit GAM with
b2 <- gam(co2 ~ s(month, bs = "cc") +
s(c.month, bs = "cr", k = 300),
knots = list(month = c(0.5, 12.5)),
method = 'REML',
data = co2s)
## ## Family: gaussian ## Link function: identity ## ## Formula: ## co2 ~ s(month, bs = "cc") + s(c.month, bs = "cr", k = 300) ## ## Parametric coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.382e+02 5.331e-03 63447 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df F p-value ## s(month) 7.266 8.0 165.1 <2e-16 *** ## s(c.month) 107.309 131.8 56450.2 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 1 Deviance explained = 100% ## -REML = -103.74 Scale est. = 0.012136 n = 427
Building GAMs requires more choices than building GLMs
It requires thinking hard about the system and the question
main effects & interactions te(x1, x2)
main effects & interactions te(x1, x2)
interactions only ti(x1, x2)
main effects & interactions te(x1, x2)
interactions only ti(x1, x2)
bivariate thin plate splines s(x1, x2, bs = 'tp')
main effects & interactions te(x1, x2)
interactions only ti(x1, x2)
bivariate thin plate splines s(x1, x2, bs = 'tp')
smooth by different levels of a factor f + s(x1, by = f)
be sure to include the factor as a fixed effect
akin to random regression slopes
We can also fit random effects
mgcv::gam(...,bs = “re”)`
mgcv::gamm()
gamm4::gamm4()
Thus, these models are equivalent:
lme(y ~ 1 + (1|f), method = "REML")
gam(y ~ s(f, bs = "re"), method = "REML")