- Understand the three components of a generalized linear model
- Understand how to identify the proper distribution for a generalized linear model
- Understand the concept of a link function
8 May 2020
Given the prevalence of discrete data in ecology (and elsewhere), we seek a means for modeling them
GLMs were developed by Nelder & Wedderburn in the 1970s
They include (as special cases):

- linear regression
- Poisson (log-linear) regression
- logistic regression
In particular, GLMs can explicitly model discrete data as outcomes
What is the distributional form of the random process(es) in my data?
The Poisson distribution is perhaps the best known distribution for count data
It gives the probability of a given number of events occurring in a fixed interval of time or space
It’s unique in that it has one parameter \(\lambda\) to describe both the mean and variance
\[ y_i \sim \text{Poisson}(\lambda) \] \[ \text{Mean}(y) = \text{Var}(y) = \lambda \]
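As a quick numerical check (a minimal NumPy sketch; the rate and sample size are arbitrary), simulated Poisson counts have a sample mean and sample variance that both converge to \(\lambda\):

```python
import numpy as np

# simulate many Poisson counts with a single rate parameter lambda
rng = np.random.default_rng(seed=42)
lam = 3.5                                   # arbitrary rate; mean and variance are both lambda
y = rng.poisson(lam=lam, size=100_000)

print(f"sample mean     = {y.mean():.3f}")  # ~3.5
print(f"sample variance = {y.var():.3f}")   # ~3.5
```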
Ratios (fractions) are also very important in ecology
They convey proportions such as:

- survivors / tagged individuals
- infected / susceptible
- student emails / total emails
The simplest ratio has a denominator of 1 and a numerator of either 0 or 1
For an individual, this can represent:

- present (1/1) or absent (0/1)
- alive (1/1) or dead (0/1)
- mature (1/1) or immature (0/1)
The Bernoulli distribution describes the probability of a single “event” \(y_i\) occurring
\[ y_i \sim \text{Bernoulli}(p) \]
where
\[ \text{Mean}(y) = p ~ ~ ~ ~ ~ \text{Var}(y) = p(1 - p) \]
The binomial distribution is closely related to the Bernoulli
It describes the number of “successes” \(k\) in a sequence of \(n\) independent Bernoulli “trials”
For example, the number of heads in 4 coin tosses
For a population, these could be:

- \(k\) survivors out of \(n\) tagged individuals
- \(k\) infected individuals out of \(n\) susceptible individuals
- \(k\) counts of allele A in \(n\) total chromosomes
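Here is a minimal simulation sketch (NumPy; the values of \(n\), \(p\), and the number of replicates are arbitrary) showing that summing \(n\) independent Bernoulli trials yields a binomial count with mean \(np\) and variance \(np(1-p)\):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, p = 4, 0.5                                # e.g., 4 coin tosses with P(heads) = 0.5
reps = 100_000

# each row is one "experiment": n independent Bernoulli(p) trials
trials = rng.binomial(n=1, p=p, size=(reps, n))
k = trials.sum(axis=1)                       # number of successes per experiment

print(f"sample mean = {k.mean():.3f}   (n p = {n * p})")
print(f"sample var  = {k.var():.3f}   (n p (1 - p) = {n * p * (1 - p)})")
```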
There are three important components to a GLM:

1. Distribution of the data: are they counts or proportions?
2. Link function \(g\): specifies the relationship between the mean of the distribution and some linear predictor(s)
3. Linear predictor: \(\eta = \mathbf{X} \boldsymbol{\beta}\)
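The three components map directly onto a generative recipe. Below is a minimal sketch (NumPy; the single covariate, the coefficients, and the Poisson/log-link pairing are illustrative assumptions, not part of the original example):

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n = 50

# 1) linear predictor: eta = X beta
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])  # intercept + one covariate
beta = np.array([0.5, 1.2])                              # illustrative coefficients
eta = X @ beta

# 2) link function: log(mu) = eta, so the mean is mu = exp(eta)
mu = np.exp(eta)

# 3) distribution of the data: y ~ Poisson(mu)
y = rng.poisson(mu)
```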
| Distribution | Link function | Mean function |
|:---:|:---:|:---:|
| Normal | \(1(\mu) = \mathbf{X} \boldsymbol{\beta}\) | \(\mu = \mathbf{X} \boldsymbol{\beta}\) |
| Poisson | \(\log (\mu) = \mathbf{X} \boldsymbol{\beta}\) | \(\mu = \exp (\mathbf{X} \boldsymbol{\beta})\) |
| Binomial | \(\log \left( \frac{\mu}{1 - \mu} \right) = \mathbf{X} \boldsymbol{\beta}\) | \(\mu = \frac{\exp (\mathbf{X} \boldsymbol{\beta})}{1 + \exp (\mathbf{X} \boldsymbol{\beta})}\) |

where \(1(\mu)\) denotes the identity link
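Each row of the table is a pair of one-line transformations. A sketch in NumPy/SciPy, where `eta` stands for the linear predictor \(\mathbf{X} \boldsymbol{\beta}\) (the example values are arbitrary):

```python
import numpy as np
from scipy.special import expit, logit   # inverse-logit and logit

eta = np.array([-2.0, 0.0, 1.5])         # example values of the linear predictor X beta

mu_normal = eta                          # Normal:   identity link, mu = eta
mu_poisson = np.exp(eta)                 # Poisson:  log link,      mu = exp(eta)
mu_binomial = expit(eta)                 # Binomial: logit link,    mu = exp(eta) / (1 + exp(eta))

# applying the link to the mean recovers the linear predictor
assert np.allclose(np.log(mu_poisson), eta)
assert np.allclose(logit(mu_binomial), eta)
```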
Where did we find these link functions?
For the exponential family of distributions (e.g., Normal, Gamma, Poisson) we can write the distribution of \(y\) as
\[
f(y; \theta, \phi) = \exp \left( \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi)\right)
\]
- \(\theta\) is the canonical parameter of interest
- \(\phi\) is a scale (variance) parameter
We seek some canonical function \(g\) that connects \(\eta\), \(\mu\), and \(\theta\) such that
\[ g(\mu) = \eta \\ \eta \equiv \theta \]
For the Normal distribution we have

\[ f(y; \theta, \phi) = \exp \left( \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi)\right) \\ \Downarrow \\ f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(y - \mu)^2}{2 \sigma^2} \right) \]

with \(\theta = 1(\mu) = \mu\) (the identity link) and \(\phi = \sigma^2\), so that

\(a(\phi) = \phi ~~~~~~ b(\theta) = \frac{\theta^2}{2} ~~~~~~ c(y, \phi) = - \frac{\tfrac{y^2}{\phi} + \log (2 \pi \phi)}{2}\)
For the Poisson distribution we have

\[ f(y; \theta, \phi) = \exp \left( \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi)\right) \\ \Downarrow \\ f(y; \mu) = \frac{\exp (- \mu) \mu^y}{y!} \]

with \(\theta = \log (\mu)\) and \(\phi = 1\), so that

\(a(\phi) = 1 ~~~~~~ b(\theta) = \exp (\theta) ~~~~~~ c(y, \phi) = - \log (y!)\)
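As a numerical sanity check (a SciPy sketch; the values of \(\mu\) and \(y\) are arbitrary), plugging these pieces into the exponential-family form reproduces the Poisson pmf exactly:

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln        # log(y!) = gammaln(y + 1)

mu = 3.5
y = np.arange(0, 10)

theta = np.log(mu)                       # canonical parameter
b = np.exp(theta)                        # b(theta) = exp(theta)
c = -gammaln(y + 1)                      # c(y, phi) = -log(y!)

pmf_expfam = np.exp(y * theta - b + c)   # a(phi) = 1
assert np.allclose(pmf_expfam, poisson.pmf(y, mu))
```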
For the binomial distribution there are several possible link functions:

- logit
- probit
- complementary log-log
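A small sketch (SciPy) of the three corresponding inverse-link functions, each mapping a linear predictor back onto a probability; the grid of values is arbitrary:

```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm

eta = np.linspace(-3, 3, 7)              # arbitrary values of the linear predictor

p_logit = expit(eta)                     # logit:   p = exp(eta) / (1 + exp(eta))
p_probit = norm.cdf(eta)                 # probit:  p = Phi(eta), the standard normal CDF
p_cloglog = 1 - np.exp(-np.exp(eta))     # complementary log-log: p = 1 - exp(-exp(eta))
```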
The word generalized means these models are broadly applicable
For example, GLMs include linear regression models
We can rewrite the familiar regression model, step by step, in terms of the three GLM components:

\[
y_i = \alpha + \beta x_i + \epsilon_i ~~~~~~ \epsilon_i \sim \text{N}(0,\sigma^2) \\
\Downarrow \\
y_i = \mu_i + \epsilon_i ~~~~~~ \mu_i = \alpha + \beta x_i ~~~~~~ \epsilon_i \sim \text{N}(0,\sigma^2) \\
\Downarrow \\
y_i \sim \text{N}(\mu_i,\sigma^2) ~~~~~~ \mu_i = \alpha + \beta x_i \\
\Downarrow \\
y_i \sim \text{N}(\mu_i,\sigma^2) ~~~~~~ 1(\mu_i) = \mu_i ~~~~~~ \mu_i = \alpha + \beta x_i \\
\Downarrow \\
\begin{aligned} \text{data distribution:} & ~~ y_i \sim \text{N}(\mu_i,\sigma^2) \\ \\ \text{link function:} & ~~ 1(\mu_i) = \mu_i \\ \\ \text{linear predictor:} & ~~ \mu_i = \alpha + \beta x_i \end{aligned}
\]
Log-density of live trees per unit area \(y_i\) as a function of fire intensity \(F_i\)
\[ \begin{aligned} \text{data distribution:} & ~~ y_i \sim \text{N}(\mu_i,\sigma^2) \\ \\ \text{link function:} & ~~ 1(\mu_i) = \mu_i \\ \\ \text{linear predictor:} & ~~ \mu_i = \alpha + \beta F_i \end{aligned} \]
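A fitting sketch for this model using statsmodels (the data are simulated here, and the variable names `fire` and `log_density` are placeholders):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=3)
n = 40
fire = rng.uniform(0, 1, n)                            # F_i: fire intensity (simulated)
log_density = 4 - 2 * fire + rng.normal(0, 0.5, n)     # y_i: log-density of live trees (simulated)

X = sm.add_constant(fire)                              # intercept + F_i
fit = sm.GLM(log_density, X, family=sm.families.Gaussian()).fit()
print(fit.params)                                      # coefficients match ordinary linear regression
```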
We have been considering (log) density itself as a response
\[ \text{Density}_i = f (\text{Count}_i, \text{Area}_i) \\ \Downarrow \\ \text{Density}_i = \frac{\text{Count}_i}{\text{Area}_i} \\ \]
With GLMs, we can shift our focus to
\[ \text{Count}_i = f (\text{Area}_i) \]
Counts of live trees \(y_i\) as a function of area surveyed \(A_i\) and fire intensity \(F_i\)
\[ \begin{aligned} \text{data distribution:} & ~~ y_i \sim \text{Poisson}(\lambda_i) \\ \\ \text{link function:} & ~~ \text{log}(\lambda_i) = \mu_i \\ \\ \text{linear predictor:} & ~~ \mu_i = \alpha + \beta_1 A_i + \beta_2 F_i \end{aligned} \]
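A fitting sketch for this Poisson model using statsmodels (simulated data; the variable names are placeholders):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=11)
n = 60
area = rng.uniform(0.5, 2.0, n)                              # A_i: area surveyed
fire = rng.uniform(0, 1, n)                                  # F_i: fire intensity
counts = rng.poisson(np.exp(1.0 + 0.8 * area - 1.5 * fire))  # y_i ~ Poisson(lambda_i)

X = sm.add_constant(np.column_stack([area, fire]))
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()  # log link is the default
print(fit.params)
```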
Probability of spotting a sparrow \(p_i\) as a function of vegetation height \(H_i\)
\[ \begin{aligned} \text{data distribution:} & ~~ y_i \sim \text{Bernoulli}(p_i) \\ \\ \text{link function:} & ~~ \text{logit}(p_i) = \text{log}\left(\frac{p_i}{1-p_i}\right) = \mu_i \\ \\ \text{linear predictor:} & ~~ \mu_i = \alpha + \beta H_i \end{aligned} \]
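A fitting sketch for this logistic (Bernoulli) model using statsmodels (simulated data; the variable names are placeholders):

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(seed=5)
n = 100
height = rng.uniform(0, 3, n)                            # H_i: vegetation height
y = rng.binomial(1, expit(0.5 - 1.0 * height))           # y_i ~ Bernoulli(p_i)

X = sm.add_constant(height)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # logit link is the default
print(fit.params)
```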
Survival of salmon from parr to smolt \(s_i\) as a function of water temperature \(T_i\)
\[ \begin{aligned} \text{data distribution:} & ~~ y_i \sim \text{Binomial}(N_i, s_i) \\ \\ \text{link function:} & ~~ \text{logit}(s_i) = \text{log}\left(\frac{s_i}{1-s_i}\right) = \mu_i \\ \\ \text{linear predictor:} & ~~ \mu_i = \alpha + \beta T_i \end{aligned} \]
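A fitting sketch for this binomial model using statsmodels (simulated data; the variable names are placeholders). Note that statsmodels accepts a two-column (successes, failures) response for grouped binomial data:

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(seed=13)
n_sites = 30
N = rng.integers(20, 100, n_sites)                       # N_i: tagged parr per site
temp = rng.uniform(8, 18, n_sites)                       # T_i: water temperature
k = rng.binomial(N, expit(3.0 - 0.25 * temp))            # k_i survivors out of N_i

# a two-column (successes, failures) response specifies the binomial totals
endog = np.column_stack([k, N - k])
X = sm.add_constant(temp)
fit = sm.GLM(endog, X, family=sm.families.Binomial()).fit()
print(fit.params)
```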
There are three important components to GLMs:

1. Distribution of the data
2. Link function \(g\)
3. Linear predictor \(\eta\)