- Understand the concept of a likelihood function
- Understand the difference between probability and likelihood
- Understand maximum likelihood estimation
- Understand the characteristics of maximum likelihood estimates
24 April 2020
What is maximum likelihood estimation?
A method used to estimate the parameter(s) of a model given some data
As the name suggests, the goal is to maximize the likelihood
Here we are referring to the likelihood of some parameters given some data, which can be written as
\[ \mathcal{L}(\theta | y) ~~ \text{or} ~~ \mathcal{L}(\boldsymbol{\theta} | \mathbf{y}) \]
We’ll write this as
\[ \mathcal{L}(y; \theta) ~~ \text{or} ~~ \mathcal{L}(\mathbf{y}; \boldsymbol{\theta}) \]
to avoid confusion with the “|” meaning conditional probability
Let’s define the likelihood function to be
\[ \mathcal{L}(y; \theta) = f_{\theta}(y) \]
where \(f_{\theta}(y)\) is a model for \(y\) with parameter(s) \(\theta\)
For discrete data, \(f_{\theta}(y)\) is the probability mass function (pmf)
For continuous data, \(f_{\theta}(y)\) is the probability density function (pdf)
The pmf or pdf can be that of any distribution
Let’s begin with the pdf for a Gaussian (normal) distribution
\[ y \sim \text{N}(\mu, \sigma^{2}) \\ ~ \\ f(y ; \mu, \sigma^{2}) = \left( \frac{1}{2 \pi \sigma^{2}} \right)^{1/2} \exp \left[ - \frac{(y - \mu)^2}{2 \sigma^2} \right] \]
Note that \(f(y ; \mu, \sigma^{2})\) is not a probability!
The pdf gives you densities for given values of \(y\), \(\mu\) & \(\sigma^{2}\)
Its only constraint is
\[ \int^{+\infty}_{-\infty} f(y) dy = 1 \]
For example, the density of a \(\text{Beta}(\alpha, \beta)\) distribution can exceed 1 for some values of \(y\)
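For example, evaluating a couple of densities in R (the specific parameter values here are arbitrary) gives values greater than 1:
## Beta(5, 5) density at y = 0.5
dbeta(0.5, 5, 5)
## [1] 2.460938
## normal density at y = 0 with a small standard deviation
dnorm(0, mean = 0, sd = 0.1)
## [1] 3.989423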
Probability is linked to possible results
Possible results are mutually exclusive and exhaustive
Likelihood is linked to hypotheses
Hypotheses are neither mutually exclusive nor exhaustive
What does it mean to maximize \(\mathcal{L}(y; \theta)\)?
We want to find the parameter(s) \(\theta\) of our model \(f_{\theta}(y)\) which are most likely to have generated our observed data \(y\)
More formally, we can write this as
\[ \begin{aligned} \hat{\theta} &= \max_{\theta} \mathcal{L}(y; \theta) \\ &= \max_{\theta} f_{\theta}(y) \end{aligned} \]
In practice, we have multiple observations \(y = \{y_1, y_2, \dots, y_n\}\), so we need the joint distribution for \(y\)
\[ \hat{\theta} = \max_{\theta} f_{\theta}(y_1, y_2, \dots, y_n) \]
Remember independent and identically distributed (IID) errors?
If the data \(Y\) are independent, we can make use of
\[ f_{\theta}(y_1, y_2, \dots, y_n) = \prod_{i = 1}^n f_{\theta}(y_i) \]
The joint probability (or density) of all of the \(y_i\) is the product of their marginal probabilities (or densities)
If the data \(Y\) are identically distributed, we can use the same distribution and parameterization for \(f_{\theta}(y)\)
If the data \(Y\) are both independent and identically distributed, then we have
\[ \hat{\theta} = \max_{\theta} \prod_{i = 1}^n f_{\theta}(y_i) \]
(This assumption isn’t necessary, but it makes our lives easier)
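For example, here is a minimal sketch in R (with three made-up observations and a standard normal model) showing that the joint density of independent observations equals the product of their marginal densities:
## three hypothetical observations
yy <- c(-0.5, 0.2, 1.1)
## product of the marginal N(0, 1) densities
prod(dnorm(yy))
## equivalently, exponentiate the sum of the log-densities
exp(sum(dnorm(yy, log = TRUE)))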
The value(s) of \(\hat{\theta}\) that maximizes the likelihood function is/are called the maximum likelihood estimate(s) (MLE) of \(\theta\)
Let’s begin with a simple example of coin tossing
Assume we have a “fair” coin with equal chance of coming up heads or tails
\(\Pr(H) = \Pr(T)\)
If we flip the coin 2 times, what is the probability that we get exactly 1 heads?
Our 4 possible outcomes are
\(\{H, H\}\) X
\(\{H, T\}\) ✓
\(\{T, H\}\) ✓
\(\{T, T\}\) X
Exactly 1 heads occurs in 2 of the 4 outcomes, so \(\Pr(H = 1) = 2/4 = 0.5\)
Let’s think about this in terms of the probabilities
\(\Pr(H = 1) = 0.25 + 0.25 = 0.5\)
We can generalize this by
\[ \begin{aligned} \Pr(H = 1) &= \Pr(H)(1 - \Pr(H)) + (1 - \Pr(H)) \Pr(H) \\ &= 2 [\Pr(H) (1 - \Pr(H))] \end{aligned} \]
Now consider the probability of exactly 1 heads in 3 coin tosses
\(\{H, H, H\}\) X
\(\{H, H, T\}\) X
\(\{H, T, H\}\) X
\(\{T, H, H\}\) X
\(\{H, T, T\}\) ✓
\(\{T, H, T\}\) ✓
\(\{T, T, H\}\) ✓
\(\{T, T, T\}\) X
\[ \begin{aligned} \Pr(H = 1) &= \Pr(H) (1 - \Pr(H)) (1 - \Pr(H)) \\ & ~~~~~ + (1 - \Pr(H)) \Pr(H) (1 - \Pr(H)) \\ & ~~~~~ + (1 - \Pr(H)) (1 - \Pr(H)) \Pr(H) \\ &= 3 [\Pr(H) (1 - \Pr(H))^2] \end{aligned} \]
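With a fair coin, \(\Pr(H) = 0.5\), so
\[ \Pr(H = 1) = 3 \times [0.5 \times (1 - 0.5)^2] = 0.375 \]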
Let’s define \(k\) to be the number of “successes” out of \(n\) “trials” and \(p\) to be the probability of a success
We can generalize our probability statement to be
\[ \Pr(k; n, p) = \left( \begin{array}{c} n \\ k \end{array} \right) p^k (1 - p)^{n - k} \\ ~ \\ \left( \begin{array}{c} n \\ k \end{array} \right) = \frac{n!}{k!(n - k)!} \]
What is the probability of getting 1 heads in 3 tosses?
## trials
n <- 3
## successes
k <- 1
## probability of success
p <- 0.5
## Pr(k = 1)
choose(n, k) * p^k * (1 - p)^(n - k)
## [1] 0.375
What is the probability of getting 1 heads in 3 tosses?
## trials
n <- 3
## successes
k <- 1
## probability of success
p <- 0.5
## Pr(k = 1)
dbinom(k, n, p)
## [1] 0.375
What if we don’t know what \(p\) is?
For example, we tag 100 juvenile fish in June and 20 are alive the following year
What is the probability of surviving?
We need to find \(p\) that maximizes the likelihood
\[ \mathcal{L}(k; n, p) = \left( \begin{array}{c} n \\ k \end{array} \right) p^k (1 - p)^{n - k} \\ \Downarrow \\ \max_p \mathcal{L}(20; 100, p) = \left( \begin{array}{c} 100 \\ 20 \end{array} \right) p^{20} (1 - p)^{100 - 20} \]
Let’s try some different values for \(p\)
\[ \small{ \mathcal{L}(20; 100, 0.3) = \left( \begin{array}{c} 100 \\ 20 \end{array} \right) 0.3^{20} (1 - 0.3)^{100 - 20} \approx 0.0076 \\ \mathcal{L}(20; 100, 0.25) = \left( \begin{array}{c} 100 \\ 20 \end{array} \right) 0.25^{20} (1 - 0.25)^{100 - 20} \approx 0.049 \\ \mathcal{L}(20; 100, 0.2) = \left( \begin{array}{c} 100 \\ 20 \end{array} \right) 0.2^{20} (1 - 0.2)^{100 - 20} \approx 0.099 \\ \mathcal{L}(20; 100, 0.15) = \left( \begin{array}{c} 100 \\ 20 \end{array} \right) 0.15^{20} (1 - 0.15)^{100 - 20} \approx 0.040 } \]
The maximum likelihood occurs at \(p = 0.2\)
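As a rough check, we can evaluate the likelihood in R over a grid of candidate values for \(p\) (the grid spacing here is arbitrary) and pick the value with the largest likelihood:
## candidate values for p
pp <- seq(0.01, 0.99, 0.01)
## likelihood of each candidate
LL <- dbinom(20, 100, pp)
## candidate with the maximum likelihood
pp[which.max(LL)]
## [1] 0.2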
In practice, finding the MLE is not so trivial
We will use numerical optimization methods to find the MLE
Let’s return to our general statement for the MLE
\[ \hat{\theta} = \max_{\theta} \prod_{i = 1}^n f_{\theta}(y_i) \]
If the densities are small and/or \(n\) is large, the product will become increasingly tiny
To address this, we can make use of the logarithm function, which has 2 nice properties:
it’s a monotonically increasing function
\(\log (ab) = \log(a) + \log(b)\)
We thereby transform our likelihood into a log-likelihood
\[ \begin{aligned} \hat{\theta} &= \max_{\theta} \prod_{i = 1}^n f_{\theta}(y_i) \\ &= \max_{\theta} \sum_{i = 1}^n \log f_{\theta}(y_i) \end{aligned} \]
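To see why this matters numerically, here is a small illustration in R with simulated standard normal data (the sample size and seed are arbitrary):
set.seed(42)
yy <- rnorm(1000)
## the product of 1000 densities underflows to 0
prod(dnorm(yy))
## [1] 0
## the sum of the log-densities is large and negative, but finite
sum(dnorm(yy, log = TRUE))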
If the data \(y\) are both independent and identically distributed, we can average over the log-likelihoods and remove the dependency on the number of observations
\[ \begin{aligned} \hat{\theta} &= \max_{\theta} \sum_{i = 1}^n \log f_{\theta}(y_i) \\ &= \max_{\theta} \frac{1}{n} \sum_{i = 1}^n \log f_{\theta}(y_i) \end{aligned} \]
Lastly, because we have so far focused on minimizing functions (e.g., sums of squares in least squares), we’ll minimize the negative log-likelihood rather than maximize the log-likelihood
\[ \hat{\theta} = \max_{\theta} \frac{1}{n} \sum_{i = 1}^n \log f_{\theta}(y_i) \\ \Downarrow \\ \hat{\theta}= \min_{\theta} -\frac{1}{n} \sum_{i = 1}^n \log f_{\theta}(y_i) \]
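For instance, here is a minimal sketch for the fish survival example (the function name `nll` is ours, not part of any package) using R's built-in one-dimensional optimizer:
## negative log-likelihood for k = 20 survivors out of n = 100 tagged fish
nll <- function(p, k = 20, n = 100) {
  -dbinom(k, n, p, log = TRUE)
}
## minimize the negative log-likelihood over p in (0, 1)
optimize(nll, interval = c(0.001, 0.999))$minimum
The result is approximately 0.2, matching the estimate we found by trying values of \(p\) above.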
Let’s return to the pdf for a normal distribution
\[ f(y ; \mu, \sigma^{2}) = \left( \frac{1}{2 \pi \sigma^{2}} \right)^{1/2} \exp \left[ - \frac{(y - \mu)^2}{2 \sigma^2} \right] \\ \Downarrow \\ \begin{aligned} f(y_1, \dots, y_n ; \mu, \sigma^{2}) &= \prod_{i = 1}^n f(y_i ; \mu, \sigma^{2}) \\ &= \left( \frac{1}{2 \pi \sigma^{2}} \right)^{n/2} \exp \left[ - \frac{\sum_{i = 1}^n (y_i - \mu)^2}{2 \sigma^2} \right] \end{aligned} \]
The log-likelihood is then
\[ f(y_1, \dots, y_n ; \mu, \sigma^{2}) = \left( \frac{1}{2 \pi \sigma^{2}} \right)^{n/2} \exp \left[ - \frac{\sum_{i = 1}^n (y_i - \mu)^2}{2 \sigma^2} \right] \\ \Downarrow \\ \log f(y_1, \dots, y_n ; \mu, \sigma^{2}) = -\frac{n}{2} \log(2 \pi \sigma^{2}) -\frac{1}{2 \sigma^2} \sum_{i = 1}^n (y_i - \mu)^2 \]
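As a sanity check (with simulated data and arbitrary parameter values), the closed-form expression matches the sum of log-densities from `dnorm()`:
set.seed(123)
yy <- rnorm(20, mean = 3, sd = 2)
mu <- 3
sigma <- 2
nn <- length(yy)
## sum of the log-densities
sum(dnorm(yy, mean = mu, sd = sigma, log = TRUE))
## closed form from above
-nn / 2 * log(2 * pi * sigma^2) - sum((yy - mu)^2) / (2 * sigma^2)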
What values of \(\mu\) and \(\sigma\) maximize the log-likelihood?
We need to take some derivatives!
\[ \frac{\partial}{\partial \mu} \log f(y ; \mu, \sigma^{2}) = 0 - \frac{-2 n (\bar{y} - \mu)}{2 \sigma^2} = 0 \\ \Downarrow \\ \frac{-2 n (\bar{y} - \mu)}{2 \sigma^2} = 0 \\ \Downarrow \\ \hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i \]
\[ \frac{\partial}{\partial \sigma} \log f(y ; \mu, \sigma^{2}) = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i = 1}^n (y_i - \mu)^2 = 0 \\ \Downarrow \\ \frac{n}{\sigma} = \frac{1}{\sigma^3} \sum_{i = 1}^n (y_i - \mu)^2 \\ \Downarrow \\ \hat{\sigma}^2 = \frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2 \\ \]
Recall from earlier lectures that we defined
\[ \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i = 1}^n (y_i - \bar{y})^2 \\ \]
but our MLE is
\[ \hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2 \\ \]
Hence, our MLE for the variance is biased low
\[ \begin{aligned} (n-1) \hat{\sigma}^2 &= \sum_{i = 1}^n (y_i - \bar{y})^2 \\ n \hat{\sigma}^2_{MLE} &= \sum_{i = 1}^n (y_i - \bar{y})^2 \end{aligned} \\ \Downarrow \\ \hat{\sigma}^2_{MLE} = \frac{n - 1}{n} \hat{\sigma}^2 \]
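A quick numerical check in R (the data are simulated; any sample would do):
set.seed(123)
nn <- 10
yy <- rnorm(nn)
## unbiased estimator (divides by n - 1), as returned by var()
var(yy)
## MLE (divides by n)
mean((yy - mean(yy))^2)
## their ratio is (n - 1) / n = 0.9
mean((yy - mean(yy))^2) / var(yy)
## [1] 0.9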
Asymptotically, as \(n \rightarrow \infty\), MLEs are unbiased, efficient (minimum variance), and normally distributed
Invariance: if \(\hat{\theta}\) is MLE of \(\theta\) then \(f(\hat{\theta})\) is MLE of \(f(\theta)\)
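For example, because \(\hat{p} = 0.2\) was the MLE of the survival probability above, the MLE of the odds of surviving is \(\hat{p} / (1 - \hat{p}) = 0.2 / 0.8 = 0.25\) by invariance.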
For cases where \(\mathbf{y} \sim \text{N}(\mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I})\) (i.e., IID Gaussian errors), the least squares estimator
\(\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}\)
is also the MLE for \(\boldsymbol{\beta}\)
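As an illustration with simulated data (the intercept and slope values are arbitrary), the coefficients from `lm()` match the closed-form solution:
set.seed(123)
xx <- runif(30)
yy <- 1 + 2 * xx + rnorm(30, sd = 0.5)
## least squares / ML estimates via lm()
coef(lm(yy ~ xx))
## the same estimates via the normal equations
X <- cbind(1, xx)
solve(t(X) %*% X) %*% t(X) %*% yy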
Maximum likelihood estimation is much more general than least squares, which means we can use it for
mixed effects models
generalized linear models
Bayesian inference