Maximum Likelihood Estimation and Likelihood Ratio Testing

1 / 44

Etherpad



https://etherpad.wikimedia.org/p/607-mle-2020

2 / 44

Applying Different Styles of Inference

  • Null Hypothesis Testing: What's the probability that things are not influencing our data?

    • Deductive
  • Model Comparison: Comparison of alternate hypotheses

    • Deductive or Inductive
  • Cross-Validation: How good are you at predicting new data?

    • Deductive
  • Probabilistic Inference: What's our degree of belief in the data?

    • Inductive
3 / 44


To Get There, We Need To Understand Likelihood and Deviance

5 / 44

A Likely Lecture

  1. Introduction to Likelihood

  2. Maximum Likelihood Estimation

  3. Maximum Likelihood and Linear Regression

  4. Comparing Hypotheses with Likelihood

6 / 44

Likelihood: how well data support a given hypothesis.

Note: Each and every parameter choice IS a hypothesis

7 / 44

Likelihood Defined



$\mathcal{L}(H|D) = p(D|H)$

where D is the data and H is the hypothesis (model), including both a data generating process and some choice of parameters (often called θ). The error generating process is inherent in the choice of probability distribution used for the calculation.

8 / 44

Thinking in Terms of Models and Likelihood

  • First we have a Data Generating Process

    • This is our hypothesis about how the world works

    • $\hat{y}_i = \beta_0 + \beta_1 x_i$

  • Then we have a likelihood of the data given this hypothesis

    • This allows us to calculate the likelihood of observing our data given the hypothesis

    • Called the Likelihood Function

    • $y_i \sim \mathcal{N}(\hat{y}_i, \sigma)$

9 / 44

All Kinds of Likelihood functions

  • Probability density functions are the most common

  • But, hey, $(y_i - \hat{y}_i)^2$ is one as well

  • Extremely flexible

  • The key is a function that has a minimum or maximum value, depending on your parameters

10 / 44

Likelihood of a Single Value

What is the likelihood of a value of 1.5 given a hypothesized Normal distribution where the mean is 0 and the SD is 1?

11 / 44

Likelihood of a Single Value

What is the likelihood of a value of 1.5 given a hypothesized Normal distribution where the mean is 0 and the SD is 1?

$\mathcal{L}(\mu = 0, \sigma = 1 \mid Data = 1.5) = \mathrm{dnorm}(1.5, \mu = 0, \sigma = 1)$
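
In R this is a single call to dnorm(); a minimal sketch:

# likelihood of observing 1.5 under a Normal(mean = 0, sd = 1)
dnorm(1.5, mean = 0, sd = 1)
# [1] 0.1295176

# the log-likelihood, which is what we will usually work with
dnorm(1.5, mean = 0, sd = 1, log = TRUE)
# [1] -2.043939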

12 / 44

A Likely Lecture

  1. Introduction to Likelihood

  2. Maximum Likelihood Estimation

  3. Maximum Likelihood and Linear Regression

  4. Comparing Hypotheses with Likelihood

13 / 44

The Maximum Likelihood Estimate is the value at which p(D|θ) - our likelihood function - is highest.

To find it, we search across various values of θ

14 / 44

MLE for Multiple Data Points

Let's say this is our data:

[1] 3.37697212 3.30154837 1.90197683 1.86959410 0.20346568 3.72057350
[7] 3.93912102 2.77062225 4.75913135 3.11736679 2.14687718 3.90925918
[13] 4.19637296 2.62841610 2.87673977 4.80004312 4.70399588 -0.03876461
[19] 0.71102505 3.05830349

We know that the data comes from a normal population with a σ of 1.... but we want to get the MLE of the mean.

$p(D \mid \theta) = \prod_{i} p(D_i \mid \theta) = \prod_{i} \mathrm{dnorm}(D_i, \mu, \sigma = 1)$
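
A minimal sketch of this calculation in R, assuming the 20 values above are stored in a vector called dat (a name used here just for illustration):

# log-likelihood of the whole data set at one candidate value of the mean
norm_loglik <- function(mu, x) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

# evaluate it over a grid of candidate means and take the best one
mu_grid <- seq(-1, 6, by = 0.001)
ll <- sapply(mu_grid, norm_loglik, x = dat)
mu_grid[which.max(ll)]   # the MLE - for a Normal with known sigma, this is the sample mean (~2.9)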

15 / 44

Likelihood At Different Choices of Mean, Visually

16 / 44

The Likelihood Surface

MLE = 2.896

17 / 44

The Log-Likelihood Surface

We use the log-likelihood as it avoids the numerical underflow of multiplying many small probabilities, and it is approximately χ² distributed.

18 / 44

The χ² Distribution

  • Distribution of sums of squares of k data points drawn from N(0,1)

  • k = Degrees of Freedom

  • Measures goodness of fit

  • A large probability density indicates a match between the squared differences of observations and their expectations

19 / 44

The χ² Distribution, Visually

20 / 44

Hey, Look, it's the Standard Error!

A drop of 0.49 log-likelihood units (half the 68% quantile of a χ² distribution with 1 df) gives the 68% CI, which is roughly ±1 SE, so....

21 / 44

Hey, Look, it's the 95% CI!

A drop of 1.92 log-likelihood units (half the 95% quantile of a χ² distribution with 1 df) gives the 95% CI, so....
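
Both cutoffs come straight from a χ² distribution with 1 degree of freedom; a quick check in R:

qchisq(0.68, df = 1) / 2   # ~0.49, the log-likelihood drop for a 68% CI (~1 SE)
qchisq(0.95, df = 1) / 2   # ~1.92, the log-likelihood drop for a 95% CI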

22 / 44

The Deviance: -2 * Log-Likelihood

  • Measure of fit. Smaller deviance = closer to perfect fit
  • We are minimizing now, just like minimizing sums of squares
  • Point deviance residuals have meaning
  • Point deviance of linear regression = mean square error!

23 / 44

A Likely Lecture

  1. Introduction to Likelihood

  2. Maximum Likelihood Estimation

  3. Maximum Likelihood and Linear Regression

  4. Comparing Hypotheses with Likelihood

24 / 44

Putting MLE Into Practice with Pufferfish

  • Pufferfish are toxic/harmful to predators

  • Batesian mimics gain protection from predation - why?

  • Evolved response to appearance?

  • Researchers tested with mimics varying in toxic pufferfish resemblance

25 / 44

This is our fitted relationship

26 / 44

Likelihood Function for Linear Regression




Will often see:

$\mathcal{L}(\theta \mid D) = \prod_{i=1}^{n} p(y_i \mid x_i; \beta_0, \beta_1, \sigma)$
27 / 44

Likelihood Function for Linear Regression: What it Means



$\mathcal{L}(\theta \mid Data) = \prod_{i=1}^{n} \mathcal{N}(Visits_i \mid \beta_0 + \beta_1 Resemblance_i, \sigma)$

where $\beta_0, \beta_1, \sigma$ are elements of $\theta$
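
As a sketch, this can be written directly in R as a negative log-likelihood function (negative because the optimizers on the coming slides minimize); the function and parameter names are just for illustration:

# negative log-likelihood for the regression
# theta = c(b0, b1, sigma); x = resemblance, y = visits
neg_ll <- function(theta, x, y) {
  if (theta[3] <= 0) return(1e10)                         # guard: sigma must be positive
  yhat <- theta[1] + theta[2] * x                         # data generating process
  -sum(dnorm(y, mean = yhat, sd = theta[3], log = TRUE))  # error generating process
}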

28 / 44

The Log-Likelihood Surface from Grid Sampling

29 / 44

Searching Likelihood Space: Algorithms

  • Grid sampling tooooo slow
  • Newton-Raphson (algorithmically implemented in nlm and optim with the BFGS method) uses derivatives
    • good for smooth surfaces & good start values
  • Nelder-Mead Simplex (optim’s default)
    • good for rougher surfaces, but slower (see the sketch after this list)
  • Maaaaaany more....
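
A minimal sketch of such a search with optim(), reusing the neg_ll function sketched a couple of slides back and the puffer data from the model fit below:

# Nelder-Mead search (optim's default) for the MLE of all three parameters
fit <- optim(par = c(b0 = 1, b1 = 1, sigma = 1),
             fn = neg_ll,
             x = puffer$resemblance,
             y = puffer$predators)
fit$par   # MLEs of intercept, slope, and sigma

The intercept and slope will match the glm() fit on the coming slides; the ML estimate of σ is slightly smaller than the residual standard error, as it divides by n rather than n - 2.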

30 / 44

Quantitative Model of Process Using Likelihood



Likelihood:
$Visits_i \sim \mathcal{N}(\widehat{Visits}_i, \sigma)$



Data Generating Process:
$\widehat{Visits}_i = \beta_0 + \beta_1 Resemblance_i$

31 / 44

Fit Your Model!

puffer_glm <- glm(predators ~ resemblance,
                  data = puffer,
                  family = gaussian(link = "identity"))
  • GLM stands for Generalized Linear Model

  • We specify the error distribution and a 1:1 link between our data generating process and the value plugged into the error generating process

  • If we had specified "log", it would be akin to a log transformation.... sort of
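
Once fit, the likelihood-based quantities come along for free, for example:

logLik(puffer_glm)        # the maximized log-likelihood
-2 * logLik(puffer_glm)   # -2 log-likelihood, the deviance scale used for model comparison
coef(puffer_glm)          # the MLEs of the coefficients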

32 / 44

The Same Diagnostics

33 / 44

Well Behaved Likelihood Profiles

  • To get a profile for a single parameter, we calculate the MLE of all other parameters at different estimates of our parameter of interest

  • This should produce a nice quadratic curve, as we saw before

  • This is how we get our CI and SE (although we usually assume a quadratic shape for speed)

  • BUT - with more complex models, we can get weird valleys, multiple optima, etc.

  • Common sign of a poorly fitting model - other diagnostics likely to fail as well
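
For a glm, the profiles and profile-based intervals come via MASS; a minimal sketch:

library(MASS)                # provides the profile() method for glm fits
prof <- profile(puffer_glm)  # holds each parameter at a range of values, re-optimizing the others
plot(prof)                   # profile plots like the ones on the next slides
confint(puffer_glm)          # profile-likelihood confidence intervals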

34 / 44

But - What do the Likelihood Profiles Look Like?

35 / 44

Are these nice symmetric slices?

Sometimes Easier to See with a Straight Line

τ = the signed square root of the change in deviance

36 / 44

Evaluate Coefficients

term        estimate std.error statistic p.value
(Intercept)    1.925     1.506     1.278   0.218
resemblance    2.989     0.571     5.232   0.000


The test statistic is a Wald z-test, assuming a well-behaved quadratic confidence interval.
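
As a check, that statistic is just estimate / standard error compared against a standard normal; recomputing for the slope from the table above:

z <- 2.989 / 0.571     # ~5.23, matching the statistic column
2 * pnorm(-abs(z))     # two-tailed p-value, effectively 0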

37 / 44

A Likely Lecture

  1. Introduction to Likelihood

  2. Maximum Likelihood Estimation

  3. Maximum Likelihood and Linear Regression

  4. Comparing Hypotheses with Likelihood

38 / 44

Applying Different Styles of Inference

  • Null Hypothesis Testing: What's the probability that things are not influencing our data?
    • Deductive
  • Model Comparison: Comparison of alternate hypotheses
    • Deductive or Inductive
  • Cross-Validation: How good are you at predicting new data?

    • Deductive
  • Probabilistic Inference: What's our degree of belief in the data?

    • Inductive
39 / 44

Can Compare p(data | H) for alternate Parameter Values

Compare p(D|θ1) versus p(D|θ2)

Likelihood Ratios


$G = \frac{\mathcal{L}(H_1 \mid D)}{\mathcal{L}(H_2 \mid D)}$

  • G is the ratio of Maximum Likelihoods from each model
  • Used to compare goodness of fit of different models/hypotheses
  • Most often, θ = MLE versus θ = 0

  • $-2\log(G)$ is χ² distributed

40 / 44

Likelihood Ratio Test

  • A new test statistic: $D = -2\log(G)$

  • $= 2[\log \mathcal{L}(H_2 \mid D) - \log \mathcal{L}(H_1 \mid D)]$

  • It's χ² distributed!

    • DF = Difference in # of Parameters
  • If H1 is the Null Model, we have support for our alternate model
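
A sketch of this in R using logLik() on two fitted models (the intercept-only model, int_only, is fit on the next slide). Note that for a Gaussian model the Analysis of Deviance table two slides on scales by the estimated dispersion, so its p-value will differ somewhat from this raw likelihood-ratio version:

D <- 2 * (as.numeric(logLik(puffer_glm)) - as.numeric(logLik(int_only)))
pchisq(D, df = 1, lower.tail = FALSE)   # df = difference in number of parameters
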
41 / 44

Likelihood Ratio Test for Regression

  • We compare our slope + intercept to a model fit with only an intercept!

  • Note, models must have the SAME response variable

int_only <- glm(predators ~ 1, data = puffer)
  • We then use Analysis of Deviance (ANODEV)
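
A minimal call that produces the Analysis of Deviance table on the next slide (test = "Chisq" is an equivalent spelling):

anova(int_only, puffer_glm, test = "LRT")
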
42 / 44

Our First ANODEV

Analysis of Deviance Table

Model 1: predators ~ 1
Model 2: predators ~ resemblance
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        19     422.95                          
2        18     167.80  1   255.15 1.679e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
43 / 44

When to Use Likelihood?

  • Great for complex models (beyond lm)

  • Great for anything with an objective function you can minimize

  • AND, even lm has a likelihood!

  • Ideal for model comparison

  • As we will see, Deviance has many uses...

44 / 44
