Maximum Likelihood Estimation and Likelihood Ratio Testing

1 / 44

Etherpad



https://etherpad.wikimedia.org/p/607-mle-2020

2 / 44

Applying Different Styles of Inference

  • Null Hypothesis Testing: What's the probability that things are not influencing our data?

    • Deductive
  • Model Comparison: Comparison of alternate hypotheses

    • Deductive or Inductive
  • Cross-Validation: How good are you at predicting new data?

    • Deductive
  • Probabilistic Inference: What's our degree of belief in the data?

    • Inductive
3 / 44


To Get There, We Need To Understand Likelihood and Deviance

5 / 44

A Likely Lecture

  1. Introduction to Likelihood

  2. Maximum Likelihood Estimation

  3. Maximum Likelihood and Linear Regression

  4. Comparing Hypotheses with Likelihood

6 / 44

Likelihood: how well data support a given hypothesis.

Note: Each and every parameter choice IS a hypothesis

7 / 44

Likelihood Defined



$\mathcal{L}(H|D) = p(D|H)$

where D is the data and H is the hypothesis (model), including both a data generating process and some choice of parameters (often called θ). The error generating process is inherent in the choice of probability distribution used for the calculation.

8 / 44

Thinking in Terms of Models and Likelihood

  • First we have a Data Generating Process

    • This is our hypothesis about how the world works

    • $\hat{y}_i = \beta_0 + \beta_1 x_i$

  • Then we have a likelihood of the data given this hypothesis

    • This allows us to calculate the likelihood of observing our data given the hypothesis

    • Called the Likelihood Function

    • $y_i \sim \mathcal{N}(\hat{y}_i, \sigma)$

9 / 44

All Kinds of Likelihood functions

  • Probability density functions are the most common

  • But, hey, $(y_i - \hat{y}_i)^2$ is one as well

  • Extremely flexible

  • The key is a function that has a minimum or maximum value, depending on your parameters

10 / 44

Likelihood of a Single Value

What is the likelihood of a value of 1.5 given a hypothesized Normal distribution where the mean is 0 and the SD is 1?

11 / 44

Likelihood of a Single Value

What is the likelihood of a value of 1.5 given a hypothesized Normal distribution where the mean is 0 and the SD is 1?

$\mathcal{L}(\mu = 0, \sigma = 1 \mid Data = 1.5) = \mathrm{dnorm}(1.5, \mu = 0, \sigma = 1)$
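
In R this is a single call to dnorm(); a minimal sketch:

# likelihood of observing 1.5 under a Normal(mean = 0, sd = 1)
dnorm(1.5, mean = 0, sd = 1)
# [1] 0.1295176

# the log-likelihood, which is what we will usually work with
dnorm(1.5, mean = 0, sd = 1, log = TRUE)
# [1] -2.043939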

12 / 44

A Likely Lecture

  1. Introduction to Likelihood

  2. Maximum Likelihood Estimation

  3. Maximum Likelihood and Linear Regression

  4. Comparing Hypotheses with Likelihood

13 / 44

The Maximum Likelihood Estimate is the value at which p(D|θ) - our likelihood function - is highest.

To find it, we search across various values of θ

14 / 44

MLE for Multiple Data Points

Let's say this is our data:

[1] 3.37697212 3.30154837 1.90197683 1.86959410 0.20346568 3.72057350
[7] 3.93912102 2.77062225 4.75913135 3.11736679 2.14687718 3.90925918
[13] 4.19637296 2.62841610 2.87673977 4.80004312 4.70399588 -0.03876461
[19] 0.71102505 3.05830349

We know that the data comes from a normal population with a σ of 1.... but we want to get the MLE of the mean.

$p(D \mid \theta) = \prod_{i} p(D_i \mid \theta) = \prod_{i} \mathrm{dnorm}(D_i, \mu, \sigma = 1)$
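
A minimal sketch of this calculation in R, assuming the 20 values above are stored in a vector called dat (a name used here just for illustration):

# log-likelihood of the whole data set at one candidate value of the mean
norm_loglik <- function(mu, x) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

# evaluate it over a grid of candidate means and take the best one
mu_grid <- seq(-1, 6, by = 0.001)
ll <- sapply(mu_grid, norm_loglik, x = dat)
mu_grid[which.max(ll)]   # the MLE - for a Normal with known sigma, this is the sample mean (~2.9)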

15 / 44

Likelihood At Different Choices of Mean, Visually

16 / 44

The Likelihood Surface

MLE = 2.896

17 / 44

The Log-Likelihood Surface

We use the log-likelihood as it avoids the numerical underflow of multiplying many small probabilities, and it is approximately χ² distributed.

18 / 44

The χ² Distribution

  • Distribution of sums of squares of k data points drawn from N(0,1)

  • k = Degrees of Freedom

  • Measures goodness of fit

  • A large probability density indicates a match between the squared differences of observations and their expectations

19 / 44

The χ² Distribution, Visually

20 / 44

Hey, Look, it's the Standard Error!

A drop of 0.49 log-likelihood units (half the 68% quantile of a χ² distribution with 1 df) gives the 68% CI, which is roughly ±1 SE, so....

21 / 44

Hey, Look, it's the 95% CI!

A drop of 1.92 log-likelihood units (half the 95% quantile of a χ² distribution with 1 df) gives the 95% CI, so....
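
Both cutoffs come straight from a χ² distribution with 1 degree of freedom; a quick check in R:

qchisq(0.68, df = 1) / 2   # ~0.49, the log-likelihood drop for a 68% CI (~1 SE)
qchisq(0.95, df = 1) / 2   # ~1.92, the log-likelihood drop for a 95% CI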

22 / 44

The Deviance: -2 * Log-Likelihood

  • Measure of fit. Smaller deviance = closer to perfect fit
  • We are minimizing now, just like minimizing sums of squares
  • Point deviance residuals have meaning
  • Point deviance of linear regression = mean square error!

23 / 44

A Likely Lecture

  1. Introduction to Likelihood

  2. Maximum Likelihood Estimation

  3. Maximum Likelihood and Linear Regression

  4. Comparing Hypotheses with Likelihood

24 / 44

Putting MLE Into Practice with Pufferfish

  • Pufferfish are toxic/harmful to predators

  • Batesian mimics gain protection from predation - why?

  • Evolved response to appearance?

  • Researchers tested with mimics varying in toxic pufferfish resemblance

25 / 44

This is our fitted relationship

26 / 44

Likelihood Function for Linear Regression




Will often see:

$\mathcal{L}(\theta \mid D) = \prod_{i=1}^{n} p(y_i \mid x_i; \beta_0, \beta_1, \sigma)$
27 / 44

Likelihood Function for Linear Regression: What it Means



$\mathcal{L}(\theta \mid Data) = \prod_{i=1}^{n} \mathcal{N}(Visits_i \mid \beta_0 + \beta_1 Resemblance_i, \sigma)$

where $\beta_0, \beta_1, \sigma$ are elements of $\theta$
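
As a sketch, this can be written directly in R as a negative log-likelihood function (negative because the optimizers on the coming slides minimize); the function and parameter names are just for illustration:

# negative log-likelihood for the regression
# theta = c(b0, b1, sigma); x = resemblance, y = visits
neg_ll <- function(theta, x, y) {
  if (theta[3] <= 0) return(1e10)                         # guard: sigma must be positive
  yhat <- theta[1] + theta[2] * x                         # data generating process
  -sum(dnorm(y, mean = yhat, sd = theta[3], log = TRUE))  # error generating process
}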

28 / 44

The Log-Likelihood Surface from Grid Sampling

29 / 44

Searching Likelihood Space: Algorithms

  • Grid sampling tooooo slow
  • Newton-Raphson (algorithmically implemented in nlm and optim with the BFGS method) uses derivatives
    • good for smooth surfaces & good start values
  • Nelder-Mead Simplex (optim’s default)
    • good for rougher surfaces, but slower (see the sketch after this list)
  • Maaaaaany more....
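
A minimal sketch of such a search with optim(), reusing the neg_ll function sketched a couple of slides back and the puffer data from the model fit below:

# Nelder-Mead search (optim's default) for the MLE of all three parameters
fit <- optim(par = c(b0 = 1, b1 = 1, sigma = 1),
             fn = neg_ll,
             x = puffer$resemblance,
             y = puffer$predators)
fit$par   # MLEs of intercept, slope, and sigma

The intercept and slope will match the glm() fit on the coming slides; the ML estimate of σ is slightly smaller than the residual standard error, as it divides by n rather than n - 2.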

30 / 44

Quantitative Model of Process Using Likelihood



Likelihood:
$Visits_i \sim \mathcal{N}(\widehat{Visits}_i, \sigma)$



Data Generating Process:
$\widehat{Visits}_i = \beta_0 + \beta_1 Resemblance_i$

31 / 44

Fit Your Model!

puffer_glm <- glm(predators ~ resemblance,
                  data = puffer,
                  family = gaussian(link = "identity"))
  • GLM stands for Generalized Linear Model

  • We specify the error distribution and a 1:1 link between our data generating process and the value plugged into the error generating process

  • If we had specified "log", it would be akin to a log transformation.... sort of
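
Once fit, the likelihood-based quantities come along for free, for example:

logLik(puffer_glm)        # the maximized log-likelihood
-2 * logLik(puffer_glm)   # -2 log-likelihood, the deviance scale used for model comparison
coef(puffer_glm)          # the MLEs of the coefficients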

32 / 44

The Same Diagnostics

33 / 44

Well Behaved Likelihood Profiles

  • To get a profile for a single parameter, we calculate the MLE of all other parameters at different estimates of our parameter of interest

  • This should produce a nice quadratic curve, as we saw before

  • This is how we get our CI and SE (although we usually assume a quadratic shape for speed)

  • BUT - with more complex models, we can get weird valleys, multiple optima, etc.

  • Common sign of a poorly fitting model - other diagnostics likely to fail as well
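
For a glm, the profiles and profile-based intervals come via MASS; a minimal sketch:

library(MASS)                # provides the profile() method for glm fits
prof <- profile(puffer_glm)  # holds each parameter at a range of values, re-optimizing the others
plot(prof)                   # profile plots like the ones on the next slides
confint(puffer_glm)          # profile-likelihood confidence intervals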

34 / 44

But - What do the Likelihood Profiles Look Like?

35 / 44

Are these nice symmetric slices?

Sometimes Easier to See with a Straight Line

τ = the signed square root of the change in deviance

36 / 44

Evaluate Coefficients

term        estimate std.error statistic p.value
(Intercept)    1.925     1.506     1.278   0.218
resemblance    2.989     0.571     5.232   0.000


The test statistic is a Wald z-test, assuming a well-behaved quadratic confidence interval.
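
As a check, that statistic is just estimate / standard error compared against a standard normal; recomputing for the slope from the table above:

z <- 2.989 / 0.571     # ~5.23, matching the statistic column
2 * pnorm(-abs(z))     # two-tailed p-value, effectively 0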

37 / 44

A Likely Lecture

  1. Introduction to Likelihood

  2. Maximum Likelihood Estimation

  3. Maximum Likelihood and Linear Regression

  4. Comparing Hypotheses with Likelihood

38 / 44

Applying Different Styles of Inference

  • Null Hypothesis Testing: What's the probability that things are not influencing our data?
    • Deductive
  • Model Comparison: Comparison of alternate hypotheses
    • Deductive or Inductive
  • Cross-Validation: How good are you at predicting new data?

    • Deductive
  • Probabilistic Inference: What's our degree of belief in the data?

    • Inductive
39 / 44

Can Compare p(data | H) for alternate Parameter Values

Compare p(D|θ1) versus p(D|θ2)

Likelihood Ratios


$G = \frac{\mathcal{L}(H_1 \mid D)}{\mathcal{L}(H_2 \mid D)}$

  • G is the ratio of Maximum Likelihoods from each model
  • Used to compare goodness of fit of different models/hypotheses
  • Most often, θ = MLE versus θ = 0

  • $-2\log(G)$ is χ² distributed

40 / 44

Likelihood Ratio Test

  • A new test statistic: $D = -2\log(G)$

  • $= 2[\log \mathcal{L}(H_2 \mid D) - \log \mathcal{L}(H_1 \mid D)]$

  • It's χ² distributed!

    • DF = Difference in # of Parameters
  • If H1 is the Null Model, we have support for our alternate model
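
A sketch of this in R using logLik() on two fitted models (the intercept-only model, int_only, is fit on the next slide). Note that for a Gaussian model the Analysis of Deviance table two slides on scales by the estimated dispersion, so its p-value will differ somewhat from this raw likelihood-ratio version:

D <- 2 * (as.numeric(logLik(puffer_glm)) - as.numeric(logLik(int_only)))
pchisq(D, df = 1, lower.tail = FALSE)   # df = difference in number of parameters
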
41 / 44

Likelihood Ratio Test for Regression

  • We compare our slope + intercept to a model fit with only an intercept!

  • Note, models must have the SAME response variable

int_only <- glm(predators ~ 1, data = puffer)
  • We then use Analysis of Deviance (ANODEV)
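
A minimal call that produces the Analysis of Deviance table on the next slide (test = "Chisq" is an equivalent spelling):

anova(int_only, puffer_glm, test = "LRT")
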
42 / 44

Our First ANODEV

Analysis of Deviance Table

Model 1: predators ~ 1
Model 2: predators ~ resemblance
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        19     422.95                          
2        18     167.80  1   255.15 1.679e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
43 / 44

When to Use Likelihood?

  • Great for complex models (beyond lm)

  • Great for anything with an objective function you can minimize

  • AND, even lm has a likelihood!

  • Ideal for model comparison

  • As we will see, Deviance has many uses...

44 / 44
