class: center, middle

# Maximum Likelihood Estimation and Likelihood Ratio Testing<br>

![:scale 55%](images/13/hey_gurl_liklihood.jpeg)

---
class: center, middle

# Etherpad

<br><br>
<center><h3>https://etherpad.wikimedia.org/p/607-mle-2020</h3></center>

---

# Applying Different Styles of Inference

- **Null Hypothesis Testing**: What's the probability of our data if nothing is influencing it?
    - Deductive

- **Model Comparison**: Comparison of alternate hypotheses
    - Deductive or Inductive

- **Cross-Validation**: How good are you at predicting new data?
    - Deductive

- **Probabilistic Inference**: What's our degree of belief in a hypothesis given our data?
    - Inductive

---

# Applying Different Styles of Inference

.grey[
- **Null Hypothesis Testing**: What's the probability of our data if nothing is influencing it?
    - Deductive
]

- **Model Comparison**: Comparison of alternate hypotheses
    - Deductive or Inductive

- **Cross-Validation**: How good are you at predicting new data?
    - Deductive

.grey[
- **Probabilistic Inference**: What's our degree of belief in a hypothesis given our data?
    - Inductive
]

---

# To Get There, We Need To Understand Likelihood and Deviance

.center[
![](./images/mmi/bjork-on-phone-yes-i-am-all-about-the-deviance-let-us-make-it-shrink-our-parameters.jpg)
]

---

# A Likely Lecture

1. Introduction to Likelihood

2. Maximum Likelihood Estimation

3. Maximum Likelihood and Linear Regression

4. Comparing Hypotheses with Likelihood

---
class: middle

# Likelihood: how well data support a given hypothesis.

--

### Note: Each and every parameter choice IS a hypothesis

---

# Likelihood Defined
<br><br>

`$$\Large L(H | D) = p(D | H)$$`

where D is the data and H is the hypothesis (model), including both a data generating process and some choice of parameters (often called `\(\theta\)`). The error generating process is inherent in the choice of probability distribution used for calculation.

---

# Thinking in Terms of Models and Likelihood

- First we have a **Data Generating Process**

    - This is our hypothesis about how the world works

    - `\(\hat{y}_i = \beta_0 + \beta_1 x_i\)`

--

- Then we have a likelihood of the data given this hypothesis

    - This allows us to calculate the likelihood of observing our data given the hypothesis

    - Called the **Likelihood Function**

    - `\(y_{i} \sim \mathcal{N}(\hat{y}_i, \sigma)\)`

---

# All Kinds of Likelihood Functions

- Probability density functions are the most common

--

- But, hey, `\(\sum(y_{i} - \hat{y}_i)^2\)` is one as well

--

- Extremely flexible

--

- The key is a function whose value you can minimize or maximize by varying your parameters

---

# Likelihood of a Single Value

What is the likelihood of a value of 1.5 given a hypothesized Normal distribution with a mean of 0 and an SD of 1?

<img src="mle_cv_files/figure-html/norm_lik-1.png" style="display: block; margin: auto;" />

---

# Likelihood of a Single Value

What is the likelihood of a value of 1.5 given a hypothesized Normal distribution with a mean of 0 and an SD of 1?

<img src="mle_cv_files/figure-html/norm_lik_2-1.png" style="display: block; margin: auto;" />

--

`$$\mathcal{L}(\mu = 0, \sigma = 1 | Data = 1.5) = dnorm(1.5, \mu = 0, \sigma = 1)$$`

---

# A Likely Lecture

1. Introduction to Likelihood

2. .red[Maximum Likelihood Estimation]

3. Maximum Likelihood and Linear Regression

4. Comparing Hypotheses with Likelihood

---
class: middle

## The Maximum Likelihood Estimate is the value at which `\(p(D | \theta)\)` - our likelihood function - is highest.

--

#### To find it, we search across various values of `\(\theta\)`

---

# MLE for Multiple Data Points

Let's say this is our data:

```
 [1]  3.37697212  3.30154837  1.90197683  1.86959410  0.20346568  3.72057350
 [7]  3.93912102  2.77062225  4.75913135  3.11736679  2.14687718  3.90925918
[13]  4.19637296  2.62841610  2.87673977  4.80004312  4.70399588 -0.03876461
[19]  0.71102505  3.05830349
```

--

We know that the data come from a normal population with a `\(\sigma\)` of 1.... but we want to get the MLE of the mean.

--

`\(p(D|\theta) = \prod p(D_i|\theta)\)`

--

= `\(\prod dnorm(D_i, \mu, \sigma = 1)\)`
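---

# Sidebar: That Search in Code

A minimal sketch of the grid search in R. The seed and simulated values below are hypothetical stand-ins, not the exact data from the previous slide (which were presumably drawn from something like N(3, 1)):

```r
# hypothetical stand-in for the data on the previous slide
set.seed(607)
obs <- rnorm(20, mean = 3, sd = 1)

# a grid of candidate values for the mean
mu_grid <- seq(1, 5, length.out = 200)

# log-likelihood at each candidate mean: the sum of log densities,
# i.e., log(prod(dnorm(obs, m, sd = 1)))
log_lik <- sapply(mu_grid,
                  function(m) sum(dnorm(obs, mean = m, sd = 1, log = TRUE)))

# the MLE is the candidate mean with the highest log-likelihood
mu_grid[which.max(log_lik)]
```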
---

# Likelihood At Different Choices of Mean, Visually

<img src="mle_cv_files/figure-html/ml_search-1.png" style="display: block; margin: auto;" />

---

# The Likelihood Surface

<img src="mle_cv_files/figure-html/lik_mean_surf-1.png" style="display: block; margin: auto;" />

MLE = 2.896

---

# The Log-Likelihood Surface

We use the log-likelihood because products of many small probabilities underflow numerically, and because twice the difference in log-likelihoods is approximately `\(\chi^2\)` distributed.

<img src="mle_cv_files/figure-html/loglik_surf-1.png" style="display: block; margin: auto;" />

---

# The `\(\chi^2\)` Distribution

- Distribution of sums of squares of k data points drawn from N(0,1)

- k = Degrees of Freedom

- Measures goodness of fit

- Small values of the statistic indicate a close match between observations and expectations

---

# The `\(\chi^2\)` Distribution, Visually

<img src="mle_cv_files/figure-html/chisq_dist-1.png" style="display: block; margin: auto;" />

---

# Hey, Look, it's the Standard Error!

A drop of 0.49 log-likelihood units from the maximum (half the 68th percentile of a `\(\chi^2_1\)`) spans the 68% CI - the standard error - so....

<img src="mle_cv_files/figure-html/loglik_zoom-1.png" style="display: block; margin: auto;" />

---

# Hey, Look, it's the 95% CI!

A drop of 1.92 log-likelihood units from the maximum (half the 95th percentile of a `\(\chi^2_1\)`) spans the 95% CI, so....

<img src="mle_cv_files/figure-html/ll_ci-1.png" style="display: block; margin: auto;" />

---

# The Deviance: -2 * Log-Likelihood

- Measure of fit. Smaller deviance = closer to perfect fit

- We are minimizing now, just like minimizing sums of squares

- Point deviance residuals have meaning

- The point deviance of a linear regression is the squared residual - summed, the sum of squared errors!

<img src="mle_cv_files/figure-html/show_dev-1.png" style="display: block; margin: auto;" />

---

# A Likely Lecture

1. Introduction to Likelihood

2. Maximum Likelihood Estimation

3. .red[Maximum Likelihood and Linear Regression]

4. Comparing Hypotheses with Likelihood

---

# Putting MLE Into Practice with Pufferfish

.pull-left[
- Pufferfish are toxic/harmful to predators
<br>
- Batesian mimics gain protection from predation - why?
<br><br>
- Evolved response to appearance?
<br><br>
- Researchers tested this using mimics that varied in their resemblance to the toxic pufferfish
]

.pull-right[
![:scale 80%](./images/11/puffer_mimics.jpg)
]

---

# This is our fit relationship

<img src="mle_cv_files/figure-html/puffershow-1.png" style="display: block; margin: auto;" />

---

# Likelihood Function for Linear Regression
<br><br><br>

<center>You will often see:<br><br>
`\(\large L(\theta | D) = \prod_{i=1}^n p(y_i\; | \; x_i;\ \beta_0, \beta_1, \sigma)\)`
</center>

---

# Likelihood Function for Linear Regression: What it Means
<br><br>

`$$L(\theta | Data) = \prod_{i=1}^n \mathcal{N}(Visits_i\; |\; \beta_{0} + \beta_{1} Resemblance_i, \sigma)$$`
<br><br>

where `\(\beta_{0}, \beta_{1}, \sigma\)` are elements of `\(\theta\)`

---

# The Log-Likelihood Surface from Grid Sampling

<img src="mle_cv_files/figure-html/reg_lik_surf-1.png" style="display: block; margin: auto;" />

---

# Searching Likelihood Space: Algorithms

.pull-left[

- Grid sampling tooooo slow

- Newton-Raphson (implemented algorithmically in `nlm` and in `optim` with the `BFGS` method) uses derivatives
    - good for smooth surfaces & good start values

- Nelder-Mead Simplex (`optim`'s default)
    - good for rougher surfaces, but slower - a sketch follows on the next slide

- Maaaaaany more....
]

--

.pull-right[
![](images/mle/likelihood_boromir.jpg)
]
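---

# Sidebar: `optim()` in Action

A sketch of a Nelder-Mead search for the regression MLEs, minimizing the negative log-likelihood. It assumes the `puffer` data frame used in the fits that follow, with columns `resemblance` and `predators`:

```r
# negative log-likelihood of the regression, with
# theta = (beta0, beta1, log(sigma))
neg_ll <- function(theta, x, y) {
  yhat  <- theta[1] + theta[2] * x   # data generating process
  sigma <- exp(theta[3])             # keeps sigma positive during the search
  -sum(dnorm(y, mean = yhat, sd = sigma, log = TRUE))
}

# Nelder-Mead (optim's default) from rough start values
fit <- optim(par = c(0, 0, 0), fn = neg_ll,
             x = puffer$resemblance, y = puffer$predators)

fit$par[1:2]     # MLEs of beta0 and beta1
exp(fit$par[3])  # MLE of sigma
```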
---

# Quantitative Model of Process Using Likelihood

<br><br>
**Likelihood:**
`\(Visits_i \sim \mathcal{N}(\hat{Visits_i}, \sigma)\)`

<br><br><br>
**Data Generating Process:**
`\(\hat{Visits_i} = \beta_{0} + \beta_{1} Resemblance_i\)`

---

# Fit Your Model!

```r
puffer_glm <- glm(predators ~ resemblance,
                  data = puffer,
                  family = gaussian(link = "identity"))
```

--

- GLM stands for **Generalized Linear Model**

--

- We specify the error distribution and a 1:1 link between our data generating process and the value plugged into the error generating process

--

- If we had specified "log", it would be akin to a log transformation.... sort of

---

# The *Same* Diagnostics

.pull-left[
<img src="mle_cv_files/figure-html/diag1-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="mle_cv_files/figure-html/diag2-1.png" style="display: block; margin: auto;" />
]

---

# Well Behaved Likelihood Profiles

- To get a profile for a single parameter, we find the MLE of all other parameters at each of several fixed values of our parameter of interest

- This *should* produce a nice quadratic curve, as we saw before

- This is how we get our CI and SE (although we usually assume a quadratic profile for speed)

- BUT - with more complex models, we can get weird valleys, multiple optima, etc.

- That is a common sign of a poorly fitting model - other diagnostics are likely to fail as well

---

# But - What do the Likelihood Profiles Look Like?

<img src="mle_cv_files/figure-html/profileR-1.png" style="display: block; margin: auto;" />

---

# Are these nice symmetric slices?

### Sometimes Easier to See with a Straight Line

`\(\tau\)` = the signed square root of the change in deviance, so a well behaved quadratic profile plots as a straight line

<img src="mle_cv_files/figure-html/profile-1.png" style="display: block; margin: auto;" />
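---

# Sidebar: Profiling in Code

A sketch of how one might draw these profiles and the CIs that follow from them. Depending on your R version, the GLM methods for `profile()` and `confint()` live in MASS or in stats:

```r
library(MASS)

# profile the likelihood for each parameter of the fit
prof <- profile(puffer_glm)

# tau versus parameter value, as in the plot above
plot(prof)

# profile-based 95% confidence intervals
confint(puffer_glm)
```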
---

# Evaluate Coefficients

<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 1.925 </td>
   <td style="text-align:right;"> 1.506 </td>
   <td style="text-align:right;"> 1.278 </td>
   <td style="text-align:right;"> 0.218 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> resemblance </td>
   <td style="text-align:right;"> 2.989 </td>
   <td style="text-align:right;"> 0.571 </td>
   <td style="text-align:right;"> 5.232 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
</tbody>
</table>

<br>
The test statistic is a Wald z-test, which assumes a well behaved quadratic likelihood profile - and hence a symmetric confidence interval

---

# A Likely Lecture

1. Introduction to Likelihood

2. Maximum Likelihood Estimation

3. Maximum Likelihood and Linear Regression

4. .red[Comparing Hypotheses with Likelihood]

---

# Applying Different Styles of Inference

.grey[
- **Null Hypothesis Testing**: What's the probability of our data if nothing is influencing it?
    - Deductive
]

- **Model Comparison**: Comparison of alternate hypotheses
    - Deductive or Inductive

.grey[
- **Cross-Validation**: How good are you at predicting new data?
    - Deductive

- **Probabilistic Inference**: What's our degree of belief in a hypothesis given our data?
    - Inductive
]

---

# Can Compare p(data | H) for alternate Parameter Values

<img src="mle_cv_files/figure-html/likelihoodDemo3-1.png" style="display: block; margin: auto;" />

Compare `\(p(D|\theta_{1})\)` versus `\(p(D|\theta_{2})\)`

---

## Likelihood Ratios
<br>

`$$\LARGE G = \frac{L(H_1 | D)}{L(H_2 | D)}$$`

- G is the ratio of *Maximum Likelihoods* from each model

--

- Used to compare the goodness of fit of different models/hypotheses

--

- Most often, `\(\theta\)` = MLE versus `\(\theta\)` = 0

--

- `\(-2 \log(G)\)` is `\(\chi^2\)` distributed

---

# Likelihood Ratio Test

- A new test statistic: `\(D = -2 \log(G)\)`

--

- `\(= 2 [\log(L(H_2 | D)) - \log(L(H_1 | D))]\)`

--

- It's `\(\chi^2\)` distributed!
    - DF = difference in # of parameters

--

- If `\(H_1\)` is the null model and D is large, we have support for our alternate model

---

# Likelihood Ratio Test for Regression

- We compare our slope + intercept model to a model fit with only an intercept!

- Note, models must have the SAME response variable

```r
int_only <- glm(predators ~ 1, data = puffer)
```

--

- We then use *Analysis of Deviance* (ANODEV) - the call itself is sketched at the end of this deck

---

# Our First ANODEV

```
Analysis of Deviance Table

Model 1: predators ~ 1
Model 2: predators ~ resemblance
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        19     422.95                          
2        18     167.80  1   255.15 1.679e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# When to Use Likelihood?

.pull-left[
- Great for **complex models** (beyond lm)

- Great for anything with an **objective function** you can minimize

- AND, even lm has a likelihood!

- Ideal for **model comparison**

- As we will see, Deviance has many uses...
]

.pull-right[
![](./images/mle/spell_likelihood.jpg)
]
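---
class: middle

# For Reference: The ANODEV Call

An Analysis of Deviance table like the one a few slides back can be produced with a call along these lines, given the two fits above. Here `test = "Chisq"` evaluates `\(-2 \log(G)\)` against a `\(\chi^2\)` distribution:

```r
# compare the intercept-only fit to the full fit
anova(int_only, puffer_glm, test = "Chisq")
```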