At its core, likelihood is a search through parameter space: we are looking for the parameter values that maximize the likelihood of our model given the data. Let’s begin with a simple one-parameter model of how the world works and a single data point. Sure, it’s trivial, but it demonstrates the core properties of likelihood.
Let’s assume we are sampling a field of flowers. As this is count data, we have assumed a Poisson distribution. We throw our plot down, and find 10 flowers. What is the MLE of \(\lambda\), the mean (and variance) of our Poisson distribution?
As we’ve assumed our data is Poisson distributed, our **likelihood function** is that of a Poisson distribution. Now, we could write out the formula for the probability of a data point given a Poisson distribution (note L(H|D) = p(D|H)), but, hey, these are just the probability density functions of each distribution! So, in this case, `dpois(x, lambda)`.
Now, how do we search parameter space? We’ve got a Poisson distribution and an observation of 10. A reasonable set of values could be from 0 to 20. Why not!
At this point, we can simply vectorize our search for a maximum likelihood.
dpois(x = 10, lambda = 0:20)
## [1] 0.000000e+00 1.013777e-07 3.818985e-05 8.101512e-04 5.292477e-03
## [6] 1.813279e-02 4.130309e-02 7.098327e-02 9.926153e-02 1.185801e-01
## [11] 1.251100e-01 1.193781e-01 1.048373e-01 8.587015e-02 6.628184e-02
## [16] 4.861075e-02 3.409770e-02 2.299958e-02 1.498516e-02 9.466247e-03
## [21] 5.816307e-03
Well, great. We can squint at that output and find our Maximum Likelihood. But that’s a pain in the butt. We can instead use R intelligently to grab estimates for us.
lik_pois_10 <- dpois(x = 10, lambda = 0:20)
#maximum likelihood
max(lik_pois_10)
## [1] 0.12511
Yay! We have a likelihood. But…what’s the estimate? Now, we can use `which()` to get the index of the MLE within our vector of lambda values. But this gets…clunky. Here’s an example, just for your delectation.
#Index of the MLE
idx_mle <- which(lik_pois_10 == max(lik_pois_10))
idx_mle
## [1] 11
#OK, what's the MLE?
c(0:20)[idx_mle]
## [1] 10
And, of course, it would have been better practice to have a vector of lambda values to work on, etc., but, woof. This gets ugly and code-y fast.
dplyr
Why not keep all of our information in one nice tidy place and use dplyr
to get easy answers? We can make a data frame with a set of lambda values, calculate the likelihood, and then filter down to the MLE for our answer.
library(dplyr)
pois_10 <- data.frame(lambda_vals = 0:20) %>%
mutate(likelihood = dpois(x = 10, lambda = lambda_vals))
#Get the MLE
pois_10 %>%
filter(likelihood == max(likelihood))
## lambda_vals likelihood
## 1 10 0.12511
We can even use this to make a nice plot.
library(ggplot2)

ggplot(pois_10, aes(x=lambda_vals, y=likelihood)) +
geom_point()
So, notice that we have a lot of values close to 0? Lots of room for rounding error and weird peaks. And while this distribution was nicely shaped, due to these issues the log-likelihood often looks much nicer (and leads to good test statistic performance).
How do we add that? We simply add the log=TRUE
argument to any probability density function.
pois_10 <- pois_10 %>%
mutate(log_likelihood = dpois(x = 10, lambda = lambda_vals, log=TRUE))
ggplot(pois_10, aes(x=lambda_vals, y=log_likelihood)) +
geom_point()
Note, I’m not logging after the fact. That’s because we’ll get different answers if we log within the function versus taking the log after a really, really small value is spit out - small enough that it can underflow to zero.
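To see why, here is a quick sketch with an extreme, made-up observation chosen just to force underflow:

```r
# The raw density underflows to 0 in double precision, so logging
# after the fact gives -Inf and loses all information.
dpois(x = 400, lambda = 10)
log(dpois(x = 400, lambda = 10))

# Logging inside the function computes the log-density directly,
# so the result stays finite and usable.
dpois(x = 400, lambda = 10, log = TRUE)
```

This is exactly the failure mode that bites when we multiply many small likelihoods together.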
Here’s the beauty of a data set. The only two differences between the workflow for one point and many are, first, that you use either prod()
(for likelihood) or sum()
(for log-likelihood) to get the total value. Second, as the density functions don’t take kindly to a vector of data and a vector of parameters, we’ll use rowwise()
to iterate over rows, but ungroup()
after for other operations.
Let’s try this for a random dataset that we generate from a Poisson distribution with a lambda of 10.
Note the tiny tiny changes in the calculation of the likelihood and log-likelihood.
set.seed(607)
pois_data <- rpois(10, lambda=10)
pois_mle <- data.frame(lambda_vals = 0:20) %>%
rowwise() %>%
mutate(likelihood = prod(dpois(x = pois_data, lambda = lambda_vals)),
log_likelihood = sum(dpois(x = pois_data, lambda = lambda_vals, log=TRUE))) %>%
ungroup()
#Plot the surface
qplot(lambda_vals, log_likelihood, data=pois_mle)
#Get the MLE
pois_mle %>%
filter(log_likelihood == max(log_likelihood))
## # A tibble: 1 × 3
## lambda_vals likelihood log_likelihood
## <int> <dbl> <dbl>
## 1 9 7.405967e-11 -23.32615
This example also shows why we use Log-Likelihood over Likelihood. Let’s look at the first few rows of the output:
pois_mle
## # A tibble: 21 × 3
## lambda_vals likelihood log_likelihood
## <int> <dbl> <dbl>
## 1 0 0.000000e+00 -Inf
## 2 1 4.363144e-60 -136.68191
## 3 2 6.130478e-38 -85.68496
## 4 3 8.721281e-27 -60.00403
## 5 4 3.910608e-20 -44.68801
## 6 5 5.989363e-16 -35.05138
## 7 6 2.525721e-13 -29.00708
## 8 7 8.928081e-12 -25.44182
## 9 8 5.141675e-11 -23.69106
## 10 9 7.405967e-11 -23.32615
## # ... with 11 more rows
Notice how vanishingly small the first few likelihoods are? Values that tiny invite underflow and rounding error, while the log-likelihoods stay on a workable scale where the differences are clear.
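As an aside, for the Poisson the MLE of \(\lambda\) has a closed form: it’s just the sample mean. A fine enough grid search will land right on it, as this sketch with made-up counts shows:

```r
# With a fine grid, the grid-search MLE converges on the sample mean -
# the analytic MLE of lambda for Poisson data.
x <- c(8, 11, 9, 12, 10)   # made-up counts; mean(x) is 10
grid <- seq(5, 15, by = 0.001)
log_lik <- sapply(grid, function(l) sum(dpois(x, lambda = l, log = TRUE)))
grid[which.max(log_lik)]   # lands on mean(x)
```

Our integer grid above, of course, can only get as close as the nearest whole number.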
OK, now recall our rule of thumb: the 95% CI contains all parameter values whose log-likelihood is within 1.92 of the maximum log-likelihood. Again, pretty straightforward to get with filtering.
pois_mle %>%
filter(log_likelihood > (max(log_likelihood) - 1.92) )
## # A tibble: 3 × 3
## lambda_vals likelihood log_likelihood
## <int> <dbl> <dbl>
## 1 8 5.141675e-11 -23.69106
## 2 9 7.405967e-11 -23.32615
## 3 10 3.575165e-11 -24.05442
So, our CI is between 8 and 10.
The great thing about multiple parameters is that, heck, it’s no different than one parameter. Let’s use the seals example from last week and look at estimating the mean and SD of the seal age distribution. To start with, what’s the range of values we should choose to test? Let’s look at the data!
seals <- read.csv("data/06/17e8ShrinkingSeals Trites 1996.csv")
hist(seals$age.days)
Eyeballing this, I’d say the mean is between 3700 and 3800 and the SD is…oh, between 1280 and 1310
Fair enough. We can use crossing
in tidyr
to generate a set of parameters to test, and then literally nothing is different from the workflow before. Let’s just look at the Log Likelihood.
library(tidyr)

seal_dist <- crossing(m = seq(3720, 3740, by = 0.1),
s=seq(1280, 1310, by = 0.1)) %>%
rowwise() %>%
mutate(log_lik = sum(dnorm(seals$age.days, mean = m, sd = s, log=TRUE))) %>%
ungroup()
Notice that took a bit. That’s because we searched 60,501 combinations of values. And we didn’t even do that with high precision! And I gave you a really narrow band of possible values! If anything can convince you that we need an algorithmic solution rather than brute force, I hope this does!
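This is exactly what R’s optimizers do. As a sketch of the idea (using simulated ages rather than the seal file, so it runs stand-alone), base R’s `optim()` finds the same answer in a few dozen function evaluations instead of 60,501:

```r
# Simulated stand-in for the seal ages
set.seed(607)
fake_ages <- rnorm(1000, mean = 3730, sd = 1293)

# optim() minimizes, so hand it the *negative* log-likelihood
neg_log_lik <- function(par) {
  -sum(dnorm(fake_ages, mean = par["m"], sd = par["s"], log = TRUE))
}

# starting well away from the truth; the algorithm walks downhill
fit <- optim(par = c(m = 3000, s = 1000), fn = neg_log_lik)
fit$par   # lands near the sample mean and (MLE) sd
```

No grid, no precision trade-off: the optimizer refines its answer as finely as its tolerance allows.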
What’s the MLE?
seal_dist %>%
filter(log_lik == max(log_lik))
## # A tibble: 1 × 3
## m s log_lik
## <dbl> <dbl> <dbl>
## 1 3730.2 1293.4 -82964.19
And can we plot a super-sexy contour plot of the surface?
ggplot(seal_dist, aes(x=m, y=s, z=log_lik)) +
geom_contour(aes(color = ..level..))
Note that the odd color argument enables you to see the contour levels.
So, how do we get profile likelihoods from this mess?
Simply put, to profile a parameter, we find the MLE over all the other parameters at each of its values. So, group by the parameter being profiled and filter to the maximum likelihood within each group. So, for the SD -
seal_sd_profile <- seal_dist %>%
group_by(s) %>%
filter(log_lik == max(log_lik)) %>%
ungroup()
qplot(s, log_lik, data=seal_sd_profile)
OK, nice profile.
So, what about the CI? Unfortunately, even after filtering, we have a lot of values. We want to look at just the first and last. Weirdly, filter()
lets us use row_number()
as something to work on. So, let’s take advantage of that here!
seal_sd_profile %>%
filter(log_lik > max(log_lik) - 1.92) %>%
filter(row_number()==1 | row_number()==n())
## # A tibble: 2 × 3
## m s log_lik
## <dbl> <dbl> <dbl>
## 1 3730.2 1280 -82965.25
## 2 3730.2 1310 -82965.74
So, now we see the CI of our SD is from 1280 to 1310 - at least from this grid. Since those are the edges of the grid we searched, the true CI may well extend beyond it.
(To pick ranges yourself, you can use mean() and sd() of the data to derive reasonable parameter ranges for a likelihood surface.)
bbmle
So, you want to fit a model with multiple parameters using likelihood? We’ve seen that brute force is not the way. We need algorithms to search parameter space. R has multiple optimizer functions that seek to find the minimum value of a function (likelihood or otherwise). bbmle
is Ben Bolker’s package that seeks to create an easy interface to these functions. You can tweak to your heart’s delight all of the details of the optimizer, but at the end of the day, much of the code is fairly straightforward.
mle2
The function at the core of bbmle
is mle2
. Let’s fit the seal regression as before, and look at what is different.
library(bbmle)

seals <- read.csv("data/06/17e8ShrinkingSeals Trites 1996.csv")
seal_mle <- mle2(length.cm ~ dnorm(b0 + b1 * age.days, resid_sd),
start = list(b0 = 5, b1 = 5, resid_sd = 500),
data = seals)
First, whoah! We had to specify dnorm
to say that this was a normal error distribution! In essence, we were specifying our likelihood function here explicitly. That’s new.
Second, we had to name our coefficients. Explicitly. And write out our model. No more canned routines for us.
Third, we had to specify a list of start values. Remember, we’re using an algorithm. So where do you start? This opens up a host of problems - you can fall into a local minimum. So, it’s always best to try different sets of start values to make sure that’s not a problem, particularly for more complex models.
Last, that should have thrown a bunch of warnings for you - these warnings indicate that the algorithm tried to explore non-possible values (negative SDs, actually). This happens.
Try different start values. Do you get different errors or warnings?
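One simple sanity check, sketched here with `optim()` on simulated data (so it runs without the seals file or bbmle): refit from several different starts and confirm that every run reaches the same minimum negative log-likelihood.

```r
set.seed(42)
y <- rnorm(50, mean = 10, sd = 2)

# negative log-likelihood; sd on the log scale so it can't go negative
nll <- function(par) {
  -sum(dnorm(y, mean = par[1], sd = exp(par[2]), log = TRUE))
}

# three deliberately different start values
starts <- list(c(0, 0), c(20, 2), c(-5, 1))
minima <- sapply(starts, function(s) optim(s, nll)$value)
minima  # the minima should (nearly) coincide if no run got stuck
</imports>
```

If one start lands on a noticeably worse value, suspect a local minimum and keep exploring.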
So now that we have this fit object, how do we test our assumptions? Let’s remember what they are:
Let’s break these down one by one. There is no nice plot()
method here, so we’ll have to do it by hand.
First, fitted and residuals. We can extract both of these from the model object and plot them using predict()
and residuals()
. Note, using predict()
with no new data merely gets the fitted values.
fitted <- predict(seal_mle)
res <- residuals(seal_mle)
qplot(fitted, res)
If we wanted to add a spline, we could have used stat_smooth()
with no method argument.
Second, qq! Well, we have our residuals, so we can use qqnorm()
and qqline()
as we did with t-tests. Or plot a histogram.
qqnorm(res)
qqline(res)
hist(res)
Again, looks good!
Now, what about looking at the surface? We don’t have a nice way of visualizing a 4-D surface, so, instead, we look at the profile likelihoods of each parameter. For that, we use the profile()
function, which can profile any fitted model whose log-likelihood is accessible via the logLik()
function. Note, this might take some time.
plot(profile(seal_mle))
This largely looks OK…except the profile for b0 looks somewhat suspicious. Welcome to the world of fiddling with likelihood. One issue here might be that it’s not searching a wide enough region. We can manually set the region to search using the std.err
argument. I’ll try setting that to 0.05, just to give it an area to search.
plot(profile(seal_mle, which="b0", std.err=0.05))
Voila. Explore ?profile
for other tricks here. It can get a little hairy.
OK, now that we’ve done this and feel comfortable with our fit, what are our coefficients?
summary(seal_mle)
## Maximum likelihood estimation
##
## Call:
## mle2(minuslogl = length.cm ~ dnorm(b0 + b1 * age.days, resid_sd),
## start = list(b0 = 5, b1 = 5, resid_sd = 500), data = seals)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## b0 1.1577e+02 5.2086e-08 2.2226e+09 < 2.2e-16 ***
## b1 2.3705e-03 1.4634e-05 1.6199e+02 < 2.2e-16 ***
## resid_sd 5.6799e+00 4.0853e-02 1.3903e+02 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 61003.05
#or
broom::tidy(seal_mle)
## term estimate std.error statistic p.value
## 1 b0 1.157668e+02 5.208578e-08 2.222618e+09 0
## 2 b1 2.370544e-03 1.463357e-05 1.619935e+02 0
## 3 resid_sd 5.679915e+00 4.085321e-02 1.390323e+02 0
First, let’s see how that compares to an lm()
fit.
broom::tidy(lm(length.cm ~ age.days, data=seals))
## term estimate std.error statistic p.value
## 1 (Intercept) 1.157668e+02 1.763751e-01 656.3668 0
## 2 age.days 2.370559e-03 4.467317e-05 53.0645 0
Not bad. Pretty close.
Note, the SEs are based on the assumption that the profiles are well behaved. If we want to get into more detail and change that assumption, we use confint()
.
#based off of profile
confint(seal_mle)
## 2.5 % 97.5 %
## b0 NA NA
## b1 0.002283002 0.002458117
## resid_sd 5.600775767 5.760936489
#based off of root-finding on the profile
confint(seal_mle, method="uniroot")
## 2.5 % 97.5 %
## b0 NA NA
## b1 NA NA
## resid_sd 5.600778 5.760932
#based off of quadratic assumption
confint(seal_mle, method="quad")
## 2.5 % 97.5 %
## b0 1.157668e+02 1.157668e+02
## b1 2.341863e-03 2.399225e-03
## resid_sd 5.599844e+00 5.759986e+00
You can see the last method produced similar results to searching the profile - and didn’t fall prey to the profile problem we hit earlier.
So, how do we compare models? Unfortunately, there’s not an automated way. This is likelihood, so you have to specify your null or alternate models. You have to think. In this case, a null model would only have an intercept - a mean - term.
seal_mle_null <- mle2(length.cm ~ dnorm(b0, resid_sd),
start = list(b0 = 5, resid_sd = 500),
data = seals)
We can now compare these models using a LRT.
anova(seal_mle, seal_mle_null)
## Likelihood Ratio Tests
## Model 1: seal_mle, length.cm~dnorm(b0+b1*age.days,resid_sd)
## Model 2: seal_mle_null, length.cm~dnorm(b0,resid_sd)
## Tot Df Deviance Chisq Df Pr(>Chisq)
## 1 3 61003
## 2 2 63475 2471.6 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes! They are different! If we wanted to do an AIC comparison:
AIC(seal_mle)
## [1] 61009.05
AIC(seal_mle_null)
## [1] 63478.69
Quite different! Even after adjusting for parsimony, we need age to understand what is going on here.
So, one thing that has been annoying up until now is these warnings we get due to our algorithm putting a negative value on our SD. Now, there are a number of different algorithms out there. Some even have constraints!
Now, when I started developing this example, I began with a resid_sd start value of 5 - close to the real one. I immediately got convergence errors and a note about setting SDs to 1. If I summarized the fit, I would get no SE for some coefficients.
summary(mle2(length.cm ~ dnorm(b0 + b1*age.days, resid_sd),
start = list(b0 = 5, b1=5, resid_sd = 5),
data = seals))
## Maximum likelihood estimation
##
## Call:
## mle2(minuslogl = length.cm ~ dnorm(b0 + b1 * age.days, resid_sd),
## start = list(b0 = 5, b1 = 5, resid_sd = 5), data = seals)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## b0 -1.8489e+00 NA NA NA
## b1 2.4710e-02 2.3809e+01 0.001 0.9992
## resid_sd 1.1218e+07 NA NA NA
##
## -2 log L: 331546.9
How do I solve this? One idea is to try different methods or optimizers. First, let’s try simulated annealing, which searches parameter space, but is slow.
#Simulated annealing
summary(mle2(length.cm ~ dnorm(b0 + b1*age.days, resid_sd),
start = list(b0 = 5, b1 = 5, resid_sd = 5),
data = seals, method="SANN"))
## Maximum likelihood estimation
##
## Call:
## mle2(minuslogl = length.cm ~ dnorm(b0 + b1 * age.days, resid_sd),
## start = list(b0 = 5, b1 = 5, resid_sd = 5), method = "SANN",
## data = seals)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## b0 8.4738e+00 8.0567e-04 10517.78 < 2.2e-16 ***
## b1 2.8356e-02 2.3753e-05 1193.78 < 2.2e-16 ***
## resid_sd 9.2150e+00 1.4122e-02 652.54 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 205155.7
A second option is to use a method that includes a constraint. L-BFGS-B allows for box constraints (set a lower and upper bound), so, we can set the lower bound of resid_sd to something very small - close to 0 even.
The bobyqa algorithm in the optimx
package also allows for box constraints.
#"L-BFGS-B
summary(mle2(length.cm ~ dnorm(b0 + b1*age.days, resid_sd),
start = list(b0 = 5, b1=5, resid_sd = 5),
data = seals, method="L-BFGS-B",
lower = c(b0 = -Inf, b1 = -Inf, resid_sd = 0.0001),
upper = c(b0 = Inf, b1 = Inf, resid_sd = Inf)))
## Maximum likelihood estimation
##
## Call:
## mle2(minuslogl = length.cm ~ dnorm(b0 + b1 * age.days, resid_sd),
## start = list(b0 = 5, b1 = 5, resid_sd = 5), method = "L-BFGS-B",
## data = seals, lower = c(b0 = -Inf, b1 = -Inf, resid_sd = 1e-04),
## upper = c(b0 = Inf, b1 = Inf, resid_sd = Inf))
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## b0 1.1577e+02 1.8689e-07 6.1945e+08 < 2.2e-16 ***
## b1 2.3705e-03 1.4634e-05 1.6199e+02 < 2.2e-16 ***
## resid_sd 5.6799e+00 4.0853e-02 1.3903e+02 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 61003.05
#bobyqa
library(optimx)
summary(mle2(length.cm ~ dnorm(b0 + b1*age.days, resid_sd),
start = list(b0 = 5, b1=5, resid_sd = 5),
data = seals,
optimizer = "optimx", method="bobyqa",
lower = c(b0 = -Inf, b1 = -Inf, resid_sd = 0.0001),
upper = c(b0 = Inf, b1 = Inf, resid_sd = Inf)))
## Maximum likelihood estimation
##
## Call:
## mle2(minuslogl = length.cm ~ dnorm(b0 + b1 * age.days, resid_sd),
## start = list(b0 = 5, b1 = 5, resid_sd = 5), method = "bobyqa",
## optimizer = "optimx", data = seals, lower = c(b0 = -Inf,
## b1 = -Inf, resid_sd = 1e-04), upper = c(b0 = Inf, b1 = Inf,
## resid_sd = Inf))
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## b0 5.8249e+01 3.3505e-02 1738.49 < 2.2e-16 ***
## b1 1.6140e-02 4.9428e-05 326.53 < 2.2e-16 ***
## resid_sd 1.8935e+01 1.1907e-01 159.02 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 85053.96
Well, L-BFGS-B works! Note, though, that bobyqa stopped at a much worse -2 log L here (85053.96 versus 61003.05) - one more reminder to compare fits across optimizers and always inspect convergence.
Which optimizer is right for you I will leave as a further topic for your research as you get into progressively complex likelihood methods.
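A third option worth knowing, sketched here with `optim()` and simulated data rather than the seal fit above: reparameterize. If we estimate log(sd) instead of sd, every real value the optimizer tries maps to a positive standard deviation, and no constraint is needed at all.

```r
set.seed(607)
y <- rnorm(200, mean = 116, sd = 5.7)   # simulated stand-in for seal lengths

# estimate log(sd); exp() guarantees a positive sd with no constraints
nll <- function(par) {
  -sum(dnorm(y, mean = par["mu"], sd = exp(par["log_sd"]), log = TRUE))
}

fit <- optim(c(mu = 100, log_sd = 1), nll)
c(mean = fit$par[["mu"]], sd = exp(fit$par[["log_sd"]]))
```

The same trick (a log link for strictly positive parameters) works inside mle2 formulas too, and sidesteps the negative-SD warnings entirely.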
Let’s revisit the examples from our lm lab the other day.
First, the relationship between how lean you are and how quickly you lose fat. Implement this to get a sense of the general workflow for analysis.
fat <- read.csv("./data/06/17q04BodyFatHeatLoss Sloan and Keatinge 1973 replica.csv")
#initial visualization to determine if lm is appropriate
fat_plot <- ggplot(data=fat, aes(x=leanness, y=lossrate)) +
geom_point()
fat_plot
#get some sensible start values
lm_coefs <- coef(lm(lossrate ~ leanness, data=fat))
fat_mod <- mle2(lossrate ~ dnorm(b0 + b1*leanness, resid_sd),
start = list(b0 = lm_coefs[1], b1 =lm_coefs[2], resid_sd = 4),
data = fat)
#assumptions
fat_fit <- predict(fat_mod)
fat_res <- residuals(fat_mod)
qplot(fat_fit, fat_res)
qqnorm(fat_res)
qqline(fat_res)
plot(profile(fat_mod))
#LRT test of model
fat_mod_null <- mle2(lossrate ~ dnorm(b0 , resid_sd),
start = list(b0 = 1, resid_sd = 4),
data = fat)
anova(fat_mod, fat_mod_null)
#t-tests of parameters
summary(fat_mod)
For your first faded example, let’s look at the relationship between DEET and mosquito bites.
deet <- read.csv("./data/06/17q24DEETMosquiteBites.csv")
deet_plot <- ggplot(data=___, aes(x=dose, y=bites)) +
geom_point()
deet_plot
#get start values
deet_coef <- coef(lm(bites ~ dose, data=deet))
deet_mod <- mle2(___ ~ dnorm(b0 + b1*dose, resid_sd),
start = list(b0 = deet_coef[1], b1 =deet_coef[2], resid_sd = 4),
data = ____)
#assumptions
deet_fit <- predict(____)
deet_res <- residuals(____)
qplot(deet_fit, deet_res)
qqnorm(_____)
qqline(_____)
plot(profile(____))
#LRT test of model
deet_mod_null <- mle2(____ ~ dnorm(b0, _____),
start = list(b0 = 1, resid_sd = 4),
data = deet)
anova(___, _____)
#t-tests of parameters
summary(___)
Do longer-lived species also have larger home ranges? Let’s test this!
zoo <- read.csv("./data/06/17q02ZooMortality Clubb and Mason 2003 replica.csv")
zoo_plot <- ggplot(data=___, aes(x=mortality, y=homerange)) +
___()
___
zoo_coef <- coef(___(___~___, data=___))
zoo_mod <- mle2(___ ~ dnorm(____ + ____*mortality, ____),
start = list(____ = zoo_coef[1], ____ =zoo_coef[2], ____ = 4),
data = ____)
#assumptions
zoo_fit <- ____(____)
zoo_res <- ____(____)
qplot(____, ____)
qqnorm(____)
qqline(____)
plot(_____(____))
#LRT test of model
zoo_mod_null <- mle2(___ ~ dnorm(____, ____),
start = list(____ = 1, ____ = 4),
data = ____)
anova(___,_____)
#z-tests of parameters
_____(___)