At its core, with likelihood we are searching parameter space, attempting to maximize the likelihood of a model given the data. Let’s begin with a simple one-parameter model of how the world works and a single data point. Sure, it’s trivial, but it demonstrates the core properties of likelihood.
Let’s assume we are sampling a field of flowers. As this is count data, we have assumed a Poisson distribution. We throw our plot down, and find 10 flowers. What is the MLE of \(\lambda\), the mean (and variance) of our Poisson distribution?
As we’ve assumed our data is Poisson distributed, our **likelihood function** is that of a Poisson distribution. Now, we could write out the formula for the probability of a data point given a Poisson distribution (note \(L(H|D) = p(D|H)\)), but, hey, these are just the probability density functions of each distribution! So, in this case, `dpois(x, lambda)`.
Now, how do we search parameter space? We’ve got a Poisson distribution and an observation of 10. A reasonable set of values could be from 0 to 20. Why not!
At this point, we can simply vectorize our search for a maximum likelihood.
dpois(x = 10, lambda = 0:20)
## [1] 0.000000e+00 1.013777e-07 3.818985e-05 8.101512e-04 5.292477e-03
## [6] 1.813279e-02 4.130309e-02 7.098327e-02 9.926153e-02 1.185801e-01
## [11] 1.251100e-01 1.193781e-01 1.048373e-01 8.587015e-02 6.628184e-02
## [16] 4.861075e-02 3.409770e-02 2.299958e-02 1.498516e-02 9.466247e-03
## [21] 5.816307e-03
Well, great. We can squint at that output and find our Maximum Likelihood. But that’s a pain in the butt. We can instead use R intelligently to grab estimates for us.
lik_pois_10 <- dpois(x = 10, lambda = 0:20)
#maximum likelihood
max(lik_pois_10)
## [1] 0.12511
Yay! We have a likelihood. But…what’s the estimate? Now, we can use `which()` to get the index of the MLE in our vector of lambda values. But this gets…clunky. Here’s an example, just for your delectation.
#Index of the MLE
idx_mle <- which(lik_pois_10 == max(lik_pois_10))
idx_mle
## [1] 11
#OK, what's the MLE?
c(0:20)[idx_mle]
## [1] 10
And, of course, it would have been better practice to have a vector of lambda values to work on, etc., but, woof. This gets ugly and code-y fast.
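(As an aside, base R’s `which.max()` collapses that whole index-then-subset dance into one step. A minimal sketch:)

#which.max() returns the index of the (first) maximum,
#so we can grab the MLE in one line
lambdas <- 0:20
lambdas[which.max(dpois(x = 10, lambda = lambdas))]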
## dplyr
Why not keep all of our information in one nice tidy place and use `dplyr` to get easy answers? We can make a data frame with a set of lambda values, calculate the likelihood, and then filter down to the MLE for our answer.
library(dplyr)
pois_10 <- data.frame(lambda_vals = 0:20) %>%
  mutate(likelihood = dpois(x = 10, lambda = lambda_vals))

#Get the MLE
pois_10 %>%
  filter(likelihood == max(likelihood))
## lambda_vals likelihood
## 1 10 0.12511
We can even use this to make a nice plot.
library(ggplot2)

ggplot(pois_10, aes(x=lambda_vals, y=likelihood)) +
  geom_point()
So, notice that we have a lot of values close to 0? Lots of room for rounding error and weird peaks. And while this distribution was nicely shaped, due to these issues, the log-likelihood often looks much nicer (and leads to good test statistic performance).
How do we add that? We simply add the `log = TRUE` argument to any probability density function.
pois_10 <- pois_10 %>%
  mutate(log_likelihood = dpois(x = 10, lambda = lambda_vals, log=TRUE))

ggplot(pois_10, aes(x=lambda_vals, y=log_likelihood)) +
  geom_point()
Note, I’m not logging after the fact. That’s because we’ll get different answers if we take the log inside the density function versus logging a really, really small value after it’s been spit out - floating point precision gets lost.
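To see why, here’s a quick demonstration with a deliberately extreme observation. The density underflows to zero, so logging after the fact gives `-Inf`:

#the density is smaller than floating point can represent,
#so it underflows to 0
dpois(x = 500, lambda = 10)

#and log(0) is -Inf
log(dpois(x = 500, lambda = 10))

#logging inside the function keeps everything finite
dpois(x = 500, lambda = 10, log = TRUE)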
This is all well and good, but, as we move into more complex models and more data, we’re not going to be able to include a simple `dpois()` statement or otherwise in our mutate. Instead, we need to think of the model as a whole, and write out a likelihood generating function. This will have other advantages, too, as we do things like profile, etc. The general structure of a likelihood function should be as follows:
my_likelihood_function <- function(observations, predictors, parameters){
  #a data generating process that makes
  #fitted values of your response variable

  #a likelihood function that compares observations
  #to predicted values
}
For example, in the case of our Poisson data, we can write something like so:
pois_likelihood <- function(y, lambda){
  # Data Generating Process
  y_hat <- lambda

  #likelihood function
  sum(dpois(y, lambda = y_hat, log = TRUE))
}
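As a quick sanity check, for our single observation of 10 flowers at a lambda of 10, this should just give us the log of the 0.12511 we found before:

#should be log(0.12511), give or take rounding
pois_likelihood(y = 10, lambda = 10)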
Why include the sum? Well, in this case, it makes the function scalable to look at many data points, as we will see below! If we were looking at the straight likelihood, we’d use `prod()` instead. But, as we’ve talked about, go with ’dat log-likelihood!
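If you want to convince yourself the two are equivalent, here’s a sketch with a few made-up observations. The log of the product matches the sum of the logs - at least while nothing underflows:

y <- c(8, 10, 12)

#the straight likelihood
prod(dpois(y, lambda = 10))

#log of the product and sum of the logs give the same answer
log(prod(dpois(y, lambda = 10)))
sum(dpois(y, lambda = 10, log = TRUE))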
Here’s the beauty of a data set. The only two differences between the workflow for one point and many are, first, that you use either `prod()` (for likelihood) or `sum()` (for log-likelihood) to get the total value. Second, as the density functions don’t take kindly to a vector of data and a vector of parameters, we’ll use `rowwise()` to iterate over rows, but `ungroup()` after for other operations. As we’ve written our likelihood function, we’re really in good shape to just nail this.
Let’s try this for a random dataset that we generate from a Poisson distribution with a lambda of 10.
Note the tiny tiny changes in the calculation of the likelihood and log-likelihood.
set.seed(607)
pois_data <- rpois(10, lambda=10)
pois_mle <- data.frame(lambda_vals = 0:20) %>%
  rowwise() %>%
  mutate(log_likelihood = pois_likelihood(y = pois_data, lambda = lambda_vals)) %>%
  ungroup()
#Plot the surface
qplot(lambda_vals, log_likelihood, data=pois_mle)
#Get the MLE
pois_mle %>%
  filter(log_likelihood == max(log_likelihood))
## # A tibble: 1 x 2
## lambda_vals log_likelihood
## <int> <dbl>
## 1 9 -23.3
OK, now for the CI. Remember our assumption that the 95% CI of each parameter includes all values within 1.92 of the maximum log-likelihood. Again, pretty straightforward to get with filtering.
pois_mle %>%
  filter(log_likelihood > (max(log_likelihood) - 1.92))
## # A tibble: 3 x 2
## lambda_vals log_likelihood
## <int> <dbl>
## 1 8 -23.7
## 2 9 -23.3
## 3 10 -24.1
So, our CI is between 8 and 10.
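(If you’re wondering where that 1.92 comes from, it’s half the 95% critical value of a \(\chi^2\) distribution with one degree of freedom:)

#half the chi-square critical value - roughly 1.92
qchisq(0.95, df = 1) / 2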
**Exercise:** Try the above, but using `dbinom()` here for the binomial distribution.

The great thing about multiple parameters is that, heck, it’s no different than one parameter. Let’s use the seals example from last week and look at estimating the mean and SD of the seal age distribution. To start with, what’s the range of values we should choose to test? Let’s look at the data!
seals <- read.csv("data/17e8ShrinkingSeals Trites 1996.csv")
hist(seals$age.days)
Eyeballing this, I’d say the mean is between 3720 and 3740 and the SD is…oh, between 1280 and 1310.
OK, we’re going to need a likelihood function for this - one that takes a mean and a SD. Let’s do this!
norm_likelihood <- function(obs, mean_est, sd_est){
  #data generating process
  est <- mean_est

  #log likelihood
  sum(dnorm(obs, mean = est, sd = sd_est, log = TRUE))
}
Now, we can use `crossing()` from `tidyr` to generate a set of parameters to test, and then literally nothing is different from the workflow before. Let’s just look at the log-likelihood.
library(tidyr)

seal_dist <- crossing(m = seq(3720, 3740, by = 0.1),
                      s = seq(1280, 1310, by = 0.1)) %>%
  rowwise() %>%
  mutate(log_lik = norm_likelihood(obs = seals$age.days, mean_est = m, sd_est = s)) %>%
  ungroup()
Notice that took a bit. That’s because we searched 60,501 combinations of values. And we didn’t even do that with high precision! And I gave you a really narrow band of possible values! If anything can convince you that we need an algorithmic solution rather than brute force, I hope this does!
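As a preview of that algorithmic approach, here’s a minimal sketch using base R’s `optim()`. Note that `optim()` minimizes, so we hand it the negative log-likelihood - the `neg_norm_ll` wrapper below is just an illustrative helper, not something we’ve built before:

#optim() minimizes, so wrap our likelihood function
#to return the negative log-likelihood
neg_norm_ll <- function(par, obs){
  -norm_likelihood(obs = obs, mean_est = par[1], sd_est = par[2])
}

#give it rough starting values and let it do the searching
optim(par = c(3000, 1000), fn = neg_norm_ll, obs = seals$age.days)$par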
What’s the MLE?
seal_dist %>%
  filter(log_lik == max(log_lik))
## # A tibble: 1 x 3
## m s log_lik
## <dbl> <dbl> <dbl>
## 1 3730. 1293. -82964.
And can we plot a super-sexy contour plot of the surface?
ggplot(seal_dist, aes(x=m, y=s, z=log_lik)) +
  geom_contour(aes(color = ..level..))
Note that odd color thing enables you to see the contours. Yeah, I don’t like it, either.
So, how do we get profile likelihoods from this mess?
Simply put, for each parameter we want the profile of, we find the MLE at each of its values, maximizing over all values of the other parameter. So, group by the parameter you’re profiling and filter out the MLE within each group. So, for the SD -
seal_sd_profile <- seal_dist %>%
  group_by(s) %>%
  filter(log_lik == max(log_lik)) %>%
  ungroup()
qplot(s, log_lik, data=seal_sd_profile, geom = "line")
OK, nice profile.
So, what about the CI? Unfortunately, even after filtering, we have a lot of values. We want to just look at the first and last. Weirdly, `filter()` lets us use `row_number()` as something to work on. So, let’s take advantage of that here!
##See the CI
seal_sd_profile %>%
  filter(log_lik > max(log_lik) - 1.92) %>%
  filter(row_number() == 1 | row_number() == n())
## # A tibble: 2 x 3
## m s log_lik
## <dbl> <dbl> <dbl>
## 1 3730. 1280 -82965.
## 2 3730. 1310 -82966.
So, now we see the CI of our SD is from 1280 to 1310 - at least from this grid.
**Exercise:** Try this for the mean as well. Use `mean()` and `sd()` to derive reasonable parameter ranges for your likelihood surface so you don’t take forever.

## glm
So, you want to fit a model with multiple parameters using likelihood? We’ve seen that brute force is not the way. We need algorithms to search parameter space. R has multiple optimizer functions that seek to find the minimum value of a function (a negative log-likelihood or otherwise). Many require you to write your own likelihood function in order to fit, and these are incredibly flexible. But, for many workaday needs, we will use the function `glm()`, which stands for Generalized Linear Model (more on that at the end of the course) and uses the iteratively reweighted least squares algorithm.
The great thing about `glm()` is that it is incredibly similar to `lm()` - only now you have to specify what distribution you’re drawing from, in order to get the likelihood correct. You should also specify the shape of the curve, as many different curves are possible now. Let’s take a look.
seals <- read.csv("./data/17e8ShrinkingSeals Trites 1996.csv")
seal_mle <- glm(length.cm ~ age.days,
                family = gaussian(link = "identity"),
                data = seals)
This looks pretty similar! Only, notice now we have this `family` argument for a distribution. We’re going to go all `gaussian` all the time for now. We also specify what’s called a `link` function - basically, how do we translate linear predictors into the shape of a curve? For now, we can just use `"identity"`, which means a 1:1 translation between our linear predictors and the shape of the curve.
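One way to see what `"identity"` means in practice: predictions on the link scale and on the response scale are exactly the same thing for this model.

#with an identity link, the linear predictor IS the response scale
all.equal(predict(seal_mle, type = "link"),
          predict(seal_mle, type = "response"))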
So now that we have this fit object, how do we test our assumptions? Let’s remember what they are: no pattern in the residuals versus the fitted values, normally distributed residuals, and well-behaved likelihood profiles for each parameter.

Let’s break these down one by one. There is no nice `plot()` method here, so we’ll have to do it by hand.
First, fitted and residual values. We can extract both of these from the model object and plot them using `predict()` and `residuals()`. Note, using `predict()` with no new data merely gets the fitted values.
fitted <- predict(seal_mle)
res <- residuals(seal_mle)
qplot(fitted, res)
If we wanted to add a spline, we could have used `stat_smooth()` with no method argument.
Second, qq! Well, we have our residuals, so we can use `qqnorm()` and `qqline()` as we did with t-tests. Or plot a histogram.
qqnorm(res)
qqline(res)
hist(res)
Again, looks good!
Now, what about looking at the surface? We don’t have a nice way of visualizing a 4-D surface, so, instead, we look at the profile likelihoods of each parameter. For that, we use the `profile()` function, which profiles any fit whose log-likelihood can be accessed with the `logLik()` function. Note, this might take some time.
library(MASS)
plot(profile(seal_mle))
Well, that’s not what we expect. What is going on here? In essence, `profile.glm()` takes the profile of a parameter, subtracts the optimized deviance (thus centering it at 0), and takes a signed square root. This produces a straight line from which it’s much easier to see deviations. Which is great in terms of diagnostics!
It’s not great if you want to see CIs and such. For that, we can use the `profileModel` library. Let’s take a look.
library(profileModel)
prof <- profileModel(seal_mle,
                     objective = "ordinaryDeviance")
## Profiling for parameter (Intercept) ... Done
## Profiling for parameter age.days ... Done
Here we see that we can request how we want the deviance transformed. In this case, not at all. Let’s see how this looks.
plot(prof)
Much nicer. To see the sampling grid used:
plot(prof, print.grid.points = TRUE)
We can also look at the CIs here naturally.
prof <- profileModel(seal_mle,
                     objective = "ordinaryDeviance",
                     quantile = qchisq(0.95, 1))
## Preliminary iteration .. Done
##
## Profiling for parameter (Intercept) ... Done
## Profiling for parameter age.days ... Done
plot(prof)
OK, now that we’ve done this and feel comfortable with our fit, what are our coefficients?
summary(seal_mle)
##
## Call:
## glm(formula = length.cm ~ age.days, family = gaussian(link = "identity"),
## data = seals)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -19.8700 -3.8279 0.0304 3.7541 21.6874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.158e+02 1.764e-01 656.37 <2e-16 ***
## age.days 2.371e-03 4.467e-05 53.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 32.26809)
##
## Null deviance: 402668 on 9664 degrees of freedom
## Residual deviance: 311807 on 9663 degrees of freedom
## AIC: 61009
##
## Number of Fisher Scoring iterations: 2
#or
broom::tidy(seal_mle)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 116. 0.176 656. 0
## 2 age.days 0.00237 0.0000447 53.1 0
First, let’s see how that compares to an `lm()` fit.
broom::tidy(lm(length.cm ~ age.days, data=seals))
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 116. 0.176 656. 0
## 2 age.days 0.00237 0.0000447 53.1 0
Not bad. Pretty close. Note that `glm()` uses something called a dispersion parameter, which is a general name for additional factors related to spread in the data. For a Gaussian model, it’s just the residual variance.
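We can check that ourselves - the dispersion reported by `summary()` should be just the residual deviance divided by the residual degrees of freedom:

#residual variance by hand - compare to the dispersion
#parameter reported by summary(seal_mle)
deviance(seal_mle) / df.residual(seal_mle)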
Note, the SEs are based on the assumption that the profiles are well behaved. If we want to get into more detail and change that assumption, we use `confint()`.
#based off of profile
confint(seal_mle)
## 2.5 % 97.5 %
## (Intercept) 1.154211e+02 1.161124e+02
## age.days 2.283001e-03 2.458117e-03
So, how do we compare models? Unfortunately, there’s not an automated way. This is likelihood, so you have to specify your null or alternate models. You have to think. In this case, a null model would only have an intercept - a mean - term.
seal_mle_null <- glm(length.cm ~ 1,
                     family = gaussian(link = "identity"),
                     data = seals)
We can now compare these models using a LRT. Note that we use the `anova()` function, and now have to specify what kind of test we want.
anova(seal_mle_null, seal_mle, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: length.cm ~ 1
## Model 2: length.cm ~ age.days
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9664 402668
## 2 9663 311807 1 90862 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes! They are different!
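If you want to peek under the hood, here’s a sketch of roughly what that test does for a gaussian model: take the drop in deviance, scale it by the dispersion, and compare it to a \(\chi^2\) distribution with 1 degree of freedom.

#drop in deviance between the null and alternate models
dev_diff <- deviance(seal_mle_null) - deviance(seal_mle)

#scale by the dispersion and compare to a chi-square with 1 df
pchisq(dev_diff / summary(seal_mle)$dispersion, df = 1, lower.tail = FALSE)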
To plot, we can use the same strategy as before - only now we need to specify two things to `stat_smooth()`. First, our method is now `glm`. Second, we need to tell it about the family and link we’re using. `stat_smooth()` takes a list of arguments that are passed to the method. So, let’s try it out.
ggplot(seals,
       mapping = aes(x = age.days, y = length.cm)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = gaussian(link="identity")))
Works!
Let’s revisit the examples from our lm lab the other day.
First, the relationship between how lean you are and how quickly you lose fat. Implement this to get a sense of the general workflow for analysis.
fat <- read.csv("../data/17q04BodyFatHeatLoss Sloan and Keatinge 1973 replica.csv")

#initial visualization to determine if lm is appropriate
fat_plot <- ggplot(data=fat, aes(x=leanness, y=lossrate)) +
  geom_point()
fat_plot

#fit that model
fat_mod <- glm(lossrate ~ leanness,
               family = gaussian(link = "identity"),
               data=fat)

#assumptions
fat_fit <- predict(fat_mod)
fat_res <- residuals(fat_mod)

qplot(fat_fit, fat_res)

qqnorm(fat_res)
qqline(fat_res)

plot(profile(fat_mod))

#LRT test of model
fat_mod_null <- glm(lossrate ~ 1,
                    family = gaussian(link = "identity"),
                    data=fat)

anova(fat_mod_null, fat_mod, test = "LRT")

#t-tests of parameters
summary(fat_mod)
For your first faded example, let’s look at the relationship between DEET and mosquito bites.
deet <- read.csv("../data/17q24DEETMosquiteBites.csv")

deet_plot <- ggplot(data=___, aes(x=dose, y=bites)) +
  geom_point()
deet_plot

#fit that model
deet_mod <- glm(bites ~ dose,
                family = gaussian(link = "______"),
                data = ___)

#assumptions
deet_fit <- predict(____)
deet_res <- residuals(____)

qplot(deet_fit, deet_res)

qqnorm(_____)
qqline(_____)

plot(profile(____))

#LRT test of model
deet_mod_null <- glm(bites ~ ___,
                     family = gaussian(link = "______"),
                     data = ___)

anova(___, _____, test = "LRT")

#t-tests of parameters
summary(___)
Do longer lived species also have larger home ranges? Let’s test this!
zoo <- read.csv("../data/17q02ZooMortality Clubb and Mason 2003 replica.csv")

zoo_plot <- ggplot(data=___, aes(x=mortality, y=homerange)) +
  ___()

#fit that model
zoo_mod <- glm(___ ~ ___,
               family = ____(____ = _____),
               data = ____)

#assumptions
zoo_fit <- ____(____)
zoo_res <- ____(____)

qplot(____, ____)

qqnorm(____)
qqline(____)

plot(_____(____))

#LRT-tests of model
zoo_mod_null <- glm(___ ~ ___,
                    family = ____(____ = _____),
                    data = ____)

anova(___, _____, ________)

#t-tests of parameters
_____(___)
For this last one, I want you to also try something. Try changing `link = "identity"` to `link = "log"`. What do you see? How does this compare to the log transformed model? Try also to `ggplot` the results with the two different links!
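As a hint for that last part, here’s a rough sketch of overlaying the two links on one plot, assuming you’ve loaded the zoo data as above. If the log link complains about starting values, you may need to pass a `start` argument along in `method.args`:

ggplot(zoo, aes(x = mortality, y = homerange)) +
  geom_point() +
  stat_smooth(method = "glm",
              method.args = list(family = gaussian(link = "identity")),
              color = "blue", fill = "blue") +
  stat_smooth(method = "glm",
              method.args = list(family = gaussian(link = "log")),
              color = "red", fill = "red")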