2. Logistic Regression

Let’s take a look at our mouse infection by Cryptosporidium example. Note that the data is bounded at 0 and 1.

mouse <- read_csv("lab/data/cryptoDATA.csv") %>%
  mutate(Proportion = Y/N)

mouse_plot <- ggplot(mouse,
       aes(x = Dose, y = Proportion)) +
  geom_point()

mouse_plot

That’s because while N is our total number of mice per sample, we cannot have more than N infections! In essence, each mouse is like a coin flip: did it get infected or not? And when we have many coin flips - Bernoulli trials - that leads to a binomial distribution.
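
To make that concrete, here’s a minimal sketch (the 0.3 infection probability is just an assumed value for illustration):

set.seed(607) # arbitrary seed, for reproducibility

# 15 mice, each an independent Bernoulli coin flip with a 0.3 chance of infection
flips <- rbinom(15, size = 1, prob = 0.3)
sum(flips) # number of infected mice in this sample

# the same thing as a single binomial draw
rbinom(1, size = 15, prob = 0.3)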

2.1 The Binomial Distribution

A binomial distribution has two parameters: the probability of success and the number of coin flips. The resulting counts are bounded between 0 and the number of flips - which is why, as proportions, our data is bounded between 0 and 1. Consider 15 coin flips with different probabilities.

bin_tibble <- tibble(outcome = rep(0:15, 2),
                     probability = c(rep(0.3, 16), rep(0.8, 16)),
                     dens = dbinom(outcome, size = 15, prob = probability))

ggplot(bin_tibble,
       aes(x = outcome, y = dens, fill = factor(probability))) +
  geom_col(position = position_dodge())

We could just as easily have rescaled the x-axis to between 0 and 1 for proportion.
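
For instance, a quick sketch - dividing the outcomes by the number of flips puts the x-axis on the proportion scale:

ggplot(bin_tibble,
       aes(x = outcome / 15, y = dens, fill = factor(probability))) +
  geom_col(position = position_dodge()) +
  labs(x = "proportion of successes")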

Or, consider the same probability, but a different number of coin flips.

bin_tibble <- tibble(outcome = rep(0:15, 2),
                     size = c(rep(15, 16), rep(30, 16)),
                     dens = dbinom(outcome, size = size, prob = 0.4))

ggplot(bin_tibble,
       aes(x = outcome, y = dens, fill = factor(size))) +
  geom_col(position = position_dodge())

You can see both parameters matter for the shape of the distribution.

2.2 Binomial Logistic Regression

With binomial logistic regression, we are essentially estimating the probability of getting heads. We then supply - not estimate - the number of trials as a weights argument. The canonical link - the logit link - is perfect here, as it describes a logistic curve that saturates at both 0 and 1. There are other links we can use. If none of our values get near 0 or 1, linear might even be appropriate. Or log. The probit link, for example, describes the saturating shape of the cumulative distribution function for a normal distribution, and has real meaning with respect to survivorship - although it approaches 0 and 1 very quickly. There are a wide variety of others - see here (page 32) or here for more info. Also, try saying “Scobit” without feeling silly.
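
For example, here’s a sketch of swapping in a probit link for the mouse model (using the weights formulation explained just below):

# same model as below, but with a probit instead of a logit link
mouse_glm_probit <- glm(Proportion ~ Dose,
                        weights = N,
                        family = binomial(link = "probit"),
                        data = mouse)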

We can parameterize binomial logistic regressions one of two ways in R - both are equivalent, as they expand the numbers out to number of successes and total number of trials. The first is a bit cumbersome, as you have to cbind the number of successes and failures.

mouse_glm_cbind <- glm(cbind(Y, N - Y) ~ Dose,
                       family = binomial(link = "logit"),
                       data = mouse)

The second formulation uses weights for the number of trials.

mouse_glm <- glm(Proportion ~ Dose,
                 weights = N,
                 family = binomial(link = "logit"),
                 data = mouse)

These two models are identical.
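
If you don’t believe me, compare the coefficients of the two fits:

# both parameterizations expand to the same likelihood, so the estimates match
coef(mouse_glm_cbind)
coef(mouse_glm)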

From this point forward, the workflow is as usual - assumptions, analysis, and visualization.

check_model(mouse_glm)

binned_residuals(mouse_glm, check = "binned_residuals", n_bins = 8) |> plot()
## Warning: Computation failed in `stat_smooth()`:
## A term has fewer unique covariate combinations than specified maximum degrees of freedom

res_bin <- simulateResiduals(mouse_glm)

plotQQunif(res_bin)

plotResiduals(res_bin)

Looks mostly good!

Dose matters!

summary(mouse_glm)
## 
## Call:
## glm(formula = Proportion ~ Dose, family = binomial(link = "logit"), 
##     data = mouse, weights = N)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9532  -1.2442   0.2327   1.5531   3.6013  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.407769   0.148479  -9.481   <2e-16 ***
## Dose         0.013468   0.001046  12.871   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 434.34  on 67  degrees of freedom
## Residual deviance: 200.51  on 66  degrees of freedom
## AIC: 327.03
## 
## Number of Fisher Scoring iterations: 4
r2(mouse_glm)
## # R2 for Logistic Regression
##   Tjur's R2: 0.010
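
To put those coefficients back on the probability scale, invert the logit with plogis(). A quick sketch (Dose = 100 is just an arbitrary example value):

# plogis() is the inverse logit
# predicted probability of infection at Dose = 100
plogis(coef(mouse_glm)[1] + coef(mouse_glm)[2] * 100)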

Note that the dispersion parameter is taken to be 1, just like the Poisson.

ggplot(mouse,
       aes(x = Dose, y = Proportion,
           weight = N)) +
  geom_point() +
  stat_smooth(method = "glm", 
              method.args = list(family = binomial))
## `geom_smooth()` using formula 'y ~ x'

2.3 Examples

  1. With passing.csv, how does Grade influence the probability of passing?

  2. With UCBAdmissions.csv, how do gender and department affect the probability of being admitted? You might have to do some data manipulation first. Then, you should be able to analyze this using all of the tools you have developed for analyzing categorical data.

3. Overdispersion: Negative Binomial and Quasipoisson

What if we massively fail our check for overdispersion? Consider this data on seal haulouts.

seals <- read_delim("lab/data/SSEAK98_FINAL.txt", delim = "\t")

There are a lot of predictors here, but let’s just look at TideHtMeters.

seal_plot <- ggplot(seals,
       aes(x = TideHtMeters, y = Number )) +
  geom_point()

seal_plot

seal_pois <- glm(Number ~ TideHtMeters, 
                 family = poisson(link = "log"),
                 data = seals)

seal_pois_res <- simulateResiduals(seal_pois)
plot(seal_pois_res)
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details

check_overdispersion(seal_pois) |> plot()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

3.1 The Quasipoisson

A first guess might be that the variance and mean, while scaling linearly, aren’t linked by a factor of 1. Maybe the variance increases at twice the rate of the mean! Or something else. This type of model is called a quasipoisson model. It’s a bit of a hack, but, in essence, you fit a Poisson model and then estimate a dispersion parameter - the scaling coefficient linking the variance to the mean - and inflate the standard errors accordingly.

seal_qpois <- glm(Number ~ TideHtMeters, 
                 family = quasipoisson(link = "log"),
                 data = seals)

Now, we don’t have good diagnostics, as this isn’t a formal (or real) distribution. But, we can look at what that dispersion parameter is.

summary(seal_qpois)
## 
## Call:
## glm(formula = Number ~ TideHtMeters, family = quasipoisson(link = "log"), 
##     data = seals)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -12.012   -8.991   -4.196    2.110   61.868  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.24790    0.04077 104.200   <2e-16 ***
## TideHtMeters  0.01105    0.05408   0.204    0.838    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 149.6154)
## 
##     Null deviance: 162030  on 1609  degrees of freedom
## Residual deviance: 162024  on 1608  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

We can see it’s nearly 150! That’s huge! Note also that the coefficients are identical to the Poisson fit - only the standard errors (and hence the p-values) change, scaled up by the square root of the dispersion.
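
That dispersion estimate is nothing mysterious - it’s just the Pearson chi-square of the plain Poisson fit divided by its residual degrees of freedom, which you can compute by hand:

# by-hand dispersion estimate: sum of squared Pearson residuals / residual df
# this should recover (up to rounding) the ~150 reported by summary(seal_qpois)
sum(residuals(seal_pois, type = "pearson")^2) / df.residual(seal_pois)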

3.2 The Negative Binomial

That scaling coefficient was large. Maybe instead of a linear scaling, it’s a squared mean-variance relationship. This corresponds to the negative binomial - a distribution of the number of successes before a specified number of failures. Like the gamma, there is a squared relationship between mean and variance, making it flexible for overdispersion. It’s not part of glm, but there’s a function for it - glm.nb() in the MASS package. It doesn’t require a family; instead, we can give it a link function only.
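
A quick simulation sketch of that mean-variance relationship: for the negative binomial, variance = mu + mu^2/theta, so with mu = 10 and theta = 2 (example values) we expect a variance around 60.

set.seed(607) # arbitrary seed, for reproducibility

# negative binomial draws with mean mu = 10 and overdispersion parameter theta (size) = 2
draws <- rnbinom(1e5, mu = 10, size = 2)
mean(draws) # should be near 10
var(draws)  # should be near 10 + 10^2 / 2 = 60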

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
seal_nb <- glm.nb(Number ~ TideHtMeters, 
                  link = "log",
                  data = seals)

How does this stack up on the diagnostic front?

check_model(seal_nb)

seal_nb_res <- simulateResiduals(seal_nb)
plotQQunif(seal_nb_res)
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details

Bingo! A beautiful fit! We can now go forward with the coefficients produced by this model with confidence.

3.3 Quasi or NB?

So… which to use? Often, by looking at the data, you can intuit whether the relationship between the fitted values and the variance of the residuals is linear or nonlinear. But sometimes you just need to test it out. Fortunately, performance gives us a check_overdispersion() function to evaluate this (you used to have to do it by hand, as we did above!)

check_overdispersion(seal_pois) |> plot()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

This looks roughly linear after a point, and so the quasipoisson would seem to be the way to go! If it were more curvilinear, or squared, we’d go with the negative binomial here.

3.4 Exercises

  1. Load up the kelp_holdfast.csv data and explore the predictors of number of kelp stipes. Should you use a NB or Quasipoisson error with the model you’ve chosen? What does your chosen model say?

  2. Try out other predictors in the seal data. Does including more predictors change which error distribution you should use? Sometimes, dispersion is caused by unaccounted and correlated predictors!

4. Beta Regression

Finally, we often have data that is bounded, but is not drawn from a binomial - i.e., there are no independent “coin flips”, if you will. This data can easily be accommodated by the beta distribution. The beta is often used to describe a distribution of probabilities (an overdispersed binomial GLM is actually beta-binomial - but more on that another time). But it’s also useful for things like proportional data, percent cover, and more. All of these could be logit transformed, but a beta provides a more natural fit. We use the package betareg for beta regression.
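
To see that flexibility, here’s a quick sketch of a few beta densities (the shape parameters are just example values) - note that all of them live on (0, 1):

beta_tibble <- tibble(x = rep(seq(0.01, 0.99, length.out = 200), 3),
                      shape1 = rep(c(2, 5, 0.8), each = 200),
                      shape2 = rep(c(5, 5, 0.8), each = 200),
                      dens = dbeta(x, shape1, shape2))

ggplot(beta_tibble,
       aes(x = x, y = dens,
           color = paste0("beta(", shape1, ", ", shape2, ")"))) +
  geom_line() +
  labs(color = "shape parameters")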

Consider this analysis of the proportion of sodium intake from different supplements when you exercise with different gym teachers. 2,300 mg is your RDA, so we’re standardizing to that.

sodium <- read_csv("lab/data/sodium_intake.csv")
## Rows: 60 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Instructor, Supplement
## dbl (2): Sodium, Porportion_Sodium_Intake
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(sodium,
       aes(x = Supplement, y = Porportion_Sodium_Intake,
           fill = Instructor)) +
  geom_boxplot()

Now, let’s look at this using beta regression.

library(betareg)

sodium_beta <- betareg(Porportion_Sodium_Intake ~ Supplement + Instructor,
                       data = sodium)

Beta regression isn’t standard in DHARMa yet, so we’re a bit more by the seat of our pants. performance similarly isn’t happy about it. We can use another library, though: glmmTMB - although it can be a pain in the butt to install.

library(glmmTMB)
## Warning in checkMatrixPackageVersion(): Package version inconsistency detected.
## TMB was built with Matrix version 1.4.1
## Current Matrix version is 1.5.1
## Please re-install 'TMB' from source using install.packages('TMB', type = 'source') or ask CRAN for a binary version of 'TMB' matching CRAN's 'Matrix' package
sodium_beta_tmb <- glmmTMB(Porportion_Sodium_Intake ~ Supplement + Instructor,
                           family = beta_family(link = "logit"),
                       data = sodium)
check_model(sodium_beta_tmb)
## `check_outliers()` does not yet support models of class `glmmTMB`.

sodium_res <- simulateResiduals(sodium_beta_tmb)
plotQQunif(sodium_res)

Looks good!

We can then go forward with all of our usual tests and visualizations. For example -

library(emmeans)
emmeans(sodium_beta, specs = ~Supplement) |>
  contrast("pairwise") |>
  confint(adjust = "none")
##  contrast estimate     SE  df asymp.LCL asymp.UCL
##  A - B    -0.06902 0.0224 Inf   -0.1129  -0.02518
##  A - C     0.00228 0.0225 Inf   -0.0418   0.04635
##  A - D    -0.04866 0.0224 Inf   -0.0926  -0.00471
##  B - C     0.07130 0.0224 Inf    0.0275   0.11514
##  B - D     0.02036 0.0223 Inf   -0.0234   0.06408
##  C - D    -0.05094 0.0224 Inf   -0.0949  -0.00699
## 
## Results are averaged over the levels of: Instructor 
## Confidence level used: 0.95

Or, if we had a continuous covariate, we could get fitted values and error and plot them.
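
We can also ask emmeans to back-transform from the logit to the proportion scale. A sketch - type = "response" should trigger the back-transformation for a betareg fit:

# estimated mean proportions per supplement, back on the response scale
emmeans(sodium_beta, specs = ~Supplement, type = "response")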

4.1 Exercises

  1. Try it out with some of my old data on docks looking at how predators shape the percent cover of bare space - dock_pred_div_byrnes.xls. You might have to add 0.01 to the 0 values to make it work. OR - if using glmmTMB, explore the ziformula argument!