Let’s explore non-normal data with generalized linear models. Let’s start with some count data. If we look at the train collisions with motor vehicle data from Agresti (1996, page 83), we see an interesting pattern.
library(readr)
library(ggplot2)
library(dplyr)
library(broom)
library(performance)
train <- read_csv("lab/data/trains_agresti.csv")
train_plot <- ggplot(train,
                     aes(x = km_train_travel,
                         y = collisions)) +
  geom_point()

train_plot
Just looking at this, we can see that the variance differs with the predictor. Also, we are dealing with count data here, which has its own distribution: the Poisson. Still, what if we had soldiered on with an lm?
train_lm <- lm(collisions ~ km_train_travel, data = train)
check_model(train_lm)
Not good. Weird variance in the fitted-residual relationship. A bad qq plot. And a mismatch between predictions and observed. Part of this is because the response variable here does not produce normal residuals. It’s count data, and hence Poisson.
A GLM with a Poisson error typically has a log link, although it can also have an identity link. To remind you, a Poisson distribution is one for discrete data whose variance scales with its mean. This one parameter, \(\lambda\), encompasses both. So, for example,
pois_tib <- tibble(x = rep(0:40, 2),
                   lambda = c(rep(5, 41), rep(20, 41)),
                   dens = dpois(x, lambda = lambda))

ggplot(pois_tib, aes(x = x, y = dens,
                     fill = factor(lambda))) +
  geom_col(position = position_dodge())
shows two Poisson distributions, one with a mean of 5, and one with a mean of 20. Note how their variances change.
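If you want to convince yourself that \(\lambda\) really does control both moments, a quick simulation sketch (not part of the original data - just rpois()) shows the variance tracking the mean:
var(rpois(1e5, lambda = 5))   # ~5
var(rpois(1e5, lambda = 20))  # ~20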
A log link (e.g., \(\hat{y} = e^{a + bx}\)) is a natural fit, as it can produce no values &lt; 0. An identity (i.e. linear) link is appropriate in some situations, and still other link functions can work. See ?family for more. To fit a GLM we merely use glm(), but specify both the correct error family and link function. We can leave the link out if we’re using the canonical link, but I prefer stating it so that future me remembers what I was doing. So, in this case:
train_glm <- glm(collisions ~ km_train_travel,
                 family = poisson(link = "log"),
                 data = train)
We can then re-evaluate model assumptions - including, now, overdispersion. Note, the qq plot below doesn’t really mean anything, as the residuals aren’t normal here.
check_model(train_glm)
So…whither residuals? Given that the residuals are not normal, a qqnorm plot hardly makes sense. And a residual-fitted relationship is still likely to look odd.
One way to evaluate GLMs, and many other model forms, is to look at their quantile residuals. As we discussed in lecture, we essentially use simulation to get an empirical distribution for each y value. We then ask what the quantile of the observed y value is. For any model - regardless of complexity - we would assume that the quantiles should be randomly distributed according to a uniform distribution. We can therefore look at a qqunif plot, if you will. This is a lot of work - simulating, making empirical distributions, testing to see if we fall on a qqunif line. Fortunately, there is Florian Hartig’s excellent DHARMa package, which has a beautiful and clear vignette.
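Before we hand things over, here’s a minimal by-hand sketch of the idea, just to demystify it. Note this is a simplification - for discrete data, DHARMa actually uses randomized quantiles to break ties - but the logic is the same:
# simulate many response vectors from the fitted model, then ask where
# each observed y falls within its own simulated distribution
set.seed(607)
sim_mat <- as.matrix(simulate(train_glm, nsim = 1000))

quantile_res <- vapply(seq_len(nrow(sim_mat)), function(i)
  mean(sim_mat[i, ] < train$collisions[i]), numeric(1))

# if the model is right, these should be roughly uniform on [0, 1]
hist(quantile_res)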
So, to start with, let’s use DHARMa to make some simulated residuals
library(DHARMa)
res <- simulateResiduals(train_glm)
We can then plot these (and get a nonparametric test of fit).
plotQQunif(res)
Wonderful - a good fit. Ignore the outlier test; a more detailed look shows there aren’t any.
We also can look at the prediction v. quantile residual plot. Curiously, here, we should see that different strata (different quantiles) all have nice straight parallel lines. If they don’t, then we might have a case of overdispersion - the variance does not scale with the mean in the way described by the distribution. More on that soon.
plotResiduals(res)
Personally, I prefer what performance does.
check_overdispersion(train_glm) |> plot()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Let’s look at model results.
summary(train_glm)
##
## Call:
## glm(formula = collisions ~ km_train_travel, family = poisson(link = "log"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6727 -1.1972 -0.1673 0.3682 3.1449
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.524000 1.115971 3.158 0.00159 **
## km_train_travel -0.004780 0.002565 -1.863 0.06245 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 43.057 on 28 degrees of freedom
## Residual deviance: 39.402 on 27 degrees of freedom
## AIC: 135.07
##
## Number of Fisher Scoring iterations: 5
Note we see standard glm outputs here, and can interpret coefficients as we would with any log transform. We also have a dispersion parameter, which describes the relationship between the mean and the variance. For Poisson, it’s 1.
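For example, exponentiating the coefficients puts them on a multiplicative scale:
# multiplicative interpretation of log-link coefficients
exp(coef(train_glm))
# exp(-0.00478) ~ 0.995 - each additional unit of km_train_travel
# multiplies expected collisions by ~0.995, i.e., a ~0.5% decrease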
Last, we can plot. Because we have a single predictor, we don’t need to get fancy. stat_smooth can accommodate glms with one predictor.
train_plot +
stat_smooth(method = glm,
method.args = list(family = poisson(link = "log")))
## `geom_smooth()` using formula 'y ~ x'
Often our data is not discrete, though. For example, we might have continuous data that has a terrible QQ plot due to being overdispersed. Worse, the curve might be non-normal. Consider this clam data from Zuur.
clams <- read_delim("lab/data/clams.txt", delim = "\t") %>%
  mutate(MONTH = factor(MONTH))
## Rows: 398 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (5): MONTH, LENGTH, AFD, LNLENGTH, LNAFD
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
How does AFD (ash free dry mass) relate to month and length? It’s clear there’s something nonlinear here.
clam_plot <- ggplot(clams,
                    aes(x = LENGTH, y = AFD, color = MONTH)) +
  geom_point()

clam_plot
Now, it looks like we should just fit a log-transformed model, but…
clam_lm <- lm(log(AFD) ~ LENGTH + MONTH, data = clams)
check_model(clam_lm)
There are clear and obvious problems. Even a qq plot of log-transformed AFD is still not good - nor is a residual-fitted plot. One reason the relationship might be non-normal is that we’re talking about growth - many multiplicative events that combine together. Gamma distributions are more typically used for data about time to some number of events - they’re excellent for rates, for example - but they can also be used here. Gamma glms take the inverse function (e.g. \(\hat{Y} = 1/(a + bx)\)) as their canonical link, but they often work well with log links as well.
clam_gamma <- glm(AFD ~ LENGTH + MONTH,
                  family = Gamma(link = "log"),
                  data = clams)
Does it blend? Let’s see!
check_model(clam_gamma)
Looks good! And
clam_res <- simulateResiduals(clam_gamma)
plotQQunif(clam_res)
plotResiduals(clam_res)
OK, maybe not as great. But largely this is due to the sparsity of high values, so we’re OK.
We could use predict to plot, but I’m going to be lazy for a moment, and just plot each month separately here.
clam_plot +
  stat_smooth(method = glm,
              method.args = list(family = Gamma(link = "log"))) +
  facet_wrap(~MONTH)
## `geom_smooth()` using formula 'y ~ x'
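If we did want the predict() route, a quick sketch might look like the following - base expand.grid(), predictions on the response scale, and no CIs (a fuller version would also pull standard errors via predict() with se.fit = TRUE):
# explicit fitted values instead of stat_smooth
clam_pred <- expand.grid(
  LENGTH = seq(min(clams$LENGTH), max(clams$LENGTH), length.out = 100),
  MONTH = unique(clams$MONTH)
)
clam_pred$AFD <- predict(clam_gamma, newdata = clam_pred, type = "response")

# note this extrapolates every month across the full range of LENGTH
clam_plot +
  geom_line(data = clam_pred) +
  facet_wrap(~MONTH)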
We can also look at other properties.
#coefficients
tidy(clam_gamma)
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5.77 0.0563 -102. 1.55e-284
## 2 LENGTH 0.165 0.00243 68.1 6.90e-219
## 3 MONTH3 -0.469 0.0608 -7.71 1.04e- 13
## 4 MONTH4 -0.289 0.0303 -9.54 1.52e- 19
## 5 MONTH9 -0.387 0.0597 -6.48 2.76e- 10
## 6 MONTH11 0.0523 0.0476 1.10 2.72e- 1
## 7 MONTH12 -0.0335 0.0319 -1.05 2.94e- 1
#more info
summary(clam_gamma)
##
## Call:
## glm(formula = AFD ~ LENGTH + MONTH, family = Gamma(link = "log"),
## data = clams)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.96706 -0.10582 0.01562 0.11773 0.73963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.765076 0.056279 -102.437 < 2e-16 ***
## LENGTH 0.165464 0.002431 68.066 < 2e-16 ***
## MONTH3 -0.468798 0.060795 -7.711 1.04e-13 ***
## MONTH4 -0.289252 0.030308 -9.544 < 2e-16 ***
## MONTH9 -0.386962 0.059712 -6.480 2.76e-10 ***
## MONTH11 0.052287 0.047573 1.099 0.272
## MONTH12 -0.033517 0.031886 -1.051 0.294
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.04105819)
##
## Null deviance: 386.653 on 397 degrees of freedom
## Residual deviance: 18.885 on 391 degrees of freedom
## AIC: -2420.8
##
## Number of Fisher Scoring iterations: 6
Oh, hey, what’s that dispersion parameter? Well, a Gamma usually takes two parameters - a shape and a rate. We can reparameterize a Gamma so that its mean = shape/rate. In that case, we parameterize the Gamma with that mean and the shape. The dispersion parameter is 1/shape.
But, to make this more understandable, the variance of a Gamma expands proportionally to the mean squared. Typically it’s shown as \(Var[Y] = \phi \mu^2\), where \(\phi\) is the dispersion parameter. The larger the dispersion parameter, the faster the variance expands.
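We can check both claims - mean = shape/rate and \(Var[Y] = \phi \mu^2\) - with a quick simulation, plugging in a dispersion close to the one estimated above:
# mean/shape parameterization: shape = 1/phi, rate = shape/mu
set.seed(607)
phi <- 0.041   # roughly the dispersion parameter from summary() above
mu <- 2        # an arbitrary mean

x <- rgamma(1e5, shape = 1/phi, rate = (1/phi)/mu)
mean(x)        # ~ mu
var(x)         # ~ phi * mu^2 = 0.164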
Last, we can look at an \(R^2\) using Nagelkerke’s pseudo-\(R^2\).
# fit
r2(clam_gamma)
## # R2 for Generalized Linear Regression
## Nagelkerke's R2: 0.970
One question you may well have is - why Gamma and not normal here? We can fit a glm with a normal error and log link.
clam_glm_norm <- glm(AFD ~ LENGTH + MONTH,
family = gaussian(link = "log"),
data = clams)
One way to tell is to look for overdispersion. The small Gamma dispersion parameter suggests that, well, maybe that error is not expanding that quickly. So, let’s look.
norm_res <- simulateResiduals(clam_glm_norm)
plotQQunif(norm_res)
plotResiduals(norm_res)
We can see that the QQ is great, actually. And the predicted-vs-residual plot is not terrible (particularly compared to the above). This is some solid evidence that perhaps a normal error and log link is all that is required here.
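Since both models use the same untransformed response, AIC is comparable between them, so we can let it weigh in as one more piece of evidence:
# Gamma vs. gaussian error, both with a log link
AIC(clam_gamma, clam_glm_norm)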
Open up the monarch_gardens.csv, which describes the number of monarch butterflies in different types of gardens, and analyze it with the appropriate family and link. Note, this is an ANOVA-style design, but with a glm. You can use emmeans and the link to draw conclusions. What does this data show you?
Take a look at a data set we have long used before, 16e2InbreedingWolves.csv, describing how inbreeding coefficients influence the number of wolves born. This is Poisson count data. Which link function is better? Use AIC to compare. Do you meet assumptions?
Let’s take a look at our mouse infection by cryptosporidium example. Note that the data is bounded at 0 and 1.
mouse <- read_csv("lab/data/cryptoDATA.csv") %>%
  mutate(Porportion = Y/N)

mouse_plot <- ggplot(mouse,
                     aes(x = Dose, y = Porportion)) +
  geom_point()

mouse_plot
That’s because while N is our total number of mice per sample, we cannot have more than N infections! In essence, each mouse is like a coin flip. Did it get infected or not? And when we have many coin flips - Bernoulli trials - that leads to a binomial distribution.
A binomial distribution has two parameters: the probability of success and the number of coin flips. The resulting counts are bounded between 0 and the number of flips (or, as a proportion, between 0 and 1). So, consider 15 coin flips with different probabilities.
bin_tibble <- tibble(outcome = rep(0:15, 2),
                     probability = c(rep(0.3, 16), rep(0.8, 16)),
                     dens = dbinom(outcome, prob = probability, size = 15))

ggplot(bin_tibble,
       aes(x = outcome, y = dens, fill = factor(probability))) +
  geom_col(position = position_dodge())
We could have just as easily rescaled the x-axis to between 0 and 1 for proportion.
Or, consider the same probability, but a different number of coin flips.
bin_tibble <- tibble(outcome = rep(0:15, 2),
                     size = c(rep(15, 16), rep(30, 16)),
                     dens = dbinom(outcome, prob = 0.4, size = size))

ggplot(bin_tibble,
       aes(x = outcome, y = dens, fill = factor(size))) +
  geom_col(position = position_dodge())
You can see both parameters matter for the shape of the distribution.
With binomial logistic regression, we are essentially estimating the probability of getting heads. We then supply - not estimate - the number of trials as a weights argument. The canonical link - the logit link - is perfect here, as it describes a logistic curve that saturates at both 0 and 1. There are other links we can use. If none of our values get to 0 or 1, linear might even be appropriate. Or log. There are other useful ones. The probit link, for example, describes the saturating shape of the cumulative distribution function for a normal distribution, and has real meaning with respect to survivorship - although it approaches 0 and 1 very quickly. There are a wide variety of others - see here (page 32) or here for more info. Also, try saying “Scobit” without feeling silly.
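To see those saturating shapes, here’s a quick sketch of the inverse logit and inverse probit using base R’s plogis() and pnorm():
# both curves saturate at 0 and 1; the probit approaches them faster
link_tib <- tibble(x = seq(-4, 4, length.out = 200),
                   logit = plogis(x),    # inverse logit
                   probit = pnorm(x))    # inverse probit

ggplot(link_tib, aes(x = x)) +
  geom_line(aes(y = logit, color = "logit")) +
  geom_line(aes(y = probit, color = "probit"))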
We can parameterize binomial logistic regressions one of two ways in R - both are equivalent, as they expand the numbers out to number of successes and total number of trials. The first is a bit cumbersome, as you have to cbind the number of successes and failures.
mouse_glm_cbind <- glm(cbind(Y, N - Y) ~ Dose,
                       family = binomial(link = "logit"),
                       data = mouse)
The second formulation uses weights for the number of trials.
mouse_glm <- glm(Porportion ~ Dose,
                 weights = N,
                 family = binomial(link = "logit"),
                 data = mouse)
These two models are identical.
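Don’t believe me? Compare their coefficients:
# identical estimates from the two parameterizations
coef(mouse_glm_cbind)
coef(mouse_glm)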
From this point forward, the workflow is as usual - assumptions, analysis, and visualization.
check_model(mouse_glm)
binned_residuals(mouse_glm, check = "binned_residuals", n_bins = 8) |> plot()
## Warning: Computation failed in `stat_smooth()`:
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
res_bin <- simulateResiduals(mouse_glm)
plotQQunif(res_bin)
plotResiduals(res_bin)
Looks mostly good!
Dose matters!
summary(mouse_glm)
##
## Call:
## glm(formula = Porportion ~ Dose, family = binomial(link = "logit"),
## data = mouse, weights = N)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9532 -1.2442 0.2327 1.5531 3.6013
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.407769 0.148479 -9.481 <2e-16 ***
## Dose 0.013468 0.001046 12.871 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.34 on 67 degrees of freedom
## Residual deviance: 200.51 on 66 degrees of freedom
## AIC: 327.03
##
## Number of Fisher Scoring iterations: 4
r2(mouse_glm)
## # R2 for Logistic Regression
## Tjur's R2: 0.010
Note that the dispersion parameter is 1, just like the Poisson.
ggplot(mouse,
       aes(x = Dose, y = Porportion,
           weight = N)) +
  geom_point() +
  stat_smooth(method = "glm",
              method.args = list(family = binomial))
## `geom_smooth()` using formula 'y ~ x'
Using passing.csv, how does Grade influence the probability of passing?

Using UCBAdmissions.csv, how does gender and department affect the probability of being admitted? You might have to do some data manipulation first. Then, you should be able to analyze this using all of the tools you have developed for analyzing categorical data.

What if we massively fail our test of overdispersion? Consider this data on seal haulouts.
seals <- read_delim("lab/data/SSEAK98_FINAL.txt", delim = "\t")
There are a lot of predictors here, but let’s just look at TideHtMeters.
seal_plot <- ggplot(seals,
                    aes(x = TideHtMeters, y = Number)) +
  geom_point()

seal_plot
seal_pois <- glm(Number ~ TideHtMeters,
                 family = poisson(link = "log"),
                 data = seals)
seal_pois_res <- simulateResiduals(seal_pois)
plot(seal_pois_res)
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
check_overdispersion(seal_pois) |> plot()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
A first guess might be that the variance and mean, while scaling linearly, aren’t linked by a factor of 1. Maybe the variance increases at twice the rate of the mean! Or something else. This type of model is called a quasi-Poisson model. It’s a bit of a hack, but, in essence, you fit a Poisson model, and then algorithmically futz with a scaling parameter to see what coefficient you would need to actually encompass 95% of the CI.
seal_qpois <- glm(Number ~ TideHtMeters,
                  family = quasipoisson(link = "log"),
                  data = seals)
Now, we don’t have good diagnostics, as this isn’t a formal (or real) distribution. But, we can look at what that dispersion parameter is.
summary(seal_qpois)
##
## Call:
## glm(formula = Number ~ TideHtMeters, family = quasipoisson(link = "log"),
## data = seals)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -12.012 -8.991 -4.196 2.110 61.868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.24790 0.04077 104.200 <2e-16 ***
## TideHtMeters 0.01105 0.05408 0.204 0.838
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 149.6154)
##
## Null deviance: 162030 on 1609 degrees of freedom
## Residual deviance: 162024 on 1608 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
We can see it’s nearly 150! That’s huge! Note also that the coefficient estimates are the same as in the Poisson fit - but the standard errors (and hence p-values) have been inflated by the square root of that dispersion parameter.
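We can verify that scaling directly, using the dispersion estimate printed above:
# quasi-Poisson SEs are Poisson SEs inflated by sqrt(dispersion)
summary(seal_pois)$coefficients[, "Std. Error"] * sqrt(149.6154)
summary(seal_qpois)$coefficients[, "Std. Error"]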
That scaling coefficient was large. Maybe instead of a linear scaling, it’s a squared mean-variance relationship. This corresponds to the negative binomial - a distribution of the number of successes before a specified number of failures. Like the Gamma, there is a squared relationship between mean and variance, making it flexible for overdispersion. It’s not part of glm, but there’s a function for it - glm.nb in the MASS package. It doesn’t require a family; instead, we can give it a link function only.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
seal_nb <- glm.nb(Number ~ TideHtMeters,
                  link = "log",
                  data = seals)
How does this stack up on the diagnostic front?
check_model(seal_nb)
seal_nb_res <- simulateResiduals(seal_nb)
plotQQunif(seal_nb_res)
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
Bingo! A beautiful fit! We can now go forward with the coefficients produced by this model with confidence.
So…which to use? Often, by looking at the data, you can intuit whether the relationship between fitted values and the variance in residuals is linear or nonlinear. But, sometimes, you just need to test it out. Fortunately, performance gives us a check_overdispersion() function to evaluate this (you used to have to do it by hand!).
check_overdispersion(seal_pois) |> plot()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
This looks roughly linear after a point, and so the quasi-Poisson would seem to be the way to go! If it’s more curvilinear, or squared, we’d go with the negative binomial here.
Load up the kelp_holdfast.csv data and explore the predictors of number of kelp stipes. Should you use a negative binomial or quasi-Poisson error with the model you’ve chosen? What does your chosen model say?
Try out other predictors in the seal data. Does including more predictors change which error distribution you should use? Sometimes, dispersion is caused by unaccounted and correlated predictors!
Finally, we often have data that is bounded, but is not drawn from a binomial - i.e., there are no independent “coin flips”, if you will. This data can easily be accommodated by the beta distribution. The beta is often used to describe a distribution of probabilities (an overdispersed binomial glm is actually beta-binomial - but more on that another time). But it’s also useful for things like proportional data, percent cover, and more. All of these can be logit transformed, but a beta provides a more natural fit. We use the package betareg for beta regression.
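As with the Poisson and binomial sketches above, it helps to see a couple of beta distributions first - here with arbitrary shape parameters, just to show how the density lives entirely on (0, 1):
beta_tib <- tibble(x = rep(seq(0.01, 0.99, length.out = 99), 2),
                   shape1 = c(rep(2, 99), rep(10, 99)),
                   dens = dbeta(x, shape1 = shape1, shape2 = 5))

ggplot(beta_tib, aes(x = x, y = dens, color = factor(shape1))) +
  geom_line()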
Consider this analysis of proportion of sodium intake from different supplements when you exercise with different gym teachers. 2300 mg is your RDA, so we’re standardizing to that.
sodium <- read_csv("lab/data/sodium_intake.csv")
## Rows: 60 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Instructor, Supplement
## dbl (2): Sodium, Porportion_Sodium_Intake
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(sodium,
       aes(x = Supplement, y = Porportion_Sodium_Intake,
           fill = Instructor)) +
  geom_boxplot()
Now, let’s look at this using beta regression.
library(betareg)
sodium_beta <- betareg(Porportion_Sodium_Intake ~ Supplement + Instructor,
                       data = sodium)
Beta regression isn’t standard in DHARMa yet, so we’re a bit more by the seat of our pants. performance similarly isn’t happy about it. We can use another library, though: glmmTMB (although it can be a pain in the butt to install).
library(glmmTMB)
## Warning in checkMatrixPackageVersion(): Package version inconsistency detected.
## TMB was built with Matrix version 1.4.1
## Current Matrix version is 1.5.1
## Please re-install 'TMB' from source using install.packages('TMB', type = 'source') or ask CRAN for a binary version of 'TMB' matching CRAN's 'Matrix' package
sodium_beta_tmb <- glmmTMB(Porportion_Sodium_Intake ~ Supplement + Instructor,
                           family = beta_family(link = "logit"),
                           data = sodium)
check_model(sodium_beta_tmb)
## `check_outliers()` does not yet support models of class `glmmTMB`.
sodium_res <- simulateResiduals(sodium_beta_tmb)
plotQQunif(sodium_res)
Looks good!
We can then go forward with all of our usual tests and visualizations. For example -
library(emmeans)
emmeans(sodium_beta, specs = ~Supplement) |>
  contrast("pairwise") |>
  confint(adjust = "none")
## contrast estimate SE df asymp.LCL asymp.UCL
## A - B -0.06902 0.0224 Inf -0.1129 -0.02518
## A - C 0.00228 0.0225 Inf -0.0418 0.04635
## A - D -0.04866 0.0224 Inf -0.0926 -0.00471
## B - C 0.07130 0.0224 Inf 0.0275 0.11514
## B - D 0.02036 0.0223 Inf -0.0234 0.06408
## C - D -0.05094 0.0224 Inf -0.0949 -0.00699
##
## Results are averaged over the levels of: Instructor
## Confidence level used: 0.95
Or if we had a continuous covariate, we could get fitted values and error and add them to the plot.
Last, take a look at dock_pred_div_byrnes.xls. You might have to add 0.01 to the 0 values to make it work. OR - if using glmmTMB, explore the zi argument!