Let’s explore non-normal data with generalized linear models. Let’s start with some count data. If we look at the train collisions with motor vehicle data from Agresti (1996, page 83), we see an interesting pattern.
library(readr)
library(ggplot2)
library(dplyr)
library(broom)
library(performance)
train <- read_csv("lab/data/trains_agresti.csv")
train_plot <- ggplot(train,
                     aes(x = km_train_travel,
                         y = collisions)) +
  geom_point()

train_plot
Just looking at this, we can see that the variance differs with the predictor. Also, we are dealing with count data here, which has its own distribution: the Poisson. Still, what if we had soldiered on with an lm?
train_lm <- lm(collisions ~ km_train_travel, data = train)
check_model(train_lm)
Not good. Weird variance in the fitted-residual relationship. A bad qq plot. And a mismatch between predictions and observed. Part of this is because the response variable here does not produce normal residuals. It’s count data, and hence Poisson.
A GLM with a Poisson error typically has a log link, although it can also have an identity link. To remind you, a Poisson distribution is one for discrete data whose variance scales with its mean. This one parameter, \(\lambda\), encompasses both. So, for example,
pois_tib <- tibble(x = rep(0:40, 2),
                   lambda = c(rep(5, 41), rep(20, 41)),
                   dens = dpois(x, lambda = lambda))

ggplot(pois_tib, aes(x = x, y = dens,
                     fill = factor(lambda))) +
  geom_col(position = position_dodge())
shows two Poisson distributions, one with a mean of 5, and one with a mean of 20. Note how their variances change.
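If you want to convince yourself that \(\lambda\) really does control both moments, a quick simulation sketch (not part of the original data - just rpois()) shows the variance tracking the mean:
var(rpois(1e5, lambda = 5))   # ~5
var(rpois(1e5, lambda = 20))  # ~20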
A log link (e.g., \(\hat{y} = e^{a + bx}\)) is a natural fit, as it can produce no values &lt; 0. An identity (i.e. linear) link is appropriate in some situations, and still other link functions can work. See ?family for more. To fit a GLM we merely use glm(), but specify both the correct error family and link function. We can leave the link out if we’re using the canonical link, but I prefer stating it so that future me remembers what I was doing. So, in this case:
train_glm <- glm(collisions ~ km_train_travel,
                 family = poisson(link = "log"),
                 data = train)
We can then re-evaluate model assumptions - including, now, overdispersion. Note, the qq plot below doesn’t really mean anything, as the residuals aren’t normal here.
check_model(train_glm)
So…whither residuals? Given that the residuals are not normal, a qqnorm plot hardly makes sense. And a residual-fitted relationship is still likely to look odd.
One way to evaluate GLMs, and many other model forms, is to look at their quantile residuals. As we discussed in lecture, we essentially use simulation to get an empirical distribution for each y value. We then ask what the quantile of the observed y value is. For any model - regardless of complexity - we would assume that the quantiles should be randomly distributed according to a uniform distribution. We can therefore look at a qqunif plot, if you will. This is a lot of work - simulating, making empirical distributions, testing to see if we fall on a qqunif line. Fortunately, there is Florian Hartig’s excellent DHARMa package, which has a beautiful and clear vignette.
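Before we hand things over, here’s a minimal by-hand sketch of the idea, just to demystify it. Note this is a simplification - for discrete data, DHARMa actually uses randomized quantiles to break ties - but the logic is the same:
# simulate many response vectors from the fitted model, then ask where
# each observed y falls within its own simulated distribution
set.seed(607)
sim_mat <- as.matrix(simulate(train_glm, nsim = 1000))

quantile_res <- vapply(seq_len(nrow(sim_mat)), function(i)
  mean(sim_mat[i, ] < train$collisions[i]), numeric(1))

# if the model is right, these should be roughly uniform on [0, 1]
hist(quantile_res)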
So, to start with, let’s use DHARMa to make some simulated residuals
library(DHARMa)
res <- simulateResiduals(train_glm)
We can then plot these (and get a nonparametric test of fit).
plotQQunif(res)
Wonderful - a good fit. Ignore the outlier test; a more detailed look shows there aren’t any.
We also can look at the prediction v. quantile residual plot. Curiously, here, we should see that different strata (different quantiles) all have nice straight parallel lines. If they don’t, then we might have a case of overdispersion - the variance does not scale with the mean in the way described by the distribution. More on that soon.
plotResiduals(res)
Personally, I prefer what performance does.
check_overdispersion(train_glm) |> plot()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Let’s look at model results.
summary(train_glm)
##
## Call:
## glm(formula = collisions ~ km_train_travel, family = poisson(link = "log"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6727 -1.1972 -0.1673 0.3682 3.1449
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.524000 1.115971 3.158 0.00159 **
## km_train_travel -0.004780 0.002565 -1.863 0.06245 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 43.057 on 28 degrees of freedom
## Residual deviance: 39.402 on 27 degrees of freedom
## AIC: 135.07
##
## Number of Fisher Scoring iterations: 5
Note we see standard glm outputs here, and can interpret coefficients as we would with any log transform. We also have a dispersion parameter, which describes the relationship between the mean and the variance. For Poisson, it’s 1.
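For example, exponentiating the coefficients puts them on a multiplicative scale:
# multiplicative interpretation of log-link coefficients
exp(coef(train_glm))
# exp(-0.00478) ~ 0.995 - each additional unit of km_train_travel
# multiplies expected collisions by ~0.995, i.e., a ~0.5% decrease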
Last, we can plot. Because we have a single predictor, we don’t need to get fancy. stat_smooth can accommodate glms with one predictor.
train_plot +
stat_smooth(method = glm,
method.args = list(family = poisson(link = "log")))
## `geom_smooth()` using formula 'y ~ x'
Often our data is not discrete, though. For example, we might have continuous data that has a terrible QQ plot due to being overdispersed. Worse, the curve might be non-normal. Consider this clam data from Zuur.
clams <- read_delim("lab/data/clams.txt", delim = "\t") %>%
  mutate(MONTH = factor(MONTH))
## Rows: 398 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (5): MONTH, LENGTH, AFD, LNLENGTH, LNAFD
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
How does AFD (ash free dry mass) relate to month and length? It’s clear there’s something nonlinear here.
clam_plot <- ggplot(clams,
                    aes(x = LENGTH, y = AFD, color = MONTH)) +
  geom_point()

clam_plot
Now, it looks like we should just fit a log-transformed model, but…
clam_lm <- lm(log(AFD) ~ LENGTH + MONTH, data = clams)
check_model(clam_lm)
There are clear and obvious problems. Even a qq plot of log-transformed AFD is still not good - nor is a residual-fitted plot. One reason the relationship might be non-normal is that we’re talking about growth - many multiplicative events that combine together. Gamma distributions are more typically used for data about time to some number of events - they’re excellent for rates, for example - but they can also be used here. Gamma glms take the inverse function (e.g. \(\hat{Y} = 1/(a + bx)\)) as their canonical link, but they often work well with log links as well.
clam_gamma <- glm(AFD ~ LENGTH + MONTH,
                  family = Gamma(link = "log"),
                  data = clams)
Does it blend? Let’s see!
check_model(clam_gamma)
Looks good! And
clam_res <- simulateResiduals(clam_gamma)
plotQQunif(clam_res)
plotResiduals(clam_res)
OK, maybe not as great. But largely this is due to the sparsity of high values, so we’re OK.
We could use predict to plot, but I’m going to be lazy for a moment, and just plot each month separately here.
clam_plot +
  stat_smooth(method = glm,
              method.args = list(family = Gamma(link = "log"))) +
  facet_wrap(~MONTH)
## `geom_smooth()` using formula 'y ~ x'
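If we did want the predict() route, a quick sketch might look like the following - base expand.grid(), predictions on the response scale, and no CIs (a fuller version would also pull standard errors via predict() with se.fit = TRUE):
# explicit fitted values instead of stat_smooth
clam_pred <- expand.grid(
  LENGTH = seq(min(clams$LENGTH), max(clams$LENGTH), length.out = 100),
  MONTH = unique(clams$MONTH)
)
clam_pred$AFD <- predict(clam_gamma, newdata = clam_pred, type = "response")

# note this extrapolates every month across the full range of LENGTH
clam_plot +
  geom_line(data = clam_pred) +
  facet_wrap(~MONTH)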
We can also look at other properties.
#coefficients
tidy(clam_gamma)
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5.77 0.0563 -102. 1.55e-284
## 2 LENGTH 0.165 0.00243 68.1 6.90e-219
## 3 MONTH3 -0.469 0.0608 -7.71 1.04e- 13
## 4 MONTH4 -0.289 0.0303 -9.54 1.52e- 19
## 5 MONTH9 -0.387 0.0597 -6.48 2.76e- 10
## 6 MONTH11 0.0523 0.0476 1.10 2.72e- 1
## 7 MONTH12 -0.0335 0.0319 -1.05 2.94e- 1
#more info
summary(clam_gamma)
##
## Call:
## glm(formula = AFD ~ LENGTH + MONTH, family = Gamma(link = "log"),
## data = clams)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.96706 -0.10582 0.01562 0.11773 0.73963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.765076 0.056279 -102.437 < 2e-16 ***
## LENGTH 0.165464 0.002431 68.066 < 2e-16 ***
## MONTH3 -0.468798 0.060795 -7.711 1.04e-13 ***
## MONTH4 -0.289252 0.030308 -9.544 < 2e-16 ***
## MONTH9 -0.386962 0.059712 -6.480 2.76e-10 ***
## MONTH11 0.052287 0.047573 1.099 0.272
## MONTH12 -0.033517 0.031886 -1.051 0.294
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.04105819)
##
## Null deviance: 386.653 on 397 degrees of freedom
## Residual deviance: 18.885 on 391 degrees of freedom
## AIC: -2420.8
##
## Number of Fisher Scoring iterations: 6
Oh, hey, what’s that dispersion parameter? Well, a Gamma usually takes two parameters - a shape and a rate. We can reparameterize a Gamma so that its mean = shape/rate. In that case, we parameterize the Gamma with that mean and the shape. The dispersion parameter is 1/shape.
But, to make this more understandable, the variance of a Gamma expands proportionally to the mean squared. Typically it’s shown as \(Var[Y] = \phi \mu^2\), where \(\phi\) is the dispersion parameter. The larger the dispersion parameter, the faster the variance expands.
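We can check both claims - mean = shape/rate and \(Var[Y] = \phi \mu^2\) - with a quick simulation, plugging in a dispersion close to the one estimated above:
# mean/shape parameterization: shape = 1/phi, rate = shape/mu
set.seed(607)
phi <- 0.041   # roughly the dispersion parameter from summary() above
mu <- 2        # an arbitrary mean

x <- rgamma(1e5, shape = 1/phi, rate = (1/phi)/mu)
mean(x)        # ~ mu
var(x)         # ~ phi * mu^2 = 0.164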
Last, we can look at an \(R^2\) using Nagelkerke’s pseudo-\(R^2\).
# fit
r2(clam_gamma)
## # R2 for Generalized Linear Regression
## Nagelkerke's R2: 0.970
One question you may well have is - why Gamma and not normal here? We can fit a glm with a normal error and log link.
clam_glm_norm <- glm(AFD ~ LENGTH + MONTH,
family = gaussian(link = "log"),
data = clams)
One way to tell is to look for overdispersion. The small Gamma dispersion parameter suggests that, well, maybe that error is not expanding that quickly. So, let’s look.
norm_res <- simulateResiduals(clam_glm_norm)
plotQQunif(norm_res)
plotResiduals(norm_res)
We can see that the QQ is great, actually. And the predicted-vs-residual plot is not terrible (particularly compared to the above). This is some solid evidence that perhaps a normal error and log link is all that is required here.
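Since both models use the same untransformed response, AIC is comparable between them, so we can let it weigh in as one more piece of evidence:
# Gamma vs. gaussian error, both with a log link
AIC(clam_gamma, clam_glm_norm)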
Open up the monarch_gardens.csv, which describes the number of monarch butterflies in different types of gardens, and analyze it with the appropriate family and link. Note, this is an ANOVA-style design, but with a glm. You can use emmeans and the link to draw conclusions. What does this data show you?
Take a look at a data set we have long used before, 16e2InbreedingWolves.csv, describing how inbreeding coefficients influence the number of wolves born. This is Poisson count data. Which link function is better? Use AIC to compare. Do you meet assumptions?
Let’s take a look at our mouse infection by cryptosporidium example. Note that the data is bounded at 0 and 1.
mouse <- read_csv("lab/data/cryptoDATA.csv") %>%
  mutate(Porportion = Y/N)

mouse_plot <- ggplot(mouse,
                     aes(x = Dose, y = Porportion)) +
  geom_point()

mouse_plot
That’s because while N is our total number of mice per sample, we cannot have more than N infections! In essence, each mouse is like a coin flip. Did it get infected or not? And when we have many coin flips - Bernoulli trials - that leads to a binomial distribution.
A binomial distribution has two parameters: the probability of success and the number of coin flips. The resulting counts are bounded between 0 and the number of flips (or, as a proportion, between 0 and 1). So, consider 15 coin flips with different probabilities.
bin_tibble <- tibble(outcome = rep(0:15, 2),
                     probability = c(rep(0.3, 16), rep(0.8, 16)),
                     dens = dbinom(outcome, prob = probability, size = 15))

ggplot(bin_tibble,
       aes(x = outcome, y = dens, fill = factor(probability))) +
  geom_col(position = position_dodge())
We could have just as easily rescaled the x-axis to between 0 and 1 for proportion.
Or, consider the same probability, but a different number of coin flips.
bin_tibble <- tibble(outcome = rep(0:15, 2),
                     size = c(rep(15, 16), rep(30, 16)),
                     dens = dbinom(outcome, prob = 0.4, size = size))

ggplot(bin_tibble,
       aes(x = outcome, y = dens, fill = factor(size))) +
  geom_col(position = position_dodge())
You can see both parameters matter for the shape of the distribution.
With binomial logistic regression, we are essentially estimating the probability of getting heads. We then supply - not estimate - the number of trials as a weights argument. The canonical link - the logit link - is perfect here, as it describes a logistic curve that saturates at both 0 and 1. There are other links we can use. If none of our values get to 0 or 1, linear might even be appropriate. Or log. There are other useful ones. The probit link, for example, describes the saturating shape of the cumulative distribution function for a normal distribution, and has real meaning with respect to survivorship - although it approaches 0 and 1 very quickly. There are a wide variety of others - see here (page 32) or here for more info. Also, try saying “Scobit” without feeling silly.
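To see those saturating shapes, here’s a quick sketch of the inverse logit and inverse probit using base R’s plogis() and pnorm():
# both curves saturate at 0 and 1; the probit approaches them faster
link_tib <- tibble(x = seq(-4, 4, length.out = 200),
                   logit = plogis(x),    # inverse logit
                   probit = pnorm(x))    # inverse probit

ggplot(link_tib, aes(x = x)) +
  geom_line(aes(y = logit, color = "logit")) +
  geom_line(aes(y = probit, color = "probit"))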
We can parameterize binomial logistic regressions one of two ways in R - both are equivalent, as they expand the numbers out to number of successes and total number of trials. The first is a bit cumbersome, as you have to cbind the number of successes and failures.
mouse_glm_cbind <- glm(cbind(Y, N - Y) ~ Dose,
                       family = binomial(link = "logit"),
                       data = mouse)
The second formulation uses weights for the number of trials.
mouse_glm <- glm(Porportion ~ Dose,
                 weights = N,
                 family = binomial(link = "logit"),
                 data = mouse)
These two models are identical.
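Don’t believe me? Compare their coefficients:
# identical estimates from the two parameterizations
coef(mouse_glm_cbind)
coef(mouse_glm)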
From this point forward, the workflow is as usual - assumptions, analysis, and visualization.
check_model(mouse_glm)
binned_residuals(mouse_glm, check = "binned_residuals", n_bins = 8) |> plot()
## Warning: Computation failed in `stat_smooth()`:
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
res_bin <- simulateResiduals(mouse_glm)
plotQQunif(res_bin)
plotResiduals(res_bin)
Looks mostly good!
Dose matters!
summary(mouse_glm)
##
## Call:
## glm(formula = Porportion ~ Dose, family = binomial(link = "logit"),
## data = mouse, weights = N)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9532 -1.2442 0.2327 1.5531 3.6013
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.407769 0.148479 -9.481 <2e-16 ***
## Dose 0.013468 0.001046 12.871 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.34 on 67 degrees of freedom
## Residual deviance: 200.51 on 66 degrees of freedom
## AIC: 327.03
##
## Number of Fisher Scoring iterations: 4
r2(mouse_glm)
## # R2 for Logistic Regression
## Tjur's R2: 0.010
Note that the dispersion parameter is 1, just like the Poisson.
ggplot(mouse,
       aes(x = Dose, y = Porportion,
           weight = N)) +
  geom_point() +
  stat_smooth(method = "glm",
              method.args = list(family = binomial))
## `geom_smooth()` using formula 'y ~ x'
Using passing.csv, how does Grade influence the probability of passing?

Using UCBAdmissions.csv, how does gender and department affect the probability of being admitted? You might have to do some data manipulation first. Then, you should be able to analyze this using all of the tools you have developed for analyzing categorical data.

What if we massively fail our test of overdispersion? Consider this data on seal haulouts.
seals <- read_delim("lab/data/SSEAK98_FINAL.txt", delim = "\t")
There are a lot of predictors here, but let’s just look at TideHtMeters.
seal_plot <- ggplot(seals,
                    aes(x = TideHtMeters, y = Number)) +
  geom_point()

seal_plot
seal_pois <- glm(Number ~ TideHtMeters,
                 family = poisson(link = "log"),
                 data = seals)
seal_pois_res <- simulateResiduals(seal_pois)
plot(seal_pois_res)
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
check_overdispersion(seal_pois) |> plot()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
A first guess might be that the variance and mean, while scaling linearly, aren’t linked by a factor of 1. Maybe the variance increases at twice the rate of the mean! Or something else. This type of model is called a quasi-Poisson model. It’s a bit of a hack, but, in essence, you fit a Poisson model, and then algorithmically futz with a scaling parameter to see what coefficient you would need to actually encompass 95% of the CI.
seal_qpois <- glm(Number ~ TideHtMeters,
                  family = quasipoisson(link = "log"),
                  data = seals)
Now, we don’t have good diagnostics, as this isn’t a formal (or real) distribution. But, we can look at what that dispersion parameter is.
summary(seal_qpois)
##
## Call:
## glm(formula = Number ~ TideHtMeters, family = quasipoisson(link = "log"),
## data = seals)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -12.012 -8.991 -4.196 2.110 61.868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.24790 0.04077 104.200 <2e-16 ***
## TideHtMeters 0.01105 0.05408 0.204 0.838
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 149.6154)
##
## Null deviance: 162030 on 1609 degrees of freedom
## Residual deviance: 162024 on 1608 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
We can see it’s nearly 150! That’s huge! Note also that the coefficient estimates are the same as in the Poisson fit - but the standard errors (and hence p-values) have been inflated by the square root of that dispersion parameter.
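We can verify that scaling directly, using the dispersion estimate printed above:
# quasi-Poisson SEs are Poisson SEs inflated by sqrt(dispersion)
summary(seal_pois)$coefficients[, "Std. Error"] * sqrt(149.6154)
summary(seal_qpois)$coefficients[, "Std. Error"]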
That scaling coefficient was large. Maybe instead of a linear scaling, it’s a squared mean-variance relationship. This corresponds to the negative binomial - a distribution of the number of successes before a specified number of failures. Like the Gamma, there is a squared relationship between mean and variance, making it flexible for overdispersion. It’s not part of glm, but there’s a function for it - glm.nb in the MASS package. It doesn’t require a family; instead, we can give it a link function only.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
seal_nb <- glm.nb(Number ~ TideHtMeters,
                  link = "log",
                  data = seals)
How does this stack up on the diagnostic front?
check_model(seal_nb)
seal_nb_res <- simulateResiduals(seal_nb)
plotQQunif(seal_nb_res)
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
Bingo! A beautiful fit! We can now go forward with the coefficients produced by this model with confidence.
So…which to use? Often, by looking at the data, you can intuit whether the relationship between fitted values and the variance in residuals is linear or nonlinear. But, sometimes, you just need to test it out. Fortunately, performance gives us a check_overdispersion() function to evaluate this (you used to have to do it by hand!).
check_overdispersion(seal_pois) |> plot()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
This looks roughly linear after a point, and so the quasi-Poisson would seem to be the way to go! If it’s more curvilinear, or squared, we’d go with the negative binomial here.
Load up the kelp_holdfast.csv data and explore the predictors of number of kelp stipes. Should you use a negative binomial or quasi-Poisson error with the model you’ve chosen? What does your chosen model say?
Try out other predictors in the seal data. Does including more predictors change which error distribution you should use? Sometimes, dispersion is caused by unaccounted and correlated predictors!
Finally, we often have data that is bounded, but is not drawn from a binomial - i.e., there are no independent “coin flips”, if you will. This data can easily be accommodated by the beta distribution. The beta is often used to describe a distribution of probabilities (an overdispersed binomial glm is actually beta-binomial - but more on that another time). But it’s also useful for things like proportional data, percent cover, and more. All of these can be logit transformed, but a beta provides a more natural fit. We use the package betareg for beta regression.
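As with the Poisson and binomial sketches above, it helps to see a couple of beta distributions first - here with arbitrary shape parameters, just to show how the density lives entirely on (0, 1):
beta_tib <- tibble(x = rep(seq(0.01, 0.99, length.out = 99), 2),
                   shape1 = c(rep(2, 99), rep(10, 99)),
                   dens = dbeta(x, shape1 = shape1, shape2 = 5))

ggplot(beta_tib, aes(x = x, y = dens, color = factor(shape1))) +
  geom_line()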
Consider this analysis of proportion of sodium intake from different supplements when you exercise with different gym teachers. 2300 mg is your RDA, so we’re standardizing to that.
sodium <- read_csv("lab/data/sodium_intake.csv")
## Rows: 60 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Instructor, Supplement
## dbl (2): Sodium, Porportion_Sodium_Intake
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(sodium,
       aes(x = Supplement, y = Porportion_Sodium_Intake,
           fill = Instructor)) +
  geom_boxplot()
Now, let’s look at this using beta regression.
library(betareg)
sodium_beta <- betareg(Porportion_Sodium_Intake ~ Supplement + Instructor,
                       data = sodium)
Beta regression isn’t standard in DHARMa yet, so we’re a bit more by the seat of our pants. performance similarly isn’t happy about it. We can use another library, though: glmmTMB (although it can be a pain in the butt to install).
library(glmmTMB)
## Warning in checkMatrixPackageVersion(): Package version inconsistency detected.
## TMB was built with Matrix version 1.4.1
## Current Matrix version is 1.5.1
## Please re-install 'TMB' from source using install.packages('TMB', type = 'source') or ask CRAN for a binary version of 'TMB' matching CRAN's 'Matrix' package
sodium_beta_tmb <- glmmTMB(Porportion_Sodium_Intake ~ Supplement + Instructor,
                           family = beta_family(link = "logit"),
                           data = sodium)
check_model(sodium_beta_tmb)
## `check_outliers()` does not yet support models of class `glmmTMB`.
sodium_res <- simulateResiduals(sodium_beta_tmb)
plotQQunif(sodium_res)
Looks good!
We can then go forward with all of our usual tests and visualizations. For example -
library(emmeans)
emmeans(sodium_beta, specs = ~Supplement) |>
  contrast("pairwise") |>
  confint(adjust = "none")
## contrast estimate SE df asymp.LCL asymp.UCL
## A - B -0.06902 0.0224 Inf -0.1129 -0.02518
## A - C 0.00228 0.0225 Inf -0.0418 0.04635
## A - D -0.04866 0.0224 Inf -0.0926 -0.00471
## B - C 0.07130 0.0224 Inf 0.0275 0.11514
## B - D 0.02036 0.0223 Inf -0.0234 0.06408
## C - D -0.05094 0.0224 Inf -0.0949 -0.00699
##
## Results are averaged over the levels of: Instructor
## Confidence level used: 0.95
Or if we had a continuous covariate, we could get fitted values and error and add them to the plot.
Last, take a look at dock_pred_div_byrnes.xls. You might have to add 0.01 to the 0 values to make it work. OR - if using glmmTMB, explore the zi argument!