Fitting a simple squared or other nonlinear term in a model is fairly straightforward in R. One simply has to include the term in the model itself. So, let’s take a look at how it works.
library(tidyverse)
library(readr)
library(readxl)
theme_set(theme_classic(base_size = 17))
turtles <- read_excel("./data/Ashton_2007.xlsx")
turtle_plot <- ggplot(turtles,
aes(x = Length, y = Clutch)) +
geom_point()
turtle_plot
This is clearly nonlinear. In fact, if we fit a linear model, the assumption plots would look wonky - particularly the residuals-versus-fitted plot, which would show a leftover nonlinear relationship.
plot(lm(Clutch ~ Length , data = turtles), which = 1)
To fit a squared polynomial to this data, you need to do a bit more than just add the term to the model. We use the function I() inside of the model formula. It tells R to stop and evaluate that transformed term as-is.
turtle_poly <- lm(Clutch ~ Length + I(Length^2), data = turtles)
Now, if we look at the results…
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(broom)
Anova(turtle_poly)
## Anova Table (Type II tests)
##
## Response: Clutch
## Sum Sq Df F value Pr(>F)
## Length 79.879 1 11.201 0.004415 **
## I(Length^2) 79.178 1 11.102 0.004550 **
## Residuals 106.974 15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(turtle_poly)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -900. 270. -3.33 0.00457
## 2 Length 5.86 1.75 3.35 0.00441
## 3 I(Length^2) -0.00942 0.00283 -3.33 0.00455
We can see that the squared term is included in both sets of results. Note that it will naturally be incorporated into predict() as well. To put our fit model into ggplot, we need to do a little work with stat_smooth().
turtle_plot +
stat_smooth(method = "lm", formula = y ~ x + I(x^2))
Note, collinearity between a predictor and its square can at times be problematic. To fit centered polynomial terms without having to do the centering manually, try
cent_mod <- lm(Clutch ~ poly(Length, 2), data = turtles)
tidy(cent_mod)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 8.06 0.629 12.8 0.00000000178
## 2 poly(Length, 2)1 1.67 2.67 0.626 0.541
## 3 poly(Length, 2)2 -8.90 2.67 -3.33 0.00455
While the estimates are different, that's because poly() builds orthogonal (centered and rescaled) polynomial terms. So, don't forget to back-transform if you use this formulation.
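Alternatively - a small sketch - poly() takes a raw = TRUE argument that skips the orthogonalization, reproducing the I() fit from above:
# raw = TRUE gives uncentered polynomial terms,
# so the coefficients match the Length + I(Length^2) model
raw_mod <- lm(Clutch ~ poly(Length, 2, raw = TRUE), data = turtles)
coef(raw_mod)
coef(turtle_poly)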
You do not always want to fit a model that can easily be turned into a series of additive terms. For example, consider the power model:
\[y_i = ax^b + \epsilon_i\]
Sure, you can fit a log-log relationship, but then your error term is no longer additive. We encountered a dataset well suited to this model before - the zoo mortality dataset, where animals with larger home ranges experience higher mortality.
zoo <- read_csv("./data/17q02ZooMortality Clubb and Mason 2003 replica.csv")
zoo_plot <- ggplot(zoo,
aes(x = homerange, y = mortality)) +
geom_point()
zoo_plot
Now, this would do well in a power relationship with an additive error. To achieve this, we need to minimize the sum of squares numerically. R provides a nice function for that, nls(), which allows you to define a model with coefficients, but also requires you to provide some reasonable starting values. Not a bad tradeoff. (Although remember to always test alternate start values!)
zoo_nls <- nls(mortality ~ a*homerange^b,
data = zoo,
start = list(a = 1, b = 1))
While F tests won't work for this model, we can examine the coefficients.
tidy(zoo_nls)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 a 16.1 2.66 6.04 0.0000104
## 2 b 0.420 0.0867 4.84 0.000131
Easy peasy! Even better, if we want to plot the result - and we're not interested in plotting confidence intervals - that's straightforward as well using stat_smooth(). Note, though, that we have to set se = FALSE, as otherwise it borks on us when it tries to get CIs. For those, we'd have to use predict().
zoo_plot +
stat_smooth(method = "nls",
formula = y ~ a*x^b,
method.args = list(start = list(a = 1, b = 1)),
se = FALSE)
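If we did want to draw the curve by hand - say, as a first step toward intervals - a minimal sketch using predict() over a grid of home range values (note that predict() for an nls fit returns fitted values, not standard errors):
# build a fine grid of home range values and get the fitted curve
zoo_fit <- tibble(homerange = seq(min(zoo$homerange),
                                  max(zoo$homerange),
                                  length.out = 200))
zoo_fit$mortality <- predict(zoo_nls, newdata = zoo_fit)

# add the hand-made fit line to our scatterplot
zoo_plot +
  geom_line(data = zoo_fit, color = "blue")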
Unfortunately, we don’t have an automatic \(R^2\), but we can magic that up using our classic formula.
1 - var(fitted(zoo_nls))/var(zoo$mortality)
## [1] 0.4732568
nls() can take a very wide variety of functional forms. Moreover, if you do need to move to likelihood, you can code functions for models and use bbmle.
library(bbmle)
## Loading required package: stats4
##
## Attaching package: 'bbmle'
## The following object is masked from 'package:dplyr':
##
## slice
zoo_pow <- function(a, b, sigma){
  # the deterministic power relationship
  yhat <- a * homerange ^ b
  # sigma <- sd(mortality - yhat)
  # negative log likelihood, assuming normal error around the power curve
  -sum(dnorm(mortality, yhat, sigma, log = TRUE))
}
zoo_mle <- mle2(zoo_pow,
data = zoo,
start = list(a = 30, b = 0.5, sigma = 10),
method="Nelder-Mead")
#OR
zoo_mle2 <- mle2(mortality ~ dnorm(a * homerange ^ b, sigma),
data = zoo,
start = list(a = 30, b = 0.5, sigma = 10),
method="Nelder-Mead")
And then proceed accordingly.
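For example - a sketch of the usual mle2 follow-ups - we can look at coefficients and profile confidence intervals:
summary(zoo_mle)
confint(zoo_mle)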
So what about non-normal data? Let's start with some count data. If we look at the data on train collisions with motor vehicles from Agresti (1996, page 83), we see an interesting pattern.
train <- read_csv("./data/trains_agresti.csv")
## Parsed with column specification:
## cols(
## Years_Since_1975 = col_integer(),
## collisions = col_integer(),
## km_train_travel = col_integer(),
## id = col_integer()
## )
train_plot <- ggplot(train,
aes(x = km_train_travel,
y = collisions)) +
geom_point()
train_plot
Just looking at this, we can see that the variance differs with the predictor. Also, we are dealing with count data here - which has its own distribution, the Poisson. Still, what if we had soldiered on with an lm?
train_lm <- lm(collisions ~ km_train_travel, data = train)
plot(train_lm, which = 1)
plot(train_lm, which = 2)
shapiro.test(residuals(train_lm))
##
## Shapiro-Wilk normality test
##
## data: residuals(train_lm)
## W = 0.8104, p-value = 0.0001291
Not good. Weird variance in the fitted-residual relationship. A bad qq plot. And a failing Shapiro-Wilk test. Part of this is because the response variable here is not normal in its residuals. It's Poisson, as it's count data.
A GLM with a Poisson error typically has a log link, although it can also have an identity link. To remind you, a Poisson distribution is one for discrete data whose variance scales with its mean. A single parameter, \(\lambda\), encompasses both. So, for example,
pois_tib <- tibble(x = rep(0:40,2),
lambda = c(rep(5,41), rep(20, 41)),
dens = dpois(x, lambda = lambda))
ggplot(pois_tib, aes(x = x, y = dens,
fill = factor(lambda))) +
geom_col(position = position_dodge())
shows two Poisson distributions, one with a mean of 5, and one with a mean of 20. Note how their variances change.
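As a quick check of that mean-variance link - a sketch, with values approximate since the draws are random:
set.seed(607)
draws <- rpois(1e4, lambda = 20)
mean(draws) # ~ 20
var(draws)  # also ~ 20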
A log link (e.g., \(\hat{y} = e^{a + bx}\)) is a natural fit, as it can produce no values < 0. An identity (i.e., linear) link is appropriate in some situations, and still other link functions can work. See ?family for more. To fit a GLM we merely use glm(), but specify both the correct error family and link function. We can leave the link out if we're using the canonical link, but I prefer using it so that future me remembers what I was doing. So, in this case:
train_glm <- glm(collisions ~ km_train_travel,
family = poisson(link = "log"),
data = train)
We can then evaluate the profile likelihoods as usual to make sure they are either straight lines (default functions) or nice and quadratic.
plot(profile(train_glm))
But…whither residuals? Given that the residuals are not normal, a qqnorm plot hardly makes sense. And a residual-fitted relationship is still likely to look odd.
One way to evaluate GLMs, and many other model forms, is to look at their quantile residuals. As we discussed in lecture, we essentially use simulation to get an empirical distribution for each y value. We then ask what the quantile of each observed y value is. For any model - regardless of complexity - those quantiles should be randomly distributed, i.e., follow a uniform distribution. We can therefore look at a qqunif plot, if you will. This is a lot of work - simulating, making empirical distributions, testing to see if we fall on the qqunif line. Fortunately, there is Florian Hartig's excellent DHARMa package, which has a beautiful and clear vignette.
So, to start with, let's use DHARMa to make some simulated residuals.
library(DHARMa)
res <- simulateResiduals(train_glm)
We can then plot these (and get a nonparametric test of fit).
plotQQunif(res)
Wonderful - a good fit.
We also can look at the prediction vs. quantile residual plot. Here, we should see that different strata (different quantiles) all form nice straight parallel lines. If they don't, we might have a case of overdispersion - the variance does not scale with the mean in the way described by the distribution. More on that soon.
plotResiduals(res)
This isn't amazing, but, given the good qqunif fit, it's not terrible. When there is true overdispersion, all three quantile lines are badly curved. It's pretty shocking.
First, let’s hit this with our LR Test.
Anova(train_glm)
## Analysis of Deviance Table (Type II tests)
##
## Response: collisions
## LR Chisq Df Pr(>Chisq)
## km_train_travel 3.6545 1 0.05592 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looks good! There is a relationship suggested at the 0.056 level. We can also look at model results.
summary(train_glm)
##
## Call:
## glm(formula = collisions ~ km_train_travel, family = poisson(link = "log"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6727 -1.1972 -0.1673 0.3682 3.1449
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.524000 1.115971 3.158 0.00159 **
## km_train_travel -0.004780 0.002565 -1.863 0.06245 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 43.057 on 28 degrees of freedom
## Residual deviance: 39.402 on 27 degrees of freedom
## AIC: 135.07
##
## Number of Fisher Scoring iterations: 5
Note we see standard glm outputs here. We also have a dispersion parameter, which describes the relationship between the mean and the variance. For the Poisson, it's fixed at 1.
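As a rough check on that assumption - a sketch - the sum of squared Pearson residuals over the residual degrees of freedom should come out near 1 for a well-behaved Poisson fit:
sum(residuals(train_glm, type = "pearson")^2) / df.residual(train_glm)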
Last, we can plot. Because we have a single predictor, we don't need to get fancy - stat_smooth can accommodate glms with one predictor.
train_plot +
stat_smooth(method = glm,
method.args = list(family = poisson(link = "log")))
Often our data is not discrete, though. For example, we might have continuous data with a terrible QQ plot due to overdispersion. Worse, the distribution might be non-normal. Consider the clam data.
clams <- read_delim("./data/clams.txt", delim = "\t") %>%
mutate(MONTH = factor(MONTH))
## Parsed with column specification:
## cols(
## MONTH = col_double(),
## LENGTH = col_double(),
## AFD = col_double(),
## LNLENGTH = col_double(),
## LNAFD = col_double()
## )
How does AFD (ash free dry mass) relate to month and length? It’s clear there’s something nonlinear here.
clam_plot <- ggplot(clams,
aes(x = LENGTH, y = AFD, color = MONTH)) +
geom_point()
clam_plot
But a qq plot of log-transformed AFD is still not good - nor is a residual-fitted plot. One reason the relationship might be non-normal is that we're talking about growth - many multiplicative events that combine together. Gamma distributions are more typically used for data about the time to some number of events - they're excellent for rates, for example - but they can also be used here. Gamma glms take the inverse function (e.g., \(\hat{Y} = 1/(a + bx)\)) as their canonical link, but they often work well with log links as well.
clam_gamma <- glm(AFD ~ LENGTH + MONTH,
family = Gamma(link = "log"),
data = clams)
Does it blend?
clam_res <- simulateResiduals(clam_gamma)
## Model family was recoginzed or set as continous, but duplicate values were detected in the simulation - changing to integer residuals (see help for details)
plotQQunif(clam_res)
Looks good! And
plotResiduals(clam_res)
OK, maybe not as great. But largely this is due to the sparsity of high values, so, we’re OK.
We could use predict to plot, but I’m going to be lazy for a moment, and just plot each month separately here.
clam_plot +
stat_smooth(method = glm,
method.args = list(family = Gamma(link = "log"))) +
facet_wrap(~MONTH)
We can also look at other properties.
Anova(clam_gamma)
## Analysis of Deviance Table (Type II tests)
##
## Response: AFD
## LR Chisq Df Pr(>Chisq)
## LENGTH 5075.4 1 < 2.2e-16 ***
## MONTH 142.0 5 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#coefficients
tidy(clam_gamma)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5.77 0.0563 -102. 1.55e-284
## 2 LENGTH 0.165 0.00243 68.1 6.90e-219
## 3 MONTH3 -0.469 0.0608 -7.71 1.04e- 13
## 4 MONTH4 -0.289 0.0303 -9.54 1.52e- 19
## 5 MONTH9 -0.387 0.0597 -6.48 2.76e- 10
## 6 MONTH11 0.0523 0.0476 1.10 2.72e- 1
## 7 MONTH12 -0.0335 0.0319 -1.05 2.94e- 1
#more info
summary(clam_gamma)
##
## Call:
## glm(formula = AFD ~ LENGTH + MONTH, family = Gamma(link = "log"),
## data = clams)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.96706 -0.10582 0.01562 0.11773 0.73963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.765076 0.056279 -102.437 < 2e-16 ***
## LENGTH 0.165464 0.002431 68.066 < 2e-16 ***
## MONTH3 -0.468798 0.060795 -7.711 1.04e-13 ***
## MONTH4 -0.289252 0.030308 -9.544 < 2e-16 ***
## MONTH9 -0.386962 0.059712 -6.480 2.76e-10 ***
## MONTH11 0.052287 0.047573 1.099 0.272
## MONTH12 -0.033517 0.031886 -1.051 0.294
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.04105819)
##
## Null deviance: 386.653 on 397 degrees of freedom
## Residual deviance: 18.885 on 391 degrees of freedom
## AIC: -2420.8
##
## Number of Fisher Scoring iterations: 6
Oh, hey, what's that dispersion parameter? Well, a Gamma usually takes two parameters - a shape and a rate. We can reparameterize a Gamma so that its mean = shape/rate. In this case, we parameterize the Gamma with that mean and the shape. The dispersion parameter is then 1/shape.
But, to make this more understandable, the variance of a Gamma expands proportionally to the mean squared. Typically it's shown as \(Var[Y] = \phi \mu^2\), where \(\phi\) is the dispersion parameter. The larger the dispersion parameter, the faster the variance expands.
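To see where that estimate comes from - a sketch - the same Pearson statistic trick applies; this should be close to the dispersion reported by summary(clam_gamma) above (~0.041):
sum(residuals(clam_gamma, type = "pearson")^2) / df.residual(clam_gamma)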
One question you may well have is - why Gamma and not normal here? We can fit a glm with a normal error and log link.
clam_glm_norm <- glm(AFD ~ LENGTH + MONTH,
family = gaussian(link = "log"),
data = clams)
One way to tell is to look for overdispersion. The small Gamma dispersion parameter suggests that, well, maybe that error is not expanding that quickly. So, let’s look.
norm_res <- simulateResiduals(clam_glm_norm)
## Model family was recoginzed or set as continous, but duplicate values were detected in the simulation - changing to integer residuals (see help for details)
plotQQunif(norm_res)
plotResiduals(norm_res)
We can see that the QQ plot is great, actually. And the predicted-residual plot is not terrible (particularly compared to the above). This is some solid evidence that perhaps a normal error and log link is all that is required here.
Open up monarch_gardens.csv, which describes the number of monarch butterflies in different types of gardens, and analyze it with the appropriate family and link. Note, this is an ANOVA-style design, but with a glm. You can use emmeans and the link to draw conclusions. What does this data show you?
Take a look at a data set we have long used before, 16e2InbreedingWolves.csv, describing how inbreeding coefficients influence the number of wolves born. This is Poisson count data. Which link function is better? Use AIC to compare. Do you meet assumptions?
Let’s take a look at our mouse infection by cryptosporidium example. Note that the data is bounded at 0 and 1.
mouse <- read_csv("./data/cryptoDATA.csv") %>%
mutate(Porportion = Y/N)
mouse_plot <- ggplot(mouse,
aes(x = Dose, y = Porportion)) +
geom_point()
mouse_plot
That's because while N is our total number of mice per sample, we cannot have more than N infections! In essence, each mouse is like a coin flip: did it get infected or not? And when we have many coin flips - Bernoulli trials - that leads to a binomial distribution.
A binomial distribution has two parameters: the probability of success and the number of coin flips. The resulting number of successes is bounded between 0 and the number of trials - so, as a proportion, between 0 and 1. So, consider 15 coin flips with different probabilities.
bin_tibble <- tibble(outcome = rep(0:15, 2),
                     probability = c(rep(0.3, 16), rep(0.8, 16)),
                     dens = dbinom(outcome, size = 15, prob = probability))
ggplot(bin_tibble,
aes(x = outcome, y = dens, fill = factor(probability))) +
geom_col(position = position_dodge())
We could just as easily have rescaled the x-axis to between 0 and 1 for proportion.
Or, consider the same probability, but a different number of coin flips.
bin_tibble <- tibble(outcome = rep(0:15, 2),
                     size = c(rep(15, 16), rep(30, 16)),
                     dens = dbinom(outcome, size = size, prob = 0.4))
ggplot(bin_tibble,
aes(x = outcome, y = dens, fill = factor(size))) +
geom_col(position = position_dodge())
You can see both parameters matter for the shape of the distribution.
With binomial logistic regression, we are essentially estimating the probability of getting heads. We then supply - not estimate - the number of trials as a weights argument. The canonical link - the logit link - is perfect here, as it describes a logistic curve that saturates at both 0 and 1. There are other links we can use. If none of our values get to 0 or 1, linear might even be appropriate. Or log. The probit link, for example, describes the saturating shape of the cumulative distribution function for a normal distribution, and has real meaning with respect to survivorship - although it approaches 0 and 1 very quickly. There are a wide variety of others - see here (page 32) or here for more info. Also, try saying “Scobit” without feeling silly.
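To make that logit-probit comparison concrete, here's a small sketch plotting the two inverse link functions side by side:
# inverse logit (plogis) and inverse probit (pnorm) over the same range
link_tib <- tibble(x = seq(-4, 4, length.out = 200),
                   logit = plogis(x),
                   probit = pnorm(x))

ggplot(link_tib, aes(x = x)) +
  geom_line(aes(y = logit, color = "logit")) +
  geom_line(aes(y = probit, color = "probit"))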
We can parameterize binomial logistic regressions one of two ways in R - both are equivalent, as they expand the numbers out to number of successes and total number of trials. The first is a bit cumbersome, as you have to cbind() the number of successes and failures.
mouse_glm_cbind <- glm(cbind(Y, N - Y) ~ Dose,
family = binomial(link = "logit"),
data = mouse)
The second formulation uses the number of trials as weights.
mouse_glm <- glm(Porportion ~ Dose,
weights = N,
family = binomial(link = "logit"),
data = mouse)
These two models are identical.
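A quick sanity check - a sketch - shows the coefficients come out the same:
coef(mouse_glm_cbind)
coef(mouse_glm)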
From this point forward, the workflow is as usual - assumptions, analysis, and visualization.
res_bin <- simulateResiduals(mouse_glm)
plotQQunif(res_bin)
plotResiduals(res_bin)
Looks good!
Anova(mouse_glm)
## Analysis of Deviance Table (Type II tests)
##
## Response: Porportion
## LR Chisq Df Pr(>Chisq)
## Dose 233.84 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dose matters!
summary(mouse_glm)
##
## Call:
## glm(formula = Porportion ~ Dose, family = binomial(link = "logit"),
## data = mouse, weights = N)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9532 -1.2442 0.2327 1.5531 3.6013
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.407769 0.148479 -9.481 <2e-16 ***
## Dose 0.013468 0.001046 12.871 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.34 on 67 degrees of freedom
## Residual deviance: 200.51 on 66 degrees of freedom
## AIC: 327.03
##
## Number of Fisher Scoring iterations: 4
Note that the dispersion parameter is 1, just like the Poisson.
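To visualize the fit - a sketch - stat_smooth() can handle a binomial glm as well, as long as we map N to the weight aesthetic so ggplot knows the number of trials:
mouse_plot +
  stat_smooth(aes(weight = N),
              method = "glm",
              method.args = list(family = binomial(link = "logit")))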
Using passing.csv, how does Grade influence the probability of passing?
Using UCBAdmissions.csv, how does gender and department affect the probability of being admitted? You might have to do some data manipulation first. Then, you should be able to analyze this using all of the tools you have developed for analyzing categorical data.
What if we massively fail our test of overdispersion? Consider this data on seal haulouts.
seals <- read_delim("./data/SSEAK98_FINAL.txt", delim = "\t")
There are a lot of predictors here, but let's just look at tide height (TideHtMeters).
seal_plot <- ggplot(seals,
aes(x = TideHtMeters, y = Number )) +
geom_point()
seal_plot
seal_pois <- glm(Number ~ TideHtMeters,
family = poisson(link = "log"),
data = seals)
seal_pois_res <- simulateResiduals(seal_pois)
plot(seal_pois_res)
A first guess might be that the variance and mean, while scaling linearly, aren't linked by a factor of 1. Maybe the variance increases at twice the rate of the mean! Or something else. This type of model is called a quasipoisson model. It's a bit of a hack, but, in essence, you fit a poisson model and then estimate a scaling coefficient describing how much faster the variance increases than the mean, so that your intervals actually cover what they should.
seal_qpois <- glm(Number ~ TideHtMeters,
family = quasipoisson(link = "log"),
data = seals)
Now, we don’t have good diagnostics, as this isn’t a formal (or real) distribution. But, we can look at what that dispersion parameter is.
summary(seal_qpois)
##
## Call:
## glm(formula = Number ~ TideHtMeters, family = quasipoisson(link = "log"),
## data = seals)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -12.012 -8.991 -4.196 2.110 61.868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.24790 0.04077 104.200 <2e-16 ***
## TideHtMeters 0.01105 0.05408 0.204 0.838
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 149.6154)
##
## Null deviance: 162030 on 1609 degrees of freedom
## Residual deviance: 162024 on 1608 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
We can see it's nearly 150! That's huge! Note also that the coefficient estimates are identical to the poisson's - only the standard errors (and hence p-values) are inflated to account for that dispersion.
Incidentally, if you can write a function for it and use bbmle in conjunction with rethinking, a more formal version of the quasipoisson that doesn't involve this quasi- mishegas is the Gamma Poisson. See rethinking::dgampois, where you explicitly parameterize the scaling factor.
library(rethinking)
## Loading required package: rstan
## Loading required package: StanHeaders
## rstan (Version 2.18.1, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
##
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
##
## extract
## Loading required package: parallel
## rethinking (Version 1.59)
##
## Attaching package: 'rethinking'
## The following object is masked from 'package:car':
##
## logit
## The following object is masked from 'package:purrr':
##
## map
library(bbmle)
seal_fun <- function(int, slope, scale){
  # log link: exponentiate the linear predictor to get the mean
  mu <- exp(int + slope*TideHtMeters)
  # negative log likelihood for a gamma-poisson with mean mu and scale
  -sum(dgampois(Number, mu, scale, log = TRUE))
}
seal_bbmle <- mle2(seal_fun,
data = seals,
start = list(int = 5, slope = 1, scale = 150))
summary(seal_bbmle)
## Maximum likelihood estimation
##
## Call:
## mle2(minuslogl = seal_fun, start = list(int = 5, slope = 1, scale = 150),
## data = seals)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## int 4.321545 0.037468 115.341 < 2.2e-16 ***
## slope -0.246791 0.039562 -6.238 4.431e-10 ***
## scale 147.980728 7.319003 20.219 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 16276.06
Yes, there is also a quasibinomial. If you want to be more formal about it and use bbmle, an overdispersed binomial is also known as a beta binomial distribution. I'm not going to get into it here, but see again what we can do with some rethinking + bbmle magic, using theta as our scaling parameter.
crypto_quasi <- glm(Porportion ~ Dose,
                    weights = N,
                    family = quasibinomial(),
                    data = mouse)
#OR
crypto_fun <- function(int, slope, theta){
  # identity link on the probability - assumes it stays within (0, 1)
  prob <- int + slope * Dose
  # negative log likelihood for a beta-binomial with dispersion theta
  -sum(dbetabinom(Y, N, prob, theta, log = TRUE))
}
crypto_bbmle <- mle2(crypto_fun,
data = mouse,
start = list(int = -2, slope = 0.01, theta = 3),
optimizer = "nlminb",
lower = list(theta = 0.1))
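As a quick check - a sketch - we can pull the estimated dispersion out of the quasibinomial fit:
summary(crypto_quasi)$dispersion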
That scaling coefficient for the seals was large. Maybe instead of a linear scaling, it's a squared mean-variance relationship. This corresponds to the negative binomial - a distribution of the number of successes before a specified number of failures. Like the gamma, there is a squared relationship between mean and variance, making it flexible for overdispersion. It's not part of glm(), but there's a function for it - glm.nb() in the MASS package. It doesn't require a family; instead, we give it a link function only.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
seal_nb <- glm.nb(Number ~ TideHtMeters,
link = "log",
data = seals)
How does this stack up on the diagnostic front?
seal_nb_res <- simulateResiduals(seal_nb)
plot(seal_nb_res)
Bingo! A beautiful fit! We can now go forward with the coefficients produced by this model with confidence.
So…which to use? Often, by looking at the data, you can intuit whether the relationship between the fitted values and the variance of the residuals is linear or nonlinear. But sometimes you just need to test it out. In this case, we can take the results from our poisson fit, bin them into “residual bins”, and ask if there is a linear or a squared relationship between the variance and the bins. Let's start by getting fitted and residual values.
res_test <- tibble(fit = fitted(seal_pois),
res = residuals(seal_pois))
Now, we have to make bins. Depending on the number of data points, you can choose a different number of bins. We have 1610 points here, so I'll go with ~10 bins of 161 points. Don't forget to get the mean of the fitted values in each bin as well.
res_summary <- res_test %>%
arrange(fit) %>%
mutate(bin = sort(rep(1:10, 161))) %>%
group_by(bin) %>%
summarize(fit_bin = mean(fit),
var_res = var(res))
We can then visualize this to look for a linear or nonlinear relationship.
ggplot(res_summary,
aes(x = fit_bin,
y = var_res)) +
geom_point() +
stat_smooth(method = "lm", color = "red", se = FALSE)+
stat_smooth(method = "lm", formula = y ~ poly(x,2),
color = "blue", se = FALSE)
This looks roughly linear, so the quasipoisson would seem to be the way to go!
Load up the kelp_holdfast.csv data and explore the predictors of the number of kelp stipes. Should you use a negative binomial or quasipoisson error with the model you've chosen? What does your chosen model say?
Try out other predictors in the seal data. Does including more predictors change which error distribution you should use? Sometimes, dispersion is caused by unaccounted-for and correlated predictors!
Finally, we often have data that is bounded but is not drawn from a binomial - i.e., there are no independent “coin flips”, if you will. This data can easily be accommodated by the beta distribution. The Beta is often used to describe a distribution of probabilities (an overdispersed binomial glm is actually beta binomial - but more on that another time). But it's also useful for things like proportional data, percent cover, and more. All of these can be logit transformed, but a beta provides a more natural fit. We use the package betareg for beta regression.
for beta regression.
Consider this analysis of the proportion of sodium intake from different supplements when you exercise with different gym teachers. 2300 mg is your RDA, so we're standardizing to that.
sodium <- read_csv("./data/sodium_intake.csv")
## Parsed with column specification:
## cols(
## Instructor = col_character(),
## Supplement = col_character(),
## Sodium = col_integer(),
## Porportion_Sodium_Intake = col_double()
## )
ggplot(sodium,
aes(x = Supplement, y = Porportion_Sodium_Intake,
fill = Instructor)) +
geom_boxplot()
Now, let’s look at this using beta regression.
library(betareg)
sodium_beta <- betareg(Porportion_Sodium_Intake ~ Supplement + Instructor,
data = sodium)
Beta regression isn't standard in DHARMa yet, so we're flying a bit more by the seat of our pants. betareg does, however, have a plot method for the linearized fitted versus residual values.
plot(sodium_beta, which = 4)
We can then go forward with all of our usual tests and visualizations. For example -
library(emmeans)
cld(emmeans(sodium_beta, ~Supplement))
## Supplement emmean SE df asymp.LCL asymp.UCL .group
## C 0.4992099 0.01589676 Inf 0.4680528 0.5303669 1
## A 0.5014942 0.01589670 Inf 0.4703373 0.5326512 1
## D 0.5501519 0.01581484 Inf 0.5191554 0.5811484 12
## B 0.5705132 0.01573450 Inf 0.5396741 0.6013522 2
##
## Results are averaged over the levels of: Instructor
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 4 estimates
## significance level used: alpha = 0.05
cld(emmeans(sodium_beta, ~Instructor))
## Instructor emmean SE df asymp.LCL asymp.UCL .group
## Melissa Robins 0.4886649 0.01376260 Inf 0.4616907 0.5156391 1
## Coach McGuirk 0.5417052 0.01371747 Inf 0.5148194 0.5685909 2
## Brendon Small 0.5606568 0.01366287 Inf 0.5338781 0.5874355 2
##
## Results are averaged over the levels of: Supplement
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 3 estimates
## significance level used: alpha = 0.05
Or, if we had a continuous covariate, we could get fitted values and errors and put them into a plot.
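For example - a minimal sketch, assuming we instead fit the (continuous) Sodium column as the predictor; since the proportion here is derived from Sodium, this is purely illustrative:
# a hypothetical continuous-predictor beta regression
sodium_cont <- betareg(Porportion_Sodium_Intake ~ Sodium, data = sodium)

# fitted values over a grid, via predict()
sodium_pred <- tibble(Sodium = seq(min(sodium$Sodium), max(sodium$Sodium),
                                   length.out = 200))
sodium_pred$fit <- predict(sodium_cont, newdata = sodium_pred,
                           type = "response")

ggplot(sodium, aes(x = Sodium, y = Porportion_Sodium_Intake)) +
  geom_point() +
  geom_line(data = sodium_pred, aes(y = fit), color = "blue")
For error bands, we'd lean on emmeans or a bootstrap.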
Try this yourself with dock_pred_div_byrnes.xls. You might have to add 0.01 to the 0 values to make it work.