1. Nonlinear Least Squares

1.1 Nonlinear Terms in Linear Regression

Fitting a squared or other simple nonlinear term in a model is fairly straightforward in R - you include the transformed term in the model formula itself. So, let’s take a look at how it works.

library(tidyverse)
library(readr)
library(readxl)
theme_set(theme_classic(base_size = 17))

turtles <- read_excel("./data/Ashton_2007.xlsx")

turtle_plot <- ggplot(turtles,
                      aes(x = Length, y = Clutch)) +
  geom_point()

turtle_plot

This is clearly nonlinear. In fact, if we fit a linear model, the assumption plots look wonky - particularly the fitted-versus-residual plot, which shows a leftover nonlinear relationship.

plot(lm(Clutch ~ Length , data = turtles), which = 1)

To fit a squared polynomial to this data, you need to do a bit more than just add the term to the formula. We wrap it in the function I() inside the model formula, which tells R to evaluate the transformed term as-is rather than interpret the ^ as formula syntax.

turtle_poly <- lm(Clutch ~ Length + I(Length^2), data = turtles)

Now, if we look at the results…

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(broom)

Anova(turtle_poly)
## Anova Table (Type II tests)
## 
## Response: Clutch
##              Sum Sq Df F value   Pr(>F)   
## Length       79.879  1  11.201 0.004415 **
## I(Length^2)  79.178  1  11.102 0.004550 **
## Residuals   106.974 15                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(turtle_poly)
## # A tibble: 3 x 5
##   term          estimate std.error statistic p.value
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -900.      270.          -3.33 0.00457
## 2 Length         5.86      1.75         3.35 0.00441
## 3 I(Length^2)   -0.00942   0.00283     -3.33 0.00455

We can see that the squared term is included in both sets of results. Note that it will naturally be incorporated into predict() as well. To put our fit model into ggplot, we need to do a little work with stat_smooth().

turtle_plot +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2))
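
Alternatively, since predict() handles the I(Length^2) term automatically, you can build the curve yourself. A minimal sketch - the names new_lengths and pred_clutch are just illustrative:

new_lengths <- tibble(Length = seq(min(turtles$Length),
                                   max(turtles$Length),
                                   length.out = 100))

new_lengths$pred_clutch <- predict(turtle_poly, newdata = new_lengths)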

Note, collinearity between a predictor and its square can at times be problematic. To fit centered (orthogonal) polynomial terms without having to do the centering manually, try

cent_mod <- lm(Clutch ~ poly(Length, 2), data = turtles)

tidy(cent_mod)
## # A tibble: 3 x 5
##   term             estimate std.error statistic       p.value
##   <chr>               <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)          8.06     0.629    12.8   0.00000000178
## 2 poly(Length, 2)1     1.67     2.67      0.626 0.541        
## 3 poly(Length, 2)2    -8.90     2.67     -3.33  0.00455

The estimates are different because poly() uses orthogonal polynomials - the terms are centered and rescaled. So, don’t forget to back-transform (or simply use predict()) if you use this formulation.
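
As a quick check, the two parameterizations describe the same curve - their fitted values are identical even though the coefficients differ (output not shown):

head(predict(turtle_poly))
head(predict(cent_mod))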

1.2 True Nonlinear Least Squares Regression

You will not always want to fit a model that can easily be decomposed into a series of additive terms. For example, consider the power model:

\[y_i = ax_i^b + \epsilon_i\]

Sure, you could fit a log-log relationship, but then your error term is no longer additive. We encountered a dataset that suits this model well - the zoo mortality dataset, where animals with larger home ranges experience higher mortality.

zoo <- read_csv("./data/17q02ZooMortality Clubb and Mason 2003 replica.csv")

zoo_plot <- ggplot(zoo,
                   aes(x = homerange, y = mortality)) +
  geom_point()

zoo_plot

Now, this would do well with a power relationship and an additive error. To fit it, we need to minimize the sum of squares numerically. R provides a nice function for that, nls(), which lets you define a model in terms of its coefficients, but then also requires you to provide some reasonable starting values. Not a bad tradeoff. (Although remember to always test alternate start values - see the quick check below!)

zoo_nls <- nls(mortality ~ a*homerange^b,
               data = zoo,
               start = list(a = 1, b = 1))
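
As a sanity check on those starting values, here is a sketch refitting from a different start - assuming this alternate start converges, it should land on essentially the same coefficients:

zoo_nls_alt <- nls(mortality ~ a*homerange^b,
                   data = zoo,
                   start = list(a = 10, b = 0.5))

coef(zoo_nls)
coef(zoo_nls_alt)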

While F tests won’t work for this model, we can examine the coefficients.

tidy(zoo_nls)
## # A tibble: 2 x 5
##   term  estimate std.error statistic   p.value
##   <chr>    <dbl>     <dbl>     <dbl>     <dbl>
## 1 a       16.1      2.66        6.04 0.0000104
## 2 b        0.420    0.0867      4.84 0.000131

Easy peasy! Even better, if we want to plot the result - and we’re not interested in confidence intervals - that’s straightforward as well using stat_smooth(). Note, though, that we have to set se = FALSE, as otherwise it errors out trying to compute CIs. For those, we’d have to generate predictions ourselves with predict().

zoo_plot +
  stat_smooth(method = "nls",
              formula = y ~ a*x^b,
              method.args = list(start = list(a = 1, b = 1)),
              se = FALSE)
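
If you do want the curve by hand, a hedged sketch using predict() - the data frame name zoo_fit is just illustrative, and note that predict() for nls objects returns fitted values but not standard errors, so intervals would take extra work (e.g., bootstrapping):

zoo_fit <- tibble(homerange = seq(min(zoo$homerange),
                                  max(zoo$homerange),
                                  length.out = 200))

zoo_fit$mortality <- predict(zoo_nls, newdata = zoo_fit)

zoo_plot +
  geom_line(data = zoo_fit, color = "blue")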

Unfortunately, we don’t have an automatic \(R^2\), but we can magic that up using our classic formula.

1 - var(fitted(zoo_nls))/var(zoo$mortality)
## [1] 0.4732568

nls() can take a very wide variety of functional forms. Moreover, if you need to move to likelihood, you can code the model as a function and use bbmle.

library(bbmle)
## Loading required package: stats4
## 
## Attaching package: 'bbmle'
## The following object is masked from 'package:dplyr':
## 
##     slice
zoo_pow <- function(a, b, sigma){
  # expected mortality under the power model
  yhat <- a * homerange ^ b
 # sigma <- sd(mortality - yhat)
  # negative log-likelihood assuming normally distributed error
  -sum(dnorm(mortality, yhat, sigma, log = TRUE))
}

zoo_mle <- mle2(zoo_pow,
                data = zoo,
                start = list(a = 30, b = 0.5, sigma = 10),
                method="Nelder-Mead")


#OR
zoo_mle2 <- mle2(mortality ~ dnorm(a * homerange ^ b, sigma),
                 data = zoo,
                 start = list(a = 30, b = 0.5, sigma = 10),
                 method="Nelder-Mead")

And then proceed accordingly.
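
From there, the usual bbmle tools apply - for instance (output not shown; confint() profiles the likelihood, so it can take a moment):

summary(zoo_mle2)
confint(zoo_mle2)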

3. Logistic Regression

Let’s take a look at our mouse infection by Cryptosporidium example. Note that the data is bounded at 0 and 1.

mouse <- read_csv("./data/cryptoDATA.csv") %>%
  mutate(Porportion = Y/N)

mouse_plot <- ggplot(mouse,
       aes(x = Dose, y = Porportion)) +
  geom_point()

mouse_plot

That’s because while N is our total number of mice per sample, we cannot have more than N infections! In essence, each mouse is like a coin flip: did it get infected or not? And when we have many such coin flips - Bernoulli trials - that leads to a binomial distribution.

3.1 The Binomial Distribution

A binomial distribution has two parameters: the probability of success and the number of coin flips. The resulting counts are bounded between 0 and the number of flips (or, as proportions, between 0 and 1). So, consider 15 coin flips with different probabilities.

bin_tibble <- tibble(outcome = rep(0:15, 2),
                     probability = c(rep(0.3, 16), rep(0.8, 16)),
                     dens = dbinom(outcome, size = 15, prob = probability))

ggplot(bin_tibble,
       aes(x = outcome, y = dens, fill = factor(probability))) +
  geom_col(position = position_dodge())

We could just as easily have rescaled the x-axis to between 0 and 1 to show proportions.
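
For example, a minimal sketch of the same plot rescaled to proportions by dividing the outcome by the number of flips:

ggplot(bin_tibble,
       aes(x = outcome / 15, y = dens, fill = factor(probability))) +
  geom_col(position = position_dodge())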

Or, consider the same probability, but a different number of coin flips.

bin_tibble <- tibble(outcome = rep(0:15, 2),
                     size = c(rep(15, 16), rep(30, 16)),
                     dens = dbinom(outcome, size = size, prob = 0.4))

ggplot(bin_tibble,
       aes(x = outcome, y = dens, fill = factor(size))) +
  geom_col(position = position_dodge())

You can see both parameters matter for the shape of the distribution.

3.2 Binomial Logistic Regression

With binomial logistic regression, we are essentially estimating the probability of getting heads. We then supply - not estimate - the number of trials as a weights argument. The canonical link - the logit link - is perfect here, as it describes a logistic curve that saturates at both 0 and 1. There are other links we can use. If none of our values get close to 0 or 1, a linear or log link might even be appropriate. The probit link, for example, describes the saturating shape of the cumulative distribution function for a normal distribution, and has real meaning with respect to survivorship - although it approaches 0 and 1 very quickly. There are a wide variety of others - see here (page 32) or here for more info. Also, try saying “Scobit” without feeling silly.

We can parameterize binomial logistic regressions one of two ways in R - both are equivalent, as they expand the numbers out into successes and total trials. The first is a bit cumbersome, as you have to cbind the number of successes and failures.

mouse_glm_cbind <- glm(cbind(Y, N - Y) ~ Dose,
                 family = binomial(link = "logit"),
                 data = mouse)

The second formulation uses the number of trials as weights.

mouse_glm <- glm(Porportion ~ Dose,
                 weights = N,
                 family = binomial(link = "logit"),
                 data = mouse)

These two models are identical.
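
As a quick check, their coefficients match exactly (output not shown):

coef(mouse_glm_cbind)
coef(mouse_glm)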

From this point forward, the workflow is as usual - assumptions, analysis, and visualization.

library(DHARMa) # for simulateResiduals(), if not already loaded

res_bin <- simulateResiduals(mouse_glm)

plotQQunif(res_bin)

plotResiduals(res_bin)

Looks good!

Anova(mouse_glm)
## Analysis of Deviance Table (Type II tests)
## 
## Response: Porportion
##      LR Chisq Df Pr(>Chisq)    
## Dose   233.84  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dose matters!

summary(mouse_glm)
## 
## Call:
## glm(formula = Porportion ~ Dose, family = binomial(link = "logit"), 
##     data = mouse, weights = N)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9532  -1.2442   0.2327   1.5531   3.6013  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.407769   0.148479  -9.481   <2e-16 ***
## Dose         0.013468   0.001046  12.871   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 434.34  on 67  degrees of freedom
## Residual deviance: 200.51  on 66  degrees of freedom
## AIC: 327.03
## 
## Number of Fisher Scoring iterations: 4

Note that the dispersion parameter is taken to be 1, just like a standard Poisson.
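
For visualization, one option is to let stat_smooth() fit the same binomial GLM on the fly. This is a hedged sketch (output not shown) - note the number of trials is passed through the weight aesthetic:

mouse_plot +
  stat_smooth(aes(weight = N),
              method = "glm",
              method.args = list(family = binomial(link = "logit")))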

3.3 Examples

  1. With passing.csv, how does Grade influence the probability of passing?

  2. With UCBAdmissions.csv, how do gender and department affect the probability of being admitted? You might have to do some data manipulation first. Then, you should be able to analyze this using all of the tools you have developed for analyzing categorical data.

4. Overdispersion: Negative Binomial and Quasipoisson

What if our check for overdispersion fails massively - that is, the data are far more variable than a Poisson allows? Consider this data on seal haulouts.

seals <- read_delim("./data/SSEAK98_FINAL.txt", delim = "\t")

There are a lot of predictors here, but let’s just look at TideHtMeters.

seal_plot <- ggplot(seals,
       aes(x = TideHtMeters, y = Number )) +
  geom_point()

seal_plot

seal_pois <- glm(Number ~ TideHtMeters, 
                 family = poisson(link = "log"),
                 data = seals)

seal_pois_res <- simulateResiduals(seal_pois)
plot(seal_pois_res)

4.2 The Quasipoisson

A first guess might be that the variance and mean, while scaling linearly, aren’t linked by a factor of 1. Maybe the variance increases at twice the rate of the mean! Or something else. This type of model is called a quasipoisson model. It’s a bit of a hack: in essence, you fit a Poisson model, then estimate a scaling (dispersion) parameter from the residuals that describes how much wider the variance is than the mean, and inflate your standard errors accordingly.

seal_qpois <- glm(Number ~ TideHtMeters, 
                 family = quasipoisson(link = "log"),
                 data = seals)

Now, we don’t have good diagnostics, as this isn’t a formal (or real) distribution. But, we can look at what that dispersion parameter is.

summary(seal_qpois)
## 
## Call:
## glm(formula = Number ~ TideHtMeters, family = quasipoisson(link = "log"), 
##     data = seals)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -12.012   -8.991   -4.196    2.110   61.868  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.24790    0.04077 104.200   <2e-16 ***
## TideHtMeters  0.01105    0.05408   0.204    0.838    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 149.6154)
## 
##     Null deviance: 162030  on 1609  degrees of freedom
## Residual deviance: 162024  on 1608  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

We can see it’s nearly 150! That’s huge! Note that the coefficient estimates are identical to the Poisson fit, but the standard errors (and hence p-values) have been inflated by the dispersion.
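
You can see where those inflated standard errors come from - they are just the Poisson standard errors scaled by the square root of the dispersion parameter. A quick sketch of that check (output not shown):

# Poisson SEs inflated by sqrt(dispersion) - compare to summary(seal_qpois)
summary(seal_pois)$coefficients[, "Std. Error"] *
  sqrt(summary(seal_qpois)$dispersion)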

4.2.1 Enough of this quasi stuff

Incidentally, if you can write a function for it and use bbmle in conjunction with rethinking, a more formal version of the quasipoisson that doesn’t involve this quasi- mishegas is the gamma-Poisson. See rethinking::dgampois, where you explicitly parameterize the scaling factor.

library(rethinking)
## Loading required package: rstan
## Loading required package: StanHeaders
## rstan (Version 2.18.1, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## 
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
## 
##     extract
## Loading required package: parallel
## rethinking (Version 1.59)
## 
## Attaching package: 'rethinking'
## The following object is masked from 'package:car':
## 
##     logit
## The following object is masked from 'package:purrr':
## 
##     map
library(bbmle)

seal_fun <- function(int, slope, scale){
  # expected number of seals on the log link
  mu <- exp(int + slope*TideHtMeters)
  # negative log-likelihood under a gamma-Poisson with that scale parameter
  -sum(dgampois(Number, mu, scale, log = TRUE))
}

seal_bbmle <- mle2(seal_fun,
                   data = seals,
                   start = list(int = 5, slope = 1, scale = 150))

summary(seal_bbmle)
## Maximum likelihood estimation
## 
## Call:
## mle2(minuslogl = seal_fun, start = list(int = 5, slope = 1, scale = 150), 
##     data = seals)
## 
## Coefficients:
##         Estimate Std. Error z value     Pr(z)    
## int     4.321545   0.037468 115.341 < 2.2e-16 ***
## slope  -0.246791   0.039562  -6.238 4.431e-10 ***
## scale 147.980728   7.319003  20.219 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -2 log L: 16276.06

4.2.2 A Binomial Note

Yes, there is also a quasibinomial. If you want to be more formal about it and use bbmle, an overdispersed binomial is also known as a beta-binomial distribution. I’m not going to get into it here, but see again what we can do with some rethinking + bbmle magic, using theta as our scaling parameter.

# proportion infected, with N trials as weights, as in the binomial fit above
crypto_quasi <- glm(Porportion ~ Dose,
                    weights = N,
                    family = quasibinomial(),
                    data = mouse)

#OR
crypto_fun <- function(int, slope, theta){
  # probability of infection on the logit scale (plogis = inverse logit)
  prob <- plogis(int + slope * Dose)
  # negative log-likelihood under a beta-binomial with dispersion theta
  -sum(dbetabinom(Y, N, prob, theta, log = TRUE))
}

crypto_bbmle <- mle2(crypto_fun,
                     data = mouse,
                     start = list(int = -2, slope = 0.01, theta = 3),
                     optimizer = "nlminb",
                     lower = list(theta = 0.1))

4.3 The Negative Binomial

That scaling coefficient was large. Maybe instead of a linear scaling, it’s a squared mean-variance relationship. This corresponds to the negative binomial - a distribution of the number of successes before a specified number of failures. Like the gamma, it has a squared relationship between mean and variance, making it flexible for overdispersion. It’s not part of glm(), but there’s a function for it - glm.nb() in the MASS package. It doesn’t require a family; instead, we give it a link function only.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
seal_nb <-  glm.nb(Number ~ TideHtMeters, 
                   link = "log",
                 data = seals)

How does this stack up on the diagnostic front?

seal_nb_res <- simulateResiduals(seal_nb)
plot(seal_nb_res)

Bingo! A beautiful fit! We can now go forward with the coefficients produced by this model with confidence.
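
For example, a quick look at those coefficients with broom (output not shown):

tidy(seal_nb)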

4.4 Quasi or NB

So… which to use? Often, by looking at the data, you can intuit whether the relationship between the fitted values and the variance of the residuals is linear or nonlinear. But sometimes you just need to test it out. Here, we can take the results from our Poisson fit, bin them by fitted value, and ask whether the variance of the residuals scales linearly or quadratically across bins. Let’s start by getting fitted and residual values.

res_test <- tibble(fit = fitted(seal_pois),
                    res = residuals(seal_pois))

Now, we have to make bins. Depending on the number of data points, you can choose a different number of bins. We have 1610 points here, so I’ll go with 10 bins of 161 points. Don’t forget to get the mean of the fitted values in each bin as well.

res_summary <- res_test %>%
  arrange(fit) %>%
  mutate(bin = rep(1:10, each = 161)) %>%
  group_by(bin) %>%
  summarize(fit_bin = mean(fit),
            var_res = var(res))

We can then visualize this to look for a linear or nonlinear relationship.

ggplot(res_summary,
       aes(x = fit_bin,
           y = var_res)) + 
  geom_point() +
  stat_smooth(method = "lm", color = "red", se = FALSE)+
  stat_smooth(method = "lm", formula = y ~ poly(x,2),
              color = "blue", se = FALSE)

This looks roughly linear, and so the QP would seem to be the way to go!

4.5 Exercises

  1. Load up the kelp_holdfast.csv data and explore the predictors of number of kelp stipes. Should you use a NB or Quasipoisson error with the model you’ve chosen? What does your chosen model say?

  2. Try out other predictors in the seal data. Does including more predictors change which error distribution you should use? Sometimes, dispersion is caused by unaccounted and correlated predictors!

5. Beta Regression

Finally, we often have data that is bounded but is not drawn from a binomial - i.e., there are no independent “coin flips”, if you will. This data can easily be accommodated by the beta distribution. The beta is often used to describe a distribution of probabilities (an overdispersed binomial glm is actually beta-binomial - but more on that another time). But it’s also useful for things like proportional data, percent cover, and more. All of these can be logit transformed, but a beta provides a more natural fit. We use the package betareg for beta regression.

Consider this analysis of the proportion of sodium intake from different supplements when you exercise with different gym teachers. 2300 mg is your RDA, so we’re standardizing to that.

sodium <- read_csv("./data/sodium_intake.csv")
## Parsed with column specification:
## cols(
##   Instructor = col_character(),
##   Supplement = col_character(),
##   Sodium = col_integer(),
##   Porportion_Sodium_Intake = col_double()
## )
ggplot(sodium,
       aes(x = Supplement, y = Porportion_Sodium_Intake,
           fill = Instructor)) +
  geom_boxplot()

Now, let’s look at this using beta regression.

library(betareg)

sodium_beta <- betareg(Porportion_Sodium_Intake ~ Supplement + Instructor,
                       data = sodium)

Beta regression isn’t standard in DHARMa yet, so we’re flying a bit more by the seat of our pants. betareg does, however, have a built-in plot method for the linearized fitted versus residual values.

plot(sodium_beta, which = 4)

We can then go forward with all of our usual tests and visualizations. For example -

library(emmeans)
cld(emmeans(sodium_beta, ~Supplement))
##  Supplement    emmean         SE  df asymp.LCL asymp.UCL .group
##  C          0.4992099 0.01589676 Inf 0.4680528 0.5303669  1    
##  A          0.5014942 0.01589670 Inf 0.4703373 0.5326512  1    
##  D          0.5501519 0.01581484 Inf 0.5191554 0.5811484  12   
##  B          0.5705132 0.01573450 Inf 0.5396741 0.6013522   2   
## 
## Results are averaged over the levels of: Instructor 
## Confidence level used: 0.95 
## P value adjustment: tukey method for comparing a family of 4 estimates 
## significance level used: alpha = 0.05
cld(emmeans(sodium_beta, ~Instructor))
##  Instructor        emmean         SE  df asymp.LCL asymp.UCL .group
##  Melissa Robins 0.4886649 0.01376260 Inf 0.4616907 0.5156391  1    
##  Coach McGuirk  0.5417052 0.01371747 Inf 0.5148194 0.5685909   2   
##  Brendon Small  0.5606568 0.01366287 Inf 0.5338781 0.5874355   2   
## 
## Results are averaged over the levels of: Supplement 
## Confidence level used: 0.95 
## P value adjustment: tukey method for comparing a family of 3 estimates 
## significance level used: alpha = 0.05

Or, if we had a continuous covariate, we could get fitted values and errors and put them into a plot.
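
For example, a hedged sketch of pulling out fitted proportions on the response scale with predict() - the column name fit is just illustrative:

sodium_fit <- sodium %>%
  mutate(fit = predict(sodium_beta, type = "response"))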

5.1 Exercises

  1. Try it out with some of my old data on docks looking at how predators shape the percent cover of bare space - dock_pred_div_byrnes.xls. You might have to add 0.01 to the 0 values to make it work.