Many Types of Predictors in Linear Models

Author

Biol607

1. Multiple Linear Regression

Multiple linear regression is conceptually very similar to ANCOVA. Let’s use the keeley fire severity plant richness data to see it in action.

keeley <- read.csv("./data/Keeley_rawdata_select4.csv")

head(keeley)

  distance elev  abiotic age   hetero firesev     cover rich
1 53.40900 1225 60.67103  40 0.757065    3.50 1.0387974   51
2 37.03745   60 40.94291  25 0.491340    4.05 0.4775924   31
3 53.69565  200 50.98805  15 0.844485    2.60 0.9489357   71
4 53.69565  200 61.15633  15 0.690847    2.90 1.1949002   64
5 51.95985  970 46.66807  23 0.545628    4.30 1.2981890   68
6 51.95985  970 39.82357  24 0.652895    4.00 1.1734866   34

For our purposes, we’ll focus on fire severity and plant cover as predictors.

1.1 Visualizing

I’m not going to lie, visualizing multiple continuous variables is as much of an art as a science. One can use colors and sizes of points, or slice up the data into chunks and facet that. Here are a few examples.

library(ggplot2)

ggplot(data = keeley,
       aes(x = cover, y = rich, color=firesev)) +
  geom_point() +
  scale_color_gradient(low="yellow", high="red") +
  theme_bw() +
  facet_wrap(vars(cut_number(firesev, 4)))

We can also look at everything!

pairs(keeley)

What other plots can you make?

1.2 Fit and Evaluate Assumptions

Fitting is straightforward for an additive MLR model. It’s just a linear model!

keeley_mlr <- lm(rich ~ firesev + cover, data=keeley)

As for assumptions, we have the usual

library(performance)
check_model(keeley_mlr)

But now we also need to think about how the residuals relate to each predictor. Fortunately, there’s still residualPlots() in car.

library(car)

Loading required package: carData

residualPlots(keeley_mlr, test=FALSE)

Odd bow shape - but not too bad. Maybe there’s an interaction? Maybe we want to log transform something? Who knows!

We also want to look at multicollinearity. There are two steps for that. First, the vif

check_collinearity(keeley_mlr) |> plot()

Not bad. We might also want to look at the correlation matrix. dplyr can help us here as we want to select down to just relevant columns.

library(dplyr)

keeley |>
  dplyr::select(firesev, cover) |>
  cor()

           firesev      cover
firesev  1.0000000 -0.4371346
cover   -0.4371346  1.0000000

Also, not too bad! Well within our range of tolerances.

1.3 Evaluation

What about those coefficients?

library(broom)
tidy(keeley_mlr)

# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    53.9      7.04       7.66 2.45e-11
2 firesev        -2.53     0.993     -2.55 1.25e- 2
3 cover           9.91     5.17       1.92 5.85e- 2

And then the R²

r2(keeley_mlr)

# R2 for Linear Regression
       R2: 0.170
  adj. R2: 0.151

Not amazing fit, but, the coefficients are clearly different from 0.

1.4 Visualization

This is where things get sticky. We have two main approaches. First, visualizing with component + residual plots

crPlots(keeley_mlr, smooth=FALSE)

But the values on the y axis are….odd. We get a sense of what’s going on and the scatter after accounting for our predictor of interest, but we might want to look at, say, evaluation of a variable at the mean of the other.

For that, we have visreg

library(visreg)
par(mfrow=c(1,2))
visreg(keeley_mlr)

par(mfrow=c(1,1))

Now the axes make far more sense, and we have a sense of the relationship.

We can actually whip this up on our own using crossing, the median of each value, and predict.

k_med_firesev <- data.frame(firesev = median(keeley$firesev),
                                 cover = seq(0,1.5, length.out = 100))

k_med_firesev <- augment(keeley_mlr, newdata = k_med_firesev, 
                         interval = "confidence") 
  
  
ggplot() +
  geom_point(data=keeley, 
             mapping = aes(x=cover, y = rich)) +
  geom_line(data = k_med_firesev, 
            mapping=aes(x=cover, y=.fitted), 
            color="blue") +
  geom_ribbon(data = k_med_firesev, mapping = aes(x=cover, 
                                                  ymin = .lower,
                                                  ymax = .upper),
              fill="lightgrey", alpha=0.5)

Or, we can use modelr to explore the model and combine that exploration with the data. Let’s get the curve for cover at four levels of fire severity. We’ll use both modelr::data_grid and modelr::add_predictions for a nice easy workflow.

library(modelr)

k_firesev_explore <- data_grid(keeley,
                               cover = seq_range(cover, 100),
                               firesev = seq_range(firesev, 4)) %>%
  add_predictions(model = keeley_mlr, var = "rich")

Nice, no? Sadly, add_predictions doesn’t give us variance (but see my code in https://github.com/tidyverse/modelr/issues/38 for a possible solution if you want it). If you want those, you’ll have to go back to using augment, although you can still use it with data_grid. Actually, let’s try that.

k_firesev_explore <- data_grid(keeley,
                               cover = seq_range(cover, 100),
                               firesev = seq_range(firesev, 4)) |>
  augment(keeley_mlr, newdata = _, interval = "confidence") |>
  rename(rich = .fitted)

Huh. Not bad.

Anywho, then we can visualize with the data!

ggplot(data=keeley,
             mapping = aes(x=cover, y = rich, color = firesev)) +
  geom_point() +
  geom_line(data = k_firesev_explore, aes(group = firesev)) +
  geom_ribbon(data = k_firesev_explore, 
              aes(group = firesev, ymin = .lower, ymax = .upper),
              color = "lightgrey", alpha = 0.2) +
  scale_color_gradient(low="yellow", high="red") +
  facet_wrap(vars(cut_interval(firesev, 4))) +
  theme_bw(base_size = 14)

1.5 Examples

OK, here are two datasets to work with:

planktonSummary.csv showing plankton from Lake Biakal (thanks, Stephanie Hampton). Evluate how Chlorophyll (CHLFa) is affected by other predictors.
SwaddleWestNile2002NCEAS_shortnames.csv is about the prevalence of West Nile virus in Birds around Sacramento county in California. What predicts human WNV?

Using the following workflow, analyze these data sets.

# Load the data

# Perform a preliminary visualization. Play with this and choose two predictors

# Fit a MLR model

# Test Asssumptions and modify model if needed

# Evaluate results

# Visualize results

2. Many Categorical Varibles

We’ll work with the zooplankton depredation dataset for two-way linear model with categorical predictors. This is a blocked experiment, so, each treatment is in each block just once.

zooplankton <- read.csv("./data/18e2ZooplanktonDepredation.csv")

ggplot(data = zooplankton,
       aes(x = treatment, y = zooplankton)) +
  geom_boxplot()

ggplot(data = zooplankton,
       aes(x = block, y = zooplankton)) +
  geom_boxplot()

Warning: Continuous x aesthetic -- did you forget aes(group=...)?

Oh. That’s odd. What is up with block? AH HA! It’s continuous. We need to make it discrete to work with it.

zooplankton <- zooplankton |>
  mutate(block = as.character(block))

ggplot(data = zooplankton,
       aes(x = block, y = zooplankton)) +
  geom_boxplot()

There we go. Always check!

2.1 Fit and Assumption Evaluation

Fit is quite easy. We just add one more factor to an lm model!

zooplankton_lm <- lm(zooplankton ~ treatment + block,
                     data = zooplankton)

We then evaluate residuals almost as usual…

library(performance)
check_model(zooplankton_lm)

We want to look more deeply by fitted versus residuals to inspect for nonadditivity. Let’s check it out.

check_model(zooplankton_lm, check = "linearity") |> plot()

Looks good!

2.2 Coefficients

Let’s look at the coefs!

tidy(zooplankton_lm)

# A tibble: 7 × 5
  term           estimate std.error statistic    p.value
  <chr>             <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)    3.42e+ 0     0.313  1.09e+ 1 0.00000433
2 treatmenthigh -1.64e+ 0     0.289 -5.67e+ 0 0.000473  
3 treatmentlow  -1.02e+ 0     0.289 -3.52e+ 0 0.00781   
4 block2         1.04e-15     0.374  2.78e-15 1.00      
5 block3        -7.00e- 1     0.374 -1.87e+ 0 0.0979    
6 block4        -1.00e+ 0     0.374 -2.68e+ 0 0.0281    
7 block5        -3.00e- 1     0.374 -8.03e- 1 0.445

This is still a little odd, though, as our treatments are evaluated in block 1 control. To truly get just the treatment effect, we need to look at the estimated marginal means - the emmeans! The big thing about emmeans is that it creates a reference grid based on the blocks. It then calculates the treatment effect averaged over all blocks, rather than just in block 1.

library(emmeans)

zoop_em <- emmeans(zooplankton_lm, ~treatment)

zoop_em

 treatment emmean    SE df lower.CL upper.CL
 control     3.02 0.205  8    2.548     3.49
 high        1.38 0.205  8    0.908     1.85
 low         2.00 0.205  8    1.528     2.47

Results are averaged over the levels of: block 
Confidence level used: 0.95

2.3 Evaluating treatment differences

Here, emmeans gets interesting.

contrast(zoop_em, method = "pairwise") |>
  confint()

 contrast       estimate    SE df lower.CL upper.CL
 control - high     1.64 0.289  8    0.813    2.467
 control - low      1.02 0.289  8    0.193    1.847
 high - low        -0.62 0.289  8   -1.447    0.207

Results are averaged over the levels of: block 
Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 3 estimates

Note the message that we’ve averaged over the levels of block. You can do any sort of posthoc evaluation (query your model!) as you wanted before. And, you could have done the same workflow for block as well.

2.4 Faded Examples

Given then similarity with models from last week, let’s just jump right into two examples, noting a key difference or two here and there.

To start with, let’s look at gene expression by different types of bees.

bees <- read.csv("./data/18q07BeeGeneExpression.csv")

#Visualize
________(data=____,
         aes(x = type, y = Expression)) +
  _________()

#fit
bee_lm <- __(______ ~ ______ + _____, data=________)

#assumptions
________(______)

#pairwise comparison
emmeans(__________, spec = ~ ________) |>
  contrast(method = "pairwise") |>
  confint()

Wow, not that different, save adding one more term and the residualPlots.

OK, one more…. repeating an experiment in the intertidal?

intertidal <- read.csv("./data/18e3IntertidalAlgae.csv")

#Visualize
ggplot(data = _____________,
       _____ = ______(x = herbivores, y = sqrtarea)) +
  _____________()

#fit
intertidal_lm <- __(______ ~ ______ + _____, data=________)

#assumptions
________(______)

#pairwise comparison
emmeans(__________, spec = ~ ________) |>
  contrast(method = "___________") |>
  ________()

Did that last one pass the test of non-additivity?

3. Mixing Categorical and Continuous Variables

Combining categorical and continuous variables is not that different from what we have done above. To start with, let’s look at the neanderthal data from class.

neand <- read.csv("./data/18q09NeanderthalBrainSize.csv")
head(neand)

  lnmass lnbrain species
1   4.00    7.19  recent
2   3.96    7.23  recent
3   3.95    7.26  recent
4   4.04    7.23  recent
5   4.11    7.22  recent
6   4.10    7.24  recent

We can clearly see both the categorical variable we’re interested in, and the covariate.

To get a preliminary sense of what’s going on here, do some exploratory visualization with ggplot2 why doncha!

library(ggplot2)

qplot(lnmass, lnbrain, color=species, data=neand) +
  stat_smooth(method="lm")

Now, the CIs are going to be off as this wasn’t all tested in the same model, but you begin to get a sense of whether things are parallel or not, and whether this covariate is important.

What other plots might you produce?

As this is a general linear model, good olde lm() is still there for us.

neand_lm <- lm(lnbrain ~ species + lnmass, data=neand)

3.1 Testing Assumptions

In addition to the ususal tests, we need to make sure that the slopes really are parallel. We do that by fitting a model with an interaction and testing it (which, well, if there was and interaction, might that be interesting).

First, the usual

check_model(neand_lm)

#And now look at residuals by group/predictors
library(car)
residualPlots(neand_lm, tests=FALSE)

To test the parallel presumption

neand_int <- lm(lnbrain ~ species*lnmass, data=neand)

tidy(neand_int)

# A tibble: 4 × 5
  term                 estimate std.error statistic  p.value
  <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)             4.25      0.977      4.35 0.000111
2 speciesrecent           1.18      1.06       1.11 0.274   
3 lnmass                  0.714     0.227      3.14 0.00340 
4 speciesrecent:lnmass   -0.259     0.248     -1.05 0.303

We can see that there is no interaction, and hence we are a-ok to proceed with the parallel lines model.

1.2 Assessing results

First, let’s look at our coefficients

tidy(neand_lm)

# A tibble: 3 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     5.19      0.395      13.1  2.74e-15
2 speciesrecent   0.0703    0.0282      2.49 1.75e- 2
3 lnmass          0.496     0.0917      5.41 4.26e- 6

Everything looks like it matters. If you had wanted to do this with a centered covariate, how would things differ? Try it out.

Second, does our model explain much?

r2(neand_lm)

# R2 for Linear Regression
       R2: 0.449
  adj. R2: 0.418

Not bad - and both types agree fairly well.

Next, we might want to compare covariate adjusted groups and/or look at covariate adjusted means.

library(emmeans)
adj_means <- emmeans(neand_lm, specs = ~species)

#adjusted means
adj_means

 species     emmean      SE df lower.CL upper.CL
 neanderthal  7.272 0.02419 36    7.223    7.321
 recent       7.342 0.01250 36    7.317    7.367

Confidence level used: 0.95

Nice! But, what if we wanted to know what value of the covariate we were using? To get those, we need to use say that we are using lnmass as a reference grid - i.e. we’re marginalizing over it.

avg_sp_means <- emmeans(neand_lm, ~species|lnmass)

avg_sp_means

lnmass = 4.198:
 species     emmean      SE df lower.CL upper.CL
 neanderthal  7.272 0.02419 36    7.223    7.321
 recent       7.342 0.01250 36    7.317    7.367

Confidence level used: 0.95

We can then use this for pairwise comparisons, etc.

#comparisons
contrast(avg_sp_means, method="pairwise") |>
  confint()

lnmass = 4.2:
 contrast             estimate     SE df lower.CL upper.CL
 neanderthal - recent  -0.0703 0.0282 36   -0.128  -0.0131

Confidence level used: 0.95

You can also specify other levels of mass, if you are more interested in those for biological relevance. You can even specify multiple levels, or even a smooth curve and plot the results!

We can also see the estimates of the slopes for each species (which is more useful when there is an interaction - as summary is just fine without.)

#no interaction
emtrends(neand_lm, ~species, var = "lnmass")

 species     lnmass.trend     SE df lower.CL upper.CL
 neanderthal        0.496 0.0917 36     0.31    0.682
 recent             0.496 0.0917 36     0.31    0.682

Confidence level used: 0.95

3.3 Visualization

Visualization is funny, as you want to make parallel lines and also get the CIs right. Rather than rely on ggplot2 to do this natively, we need to futz around a bit with generating predictions

neand_predict <- augment(neand_lm, interval="confidence")

head(neand_predict)

# A tibble: 6 × 11
  lnbrain species lnmass .fitted .lower .upper   .resid   .hat .sigma  .cooksd
    <dbl> <chr>    <dbl>   <dbl>  <dbl>  <dbl>    <dbl>  <dbl>  <dbl>    <dbl>
1    7.19 recent    4       7.24   7.20   7.28 -0.0536  0.0860 0.0669 0.0222  
2    7.23 recent    3.96    7.22   7.18   7.27  0.00624 0.114  0.0676 0.000425
3    7.26 recent    3.95    7.22   7.17   7.27  0.0412  0.122  0.0672 0.0202  
4    7.23 recent    4.04    7.26   7.23   7.30 -0.0335  0.0637 0.0673 0.00611 
5    7.22 recent    4.11    7.30   7.27   7.33 -0.0782  0.0394 0.0662 0.0196  
6    7.24 recent    4.1     7.29   7.27   7.32 -0.0532  0.0418 0.0670 0.00968 
# … with 1 more variable: .std.resid <dbl>

So, here we have fit values, lower confidence interval, and upper confidence intervals. As we have not fed augment() any new data, these values line up with our neand data frame, so can use them immediately.

ggplot(data = neand_predict,
       aes(x = lnmass, color = species)) +
  geom_point(mapping=aes(y=lnbrain)) +
  geom_line(mapping = aes(y=.fitted)) + 
  geom_ribbon(aes(ymin=.lower,
                  ymax=.upper,
                  group = species), 
              fill="lightgrey", 
              alpha=0.5,
              color = NA) +
  theme_bw()

And there we have nice parallel lines with model predicted confidence intervals!

3.4 Examples

I’ve provided two data sets:
1) 18e4MoleRatLayabouts.csv looking at how caste and mass affect the energy mole rates expend

2) 18q11ExploitedLarvalFish.csv looking at how status of a marine area - protected or not - influences the CV around age of maturity of a number of different fish (so, age is a predictor)

Using the following workflow, analyze these data sets.

# Load the data

# Perform a preliminary visualization

# Fit a model

# Test Assumptions and modify model if needed

# Evaluate results

# Post-hocs if you can

# Visualize results