1. ANCOVA

Combining categorical and continuous variables is not that different from multiway ANOVA. To start with, let’s look at the neanderthal data.

neand <- read.csv("./data/18q09NeanderthalBrainSize.csv")
head(neand)
##   lnmass lnbrain species
## 1   4.00    7.19  recent
## 2   3.96    7.23  recent
## 3   3.95    7.26  recent
## 4   4.04    7.23  recent
## 5   4.11    7.22  recent
## 6   4.10    7.24  recent

We can clearly see both the categorical variable we’re interested in, and the covariate.

To get a preliminary sense of what’s going on here, do some exploratory visualization with ggplot2, why doncha!

library(ggplot2)

qplot(lnmass, lnbrain, color=species, data=neand) +
  stat_smooth(method="lm")

Now, the CIs are going to be off as this wasn’t all tested in the same model, but you begin to get a sense of whether things are parallel or not, and whether this covariate is important.

What other plots might you produce?
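One place to start is checking that the groups actually overlap in the covariate - covariate-adjusted means are only trustworthy where the groups share lnmass values. A minimal sketch:

#do the species overlap in body mass?
ggplot(neand, aes(x = species, y = lnmass)) +
  geom_boxplot()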

As this is a general linear model, good olde lm() is still there for us.

neand_lm <- lm(lnbrain ~ species + lnmass, data=neand)

1.1 Testing Assumptions of ANCOVA

In addition to the usual tests, we need to make sure that the slopes really are parallel. We do that by fitting a model with an interaction and testing it (which, well, if there were an interaction, might that be interesting?).

First, the usual

par(mfrow=c(2,2))
plot(neand_lm, which=c(1,2,5))
par(mfrow=c(1,1))

#And now look at residuals by group/predictors
library(car)
residualPlots(neand_lm, tests=FALSE)

To test the parallel slopes assumption

neand_int <- lm(lnbrain ~ species*lnmass, data=neand)

Anova(neand_int)
## Anova Table (Type II tests)
## 
## Response: lnbrain
##                  Sum Sq Df F value    Pr(>F)    
## species        0.027553  1  6.2203    0.0175 *  
## lnmass         0.130018  1 29.3527 4.523e-06 ***
## species:lnmass 0.004845  1  1.0938    0.3028    
## Residuals      0.155033 35                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that there is no interaction, and hence we are a-ok to proceed with the parallel lines model.
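Equivalently, since the two models are nested, you can compare them directly with anova() - this gives the same F test for the interaction term:

#nested model comparison - same test as the species:lnmass row above
anova(neand_lm, neand_int)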

1.2 Assessing results

So, first we have our F-tests using type II sums of squares.

Anova(neand_lm)
## Anova Table (Type II tests)
## 
## Response: lnbrain
##             Sum Sq Df F value    Pr(>F)    
## species   0.027553  1  6.2041   0.01749 *  
## lnmass    0.130018  1 29.2764 4.262e-06 ***
## Residuals 0.159878 36                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both the treatment and covariate matter.

Second, we might want to compare covariate adjusted groups and/or look at covariate adjusted means.

library(emmeans)
adj_means <- emmeans(neand_lm, ~species)

#adjusted means
adj_means
##  species       emmean         SE df lower.CL upper.CL
##  neanderthal 7.271581 0.02418546 36 7.222530 7.320631
##  recent      7.341859 0.01250077 36 7.316506 7.367212
## 
## Confidence level used: 0.95

Huh - what exactly are these? They are the covariate-adjusted means - each species’ expected lnbrain evaluated at the mean of lnmass. To make that explicit, we can condition on lnmass as part of the reference grid - i.e., display the value of lnmass we’re marginalizing over.

avg_sp_means <- emmeans(neand_lm, ~species|lnmass)

avg_sp_means
## lnmass = 4.197949:
##  species       emmean         SE df lower.CL upper.CL
##  neanderthal 7.271581 0.02418546 36 7.222530 7.320631
##  recent      7.341859 0.01250077 36 7.316506 7.367212
## 
## Confidence level used: 0.95

We can then use this for Tukey tests, etc.

#comparisons
contrast(avg_sp_means, method="tukey", adjust="none")
## lnmass = 4.197949:
##  contrast               estimate         SE df t.ratio p.value
##  neanderthal - recent -0.0702784 0.02821518 36  -2.491  0.0175

If you have an interaction, this method is no longer valid. Instead, you’ll need to specify the levels of lnmass. For example

low_lnmass_comp <- emmeans(neand_int, ~species|lnmass, at =list(lnmass = 4.1))
high_lnmass_comp <- emmeans(neand_int, ~species|lnmass, at =list(lnmass = 4.4))

contrast(low_lnmass_comp, method="tukey")
## lnmass = 4.1:
##  contrast               estimate         SE df t.ratio p.value
##  neanderthal - recent -0.1170245 0.05283693 35  -2.215  0.0334
contrast(high_lnmass_comp, method="tukey")
## lnmass = 4.4:
##  contrast                estimate        SE df t.ratio p.value
##  neanderthal - recent -0.03917795 0.0409668 35  -0.956  0.3455

Which levels of your covariate you test is up to you and the questions you are asking.
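If you don’t have values in mind a priori, quantiles of the covariate are one defensible choice. A sketch - the quartiles here are an arbitrary pick:

#compare species at the lower and upper quartiles of lnmass
quartile_masses <- unname(quantile(neand$lnmass, probs = c(0.25, 0.75)))
quartile_comp <- emmeans(neand_int, ~species|lnmass,
                         at = list(lnmass = quartile_masses))
contrast(quartile_comp, method = "tukey")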

You can also see the estimates of the slopes for each species (which is more useful when there is an interaction - summary() is just fine without one).

#no interaction
emtrends(neand_lm, ~species, var = "lnmass")
##  species     lnmass.trend         SE df lower.CL  upper.CL
##  neanderthal    0.4963161 0.09172755 36 0.310284 0.6823482
##  recent         0.4963161 0.09172755 36 0.310284 0.6823482
## 
## Confidence level used: 0.95
#interaction
emtrends(neand_int, ~species, var = "lnmass")
##  species     lnmass.trend        SE df  lower.CL  upper.CL
##  neanderthal    0.7135471 0.2270081 35 0.2526962 1.1743980
##  recent         0.4540585 0.1001227 35 0.2507986 0.6573185
## 
## Confidence level used: 0.95

1.3 Visualization

Visualization is funny, as you want to make parallel lines and also get the CIs right. Rather than rely on ggplot2 to do this natively, we need to futz around a bit with generating predictions.

neand_predict <- predict(neand_lm, interval="confidence")

head(neand_predict)
##        fit      lwr      upr
## 1 7.243614 7.203988 7.283240
## 2 7.223761 7.178077 7.269445
## 3 7.218798 7.171538 7.266059
## 4 7.263467 7.229347 7.297586
## 5 7.298209 7.271375 7.325042
## 6 7.293246 7.265628 7.320863

So, here we have fitted values and the lower and upper bounds of the confidence interval. As we have not fed predict() any new data, these values line up with our neand data frame, so we can cbind it all together, and then use these values to make a prediction plot.

neand <- cbind(neand, neand_predict)

ggplot(data=neand) +
  geom_point(mapping=aes(x=lnmass, y=lnbrain, color=species)) +
  geom_line(mapping = aes(x = lnmass, y=fit, color=species)) + 
  geom_ribbon(data=neand, aes(x = lnmass, 
                              ymin=lwr, 
                              ymax=upr,
                              group = species), 
              fill="lightgrey", 
              alpha=0.5) +
  theme_bw()

And there we have nice parallel lines with model predicted confidence intervals!

1.4 Examples

I’ve provided two data sets:
1) 18e4MoleRatLayabouts.csv looking at how caste and mass affect the energy mole rats expend

2) 18q11ExploitedLarvalFish.csv looking at how the status of a marine area - protected or not - influences the CV around age of maturity of a number of different fish (so, age is a predictor)

Using the following workflow, analyze these data sets.

# Load the data

# Perform a preliminary visualization

# Fit an ANCOVA model

# Test Assumptions and modify model if needed

# Evaluate results

# Post-hocs if you can

# Visualize results

2. Multiple Linear Regression

Multiple linear regression is conceptually very similar to ANCOVA. Let’s use the Keeley fire severity plant richness data to see it in action.

keeley <- read.csv("data/Keeley_rawdata_select4.csv")

head(keeley)
##   distance elev  abiotic age   hetero firesev     cover rich
## 1 53.40900 1225 60.67103  40 0.757065    3.50 1.0387974   51
## 2 37.03745   60 40.94291  25 0.491340    4.05 0.4775924   31
## 3 53.69565  200 50.98805  15 0.844485    2.60 0.9489357   71
## 4 53.69565  200 61.15633  15 0.690847    2.90 1.1949002   64
## 5 51.95985  970 46.66807  23 0.545628    4.30 1.2981890   68
## 6 51.95985  970 39.82357  24 0.652895    4.00 1.1734866   34

For our purposes, we’ll focus on fire severity and plant cover as predictors.

2.1 Visualizing

I’m not going to lie, visualizing multiple continuous variables is as much of an art as a science. One can use colors and sizes of points, or slice up the data into chunks and facet that. Here are a few examples.

qplot(cover, rich, color=firesev, data=keeley) +
  scale_color_gradient(low="yellow", high="red") +
  theme_bw() +
  facet_wrap(~cut_number(firesev, 4))

We can also look at everything!

pairs(keeley)

What other plots can you make?
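If you have the GGally package installed, ggpairs() gives a fancier scatterplot matrix, with correlations and densities built in - a sketch:

#pairs() with correlations - assumes GGally is installed
library(GGally)
ggpairs(keeley)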

2.2 Fit and Evaluate Assumptions

Fitting is straightforward for an additive MLR model. It’s just a linear model!

keeley_mlr <- lm(rich ~ firesev + cover, data=keeley)

As for assumptions, we have the usual

par(mfrow=c(2,2))
plot(keeley_mlr, which=c(1,2,5))
par(mfrow=c(1,1))

But now we also need to think about how the residuals relate to each predictor. Fortunately, there’s still residualPlots.

residualPlots(keeley_mlr, tests=FALSE)

Odd bow shape - but not too bad. Maybe there’s an interaction? Maybe we want to log transform something? Who knows!

We also want to look at multicollinearity. There are two steps for that. First, the vif

vif(keeley_mlr)
##  firesev    cover 
## 1.236226 1.236226

Not bad. We might also want to look at the correlation matrix. dplyr can help us here, as we want to select down to just the relevant columns.

library(dplyr)

keeley %>%
  select(firesev, cover) %>%
  cor()
##            firesev      cover
## firesev  1.0000000 -0.4371346
## cover   -0.4371346  1.0000000

Also, not too bad! Well within our range of tolerances.

2.3 Evaluation

We evaluate the same way as usual

Anova(keeley_mlr)
## Anova Table (Type II tests)
## 
## Response: rich
##            Sum Sq Df F value  Pr(>F)  
## firesev    1258.9  1  6.5004 0.01254 *
## cover       711.6  1  3.6744 0.05854 .
## Residuals 16849.1 87                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And then the coefficients and R2 (sidenote - in brms there is a Bayesian R2, brms::bayes_R2() - see the brms documentation for more)

summary(keeley_mlr)
## 
## Call:
## lm(formula = rich ~ firesev + cover, data = keeley)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.029 -11.583  -1.655  11.759  28.722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.9358     7.0437   7.657 2.45e-11 ***
## firesev      -2.5308     0.9926  -2.550   0.0125 *  
## cover         9.9105     5.1701   1.917   0.0585 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.92 on 87 degrees of freedom
## Multiple R-squared:  0.1703, Adjusted R-squared:  0.1513 
## F-statistic:  8.93 on 2 and 87 DF,  p-value: 0.0002968

Not an amazing fit, but the coefficients are clearly different from 0.
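If you prefer intervals to p-values, confidence intervals for those coefficients are one line away:

#95% CIs on the coefficients
confint(keeley_mlr)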

2.4 Visualization

This is where things get sticky. We have two main approaches. First, visualizing with component + residual plots

crPlots(keeley_mlr, smooth=FALSE)

But the values on the y axis are… odd. We get a sense of what’s going on and the scatter after accounting for our predictor of interest, but we might want to look at, say, the effect of one variable evaluated at the mean of the other.

For that, we have visreg

library(visreg)
par(mfrow=c(1,2))
visreg(keeley_mlr)

par(mfrow=c(1,1))

Now the axes make far more sense, and we have a sense of the relationship.

We can actually whip this up on our own using the median of one predictor, a sequence of values for the other, and predict().

k_med_firesev <- data.frame(firesev = median(keeley$firesev),
                                 cover = seq(0,1.5, length.out = 100))
k_med_firesev <- cbind(k_med_firesev, predict(keeley_mlr, 
                                              newdata = k_med_firesev,
                                              interval="confidence"))
  
ggplot() +
  geom_point(data=keeley, mapping = aes(x=cover, y = rich)) +
  geom_line(data = k_med_firesev, mapping=aes(x=cover, y=fit), color="blue") +
  geom_ribbon(data = k_med_firesev, mapping = aes(x=cover, 
                                                  ymin = lwr,
                                                  ymax = upr),
              fill="lightgrey", alpha=0.5)

Or, we can use modelr to explore the model and combine that exploration with the data. Let’s get the curve for cover at four levels of fire severity. We’ll use both modelr::data_grid and modelr::add_predictions for a nice easy workflow.

library(modelr)

k_firesev_explore <- data_grid(keeley,
                               cover = seq_range(cover, 100),
                               firesev = seq_range(firesev, 4)) %>%
  add_predictions(model = keeley_mlr, var = "rich")

Nice, no? Sadly, add_predictions doesn’t give us variance (but see my code in https://github.com/tidyverse/modelr/issues/38 for a possible solution if you want it). If you want those, you’ll have to go back to using predict, although you can still use it with data_grid. Actually, let’s try that.

k_firesev_explore <- data_grid(keeley,
                               cover = seq_range(cover, 100),
                               firesev = seq_range(firesev, 4)) %>%
  cbind(predict(keeley_mlr, newdata = ., interval = "confidence")) %>%
  rename(rich = fit)

Huh. Not bad.

Anywho, then we can visualize with the data!

ggplot(data=keeley,
             mapping = aes(x=cover, y = rich, color = firesev)) +
  geom_point() +
  geom_line(data = k_firesev_explore, aes(group = firesev)) +
  scale_color_gradient(low="yellow", high="red") +
  facet_wrap(~cut_interval(firesev, 4)) +
  theme_bw(base_size = 14)

2.5 Examples

OK, here are two datasets to work with:

  1. planktonSummary.csv showing plankton from Lake Baikal (thanks, Stephanie Hampton). Evaluate how Chlorophyll (CHLFa) is affected by other predictors.

  2. SwaddleWestNile2002NCEAS_shortnames.csv is about the prevalence of West Nile virus in birds around Sacramento County in California. What predicts human WNV?

Using the following workflow, analyze these data sets.

# Load the data

# Perform a preliminary visualization. Play with this and choose two predictors

# Fit a MLR model

# Test Assumptions and modify model if needed

# Evaluate results

# Visualize results

3. Interaction Effects in MLR

3.1 Explore and Model

What if there was an interaction here? One reasonable hypothesis is that fires affect stands of different ages differently. So, let’s look at richness as a function of age and fire severity. First, we’ll visualize, then fit a model.

ggplot(keeley,
       aes(x = age, y = rich, color = firesev)) +
  geom_point(size = 2) +
  scale_color_viridis_c(option = "B") +
  facet_wrap(~cut_number(firesev, 4))

Huh. Kinda looks like age doesn’t have any effect until high fire severity.

keeley_mlr_int <- lm(rich ~ age * firesev, data=keeley)

3.2 Assumptions

The assumptions here are the same as for MLR - normality of residuals, homoscedasticity, the relationship should not be driven by outliers alone, and, of course, lack of collinearity. Let’s explore the first few.

par(mfrow=c(2,2))
plot(keeley_mlr_int, which = c(1,2,5))
par(mfrow=c(1,1))

That looks good! If we look at the partial residual plots…

residualPlots(keeley_mlr_int)

##            Test stat Pr(>|Test stat|)
## age          -0.6982           0.4869
## firesev       0.6176           0.5385
## Tukey test    1.3802           0.1675

Also, excellent. Now, the vif…

vif(keeley_mlr_int)
##         age     firesev age:firesev 
##    7.967244    4.879394   15.800768

So… that 15 could be troublesome. Often a high VIF is not a problem when it comes from an interaction or other nonlinearity built from its component terms - but how can we be certain? First, let’s look at the correlation between the predictors.

keeley %>% select(firesev, age) %>%
  mutate(int = firesev*age) %>%
  cor()
##           firesev       age       int
## firesev 1.0000000 0.4538654 0.7743624
## age     0.4538654 1.0000000 0.8687952
## int     0.7743624 0.8687952 1.0000000

OK, the interaction is pretty highly correlated with age. Enough that, well, it could be troublesome to pull out the true effect. So, what’s the answer?

3.2.1 Centering

Well, one way to break that correlation is to center our predictors before using them in the relationship. It changes the meaning of the coefficients, but, the model results are the same - particularly with respect to F tests, etc. Let’s try that here.

keeley <- keeley %>%
  mutate(firesev_c = firesev - mean(firesev),
         age_c = age - mean(age))

keeley_mlr_int_c <- lm(rich ~ age_c * firesev_c, data=keeley)

vif(keeley_mlr_int_c)
##           age_c       firesev_c age_c:firesev_c 
##        1.261883        1.261126        1.005983

We see the vif is now small. Moreover…

keeley %>% select(firesev_c, age_c) %>%
  mutate(int = firesev_c*age_c) %>%
  cor()
##             firesev_c       age_c         int
## firesev_c  1.00000000  0.45386536 -0.06336909
## age_c      0.45386536  1.00000000 -0.06792114
## int       -0.06336909 -0.06792114  1.00000000

Look at that correlation melt away. We’re still explaining variation in the same way, and notice that our F table from Anova(keeley_mlr_int_c) is identical to the one before.
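You don’t have to take my word for it - check for yourself:

#same Type II table as Anova(keeley_mlr_int)
Anova(keeley_mlr_int_c)

Let’s see how the coefficients differ.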

coef(keeley_mlr_int)
## (Intercept)         age     firesev age:firesev 
##  42.9352358   0.8268119   3.1056379  -0.2302444
coef(keeley_mlr_int_c)
##     (Intercept)           age_c       firesev_c age_c:firesev_c 
##      51.3790453      -0.2242540      -2.7809451      -0.2302444

So, in the uncentered model, the intercept is the richness when age and firesev are both 0. The age and firesev effects are their effects when the other is 0, respectively. And the interaction is how they modify one another. Note that in the centered model, the interaction is the same. But now the intercept is the richness at the average level of both predictors, and the additive terms are the effects of each predictor at the average level of the other.
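A quick sanity check on that interpretation: predicting from the uncentered model at the average age and firesev should recover the centered model’s intercept, ~51.38.

#should return the centered model's intercept, ~51.38
predict(keeley_mlr_int,
        newdata = data.frame(age = mean(keeley$age),
                             firesev = mean(keeley$firesev)))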

3.3 Evaluate

Given the lack of difference between the two models, we’re going to work with the original one. We’ve seen the coefficients and talked about their interpretation. The rest of the output can be interpreted as usual for a linear model.

summary(keeley_mlr_int)
## 
## Call:
## lm(formula = rich ~ age * firesev, data = keeley)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.7559 -10.1586   0.0662  10.7169  30.1410 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.93524    7.85054   5.469 4.37e-07 ***
## age          0.82681    0.31303   2.641 0.009809 ** 
## firesev      3.10564    1.86304   1.667 0.099157 .  
## age:firesev -0.23024    0.06412  -3.591 0.000548 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.15 on 86 degrees of freedom
## Multiple R-squared:  0.268,  Adjusted R-squared:  0.2425 
## F-statistic:  10.5 on 3 and 86 DF,  p-value: 5.934e-06

Note that it doesn’t look like fire severity has an effect when age is 0. But what does age = 0 really mean? This is one reason why the centered model is often more interpretable.

So, how do we understand the interaction?

We can do things like calculate out the effect of one predictor at different levels of the other…

tibble(age = 0:10) %>%
  mutate(fire_effect = coef(keeley_mlr_int)[3] + coef(keeley_mlr_int)[4]*age)
## # A tibble: 11 x 2
##      age fire_effect
##    <int>       <dbl>
##  1     0       3.11 
##  2     1       2.88 
##  3     2       2.65 
##  4     3       2.41 
##  5     4       2.18 
##  6     5       1.95 
##  7     6       1.72 
##  8     7       1.49 
##  9     8       1.26 
## 10     9       1.03 
## 11    10       0.803

But this can be unsatisfying, as we have to propagate the error ourselves - ECH. Why not try and create something more intuitive…

(sidenote: this is trivial to do, with all of the attendant uncertainty, with Bayes. Just sayin’.)
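That said, you can get the error propagated for you without going Bayesian. emtrends() from the ANCOVA section will give the slope of firesev at chosen ages, with CIs. A sketch - the ages here are arbitrary:

#slope of firesev at three arbitrary stand ages, with CIs
emtrends(keeley_mlr_int, ~age, var = "firesev",
         at = list(age = c(10, 25, 40)))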

3.4 Visualize Results

Fundamentally, we have to create visualizations that look at different levels of both predictors somehow. There are a number of strategies.

3.4.1 Visreg

We can begin by looking at the effect of one variable at different levels of the other.

visreg(keeley_mlr_int, "age", "firesev", gg=TRUE)

Here we see a plot where the points are adjusted for different levels of fire severity - an even split between three levels - and then we see the resulting fit lines. We can of course reverse this if we’re more interested in the other effect.

visreg(keeley_mlr_int, "firesev", "age", gg=TRUE)

Both tell a compelling story of changing slopes that are easily understood. For example, in the latter, fire has a stronger effect on older stands.

3.4.2 Fitted Model under Different Conditions

We might want to roll our own, however, and see the plot with different slices of the data. Fortunately, we can use the identical strategy as our MLR above - only now the slope changes. So, using the same strategy as above for making predicted data frames:

k_int_explore <- data_grid(keeley,
                               firesev = seq_range(firesev, 100),
                               age = seq_range(age, 4)) %>%
  cbind(predict(keeley_mlr_int, newdata = ., interval = "confidence")) %>%
  rename(rich = fit)

Notice how little differs from the additive workflow above. We can now play with how we want to plot this. Here’s one option, but you can come up with others easily.

ggplot(keeley,
       aes(x = firesev, y = rich, color = age)) + 
  geom_point() +
  geom_line(data = k_int_explore, aes(group = age)) +
  geom_ribbon(data = k_int_explore, aes(ymin = lwr, ymax = upr), alpha = 0.2, color = NA) +
  facet_wrap(~cut_interval(age,4)) +
  scale_color_viridis_c(option = "D")

3.4.3 Surface Plot of Model

Finally, a surface plot can be very helpful - particularly when there are many interactions. To make a surface plot, we can extract model predictions at the intersection of multiple predictors and then visualize a grid where color or some other property reflects predicted values. It lets us see more nuance in an interaction. Let’s start with getting our surface values for a 100x100 grid of predictors.

keeley_surface <- data_grid(keeley,
                            age = seq_range(age, 100),
                            firesev = seq_range(firesev, 100)) %>%
  add_predictions(keeley_mlr_int, var = "rich")

Now, we can plot this grid using geom_raster or geom_tile (raster will smooth things a bit).

ggplot(keeley_surface,
       aes(x = age, y = firesev, fill = rich)) +
  geom_raster() +
  scale_fill_viridis_c(option = "B")

Now we can see the landscape more clearly. Old stands with high fire severity have low richness. However, either youth or low fire severity means higher richness. This kind of technique can be great with higher order interactions - for example, you can facet by up to two more variables to examine them.

Also note, if you want to be lazy, you can do this with visreg for two variables

visreg2d(keeley_mlr_int, "age", "firesev")

3.5 Examples

OK, here are two datasets to work with. Choose one, and play around with possible interactions using ggplot. Then fit and evaluate a model!

  1. planktonSummary.csv showing plankton from Lake Baikal (thanks, Stephanie Hampton). Evaluate how Chlorophyll (CHLFa) is affected by other predictors.

  2. SwaddleWestNile2002NCEAS_shortnames.csv is about the prevalence of West Nile virus in birds around Sacramento County in California. What predicts human WNV?

Using the following workflow, analyze these data sets.

# Load the data

# Perform a preliminary visualization. Play with this and choose two predictors

# Fit a MLR model with an interaction

# Test Assumptions and modify model if needed

# Evaluate results

# Visualize results

4. Information Criteria

So, what if you have a number of models and want to evaluate them against one another using AIC? If you only have two or three and just want to look at their AIC scores, R has a simple function - AIC!

AIC(keeley_mlr)
## [1] 734.3108
AIC(keeley_mlr_int)
## [1] 725.0355

Hey - notice that, as a predictor of richness, the interaction model does better. But there are a lot of possible models here with age, firesev, and cover as predictors. And lots of possible interactions! Let’s keep it simple at first.

4.1 Our Models

We’ll compare five models. The first two are the ones we’ve already fit. The last three are as follows:

keeley_no_ints <- lm(rich ~ age + firesev + cover, data = keeley)
keeley_two_ints <- lm(rich ~ age * firesev + cover*firesev + age*cover, data = keeley)
keeley_three_ints <- lm(rich ~ age * firesev * cover, data = keeley)

Should be fun!

4.2 An AIC Table

So, with all of these models, we want to build an AIC table with weights, delta AICs, etc, before we go forward. To do that, we use the ‘AICcmodavg’ package, which is brilliant and constantly expanding. It works with lists of models.

library(AICcmodavg)

mod_list <- list(keeley_mlr, keeley_mlr_int, keeley_no_ints, keeley_two_ints, keeley_three_ints)
names(mod_list) <- c("mlr", "int", "noint", "twoint", "threeint")

aictab(mod_list)
## 
## Model selection based on AICc:
## 
##          K   AICc Delta_AICc AICcWt Cum.Wt      LL
## int      5 725.75       0.00   0.87   0.87 -357.52
## twoint   8 730.41       4.66   0.08   0.95 -356.32
## threeint 9 732.32       6.57   0.03   0.98 -356.03
## mlr      4 734.78       9.03   0.01   0.99 -363.16
## noint    5 735.49       9.74   0.01   1.00 -362.39

Notice that by default it uses the AICc. Let’s see how this compares to the AIC.

aictab(mod_list, second.ord = FALSE)
## 
## Model selection based on AIC:
## 
##          K    AIC Delta_AIC AICWt Cum.Wt      LL
## int      5 725.04      0.00  0.79   0.79 -357.52
## twoint   8 728.63      3.60  0.13   0.92 -356.32
## threeint 9 730.07      5.03  0.06   0.99 -356.03
## mlr      4 734.31      9.28  0.01   0.99 -363.16
## noint    5 734.78      9.74  0.01   1.00 -362.39

No huge differences here, save maybe in weighting. But qualitatively the same.

4.3 Evaluating

All signs point to int being the ‘best’, but the two- and three-interaction models are comparable. But are they really different? Let’s look using broom. This is where broom starts to shine.

library(broom)

all_coefs <- bind_rows(tidy(keeley_mlr_int) %>% mutate(model = "int"), 
                       tidy(keeley_two_ints) %>% mutate(model = "two"),
                       tidy(keeley_three_ints) %>% mutate(model = "three")) %>%
  filter(term != "(Intercept)")

Let’s look at the three-way interaction

tail(all_coefs)
## # A tibble: 6 x 6
##   term              estimate std.error statistic p.value model
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl> <chr>
## 1 firesev              1.27      4.94      0.257   0.798 three
## 2 cover              -13.1      27.5      -0.478   0.634 three
## 3 age:firesev         -0.107     0.149    -0.718   0.475 three
## 4 age:cover            1.10      1.07      1.03    0.306 three
## 5 firesev:cover        2.24      6.46      0.347   0.730 three
## 6 age:firesev:cover   -0.154     0.213    -0.721   0.473 three

Well, not different from 0. Nor are any of the two-ways or other terms in that model. It’s not great.

What about the two-way interactions?

all_coefs %>%
  filter(term %in% c("age:cover", "age:firesev", "firesev:cover"))
## # A tibble: 7 x 6
##   term          estimate std.error statistic  p.value model
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl> <chr>
## 1 age:firesev     -0.230    0.0641    -3.59  0.000548 int  
## 2 age:firesev     -0.197    0.0810    -2.43  0.0170   two  
## 3 firesev:cover   -1.70     3.44      -0.494 0.622    two  
## 4 age:cover        0.394    0.428      0.922 0.359    two  
## 5 age:firesev     -0.107    0.149     -0.718 0.475    three
## 6 age:cover        1.10     1.07       1.03  0.306    three
## 7 firesev:cover    2.24     6.46       0.347 0.730    three

We can see that, across models, age:firesev is different from 0, but the others never are. What about cover?

all_coefs %>%
  filter(term == "cover")
## # A tibble: 2 x 6
##   term  estimate std.error statistic p.value model
##   <chr>    <dbl>     <dbl>     <dbl>   <dbl> <chr>
## 1 cover     3.44      15.0     0.229   0.819 two  
## 2 cover   -13.1       27.5    -0.478   0.634 three

Nope - never different from 0.

So, functionally, keeley_mlr_int is where it’s at. You could justify choosing it as a ‘best’ model. But… why would you?

4.4 Coefficient averaging

We can look at averaged coefficients

modavg(mod_list, parm="firesev")

Except this package doesn’t play well with interactions. With good theoretical reason. If we’re including an interaction, in some ways, we’re double-counting it by having it in two places. Better to go straight to prediction!

4.5 Ensemble prediction

Let’s start by creating a sample data frame to work with.

aic_newdata <- data_grid(keeley,
                         firesev = seq_range(firesev, 100),
                         age = seq_range(age, 4),
                         cover = seq_range(cover, 4))

OK, so, we’re looking at how fire severity’s effect is modified by age and cover. Note, this is how we’d do it for a 3-way interaction, regardless. Now, let’s get some model averaged predictions. This can take a bit, as we’re doing a lot of predictions.

aic_preds <- bind_cols(aic_newdata, 
                       modavgPred(mod_list,  newdata = aic_newdata) %>% as.data.frame) %>%
  rename(rich = mod.avg.pred)

And now, we plot! Just like any other MLR, but, now we can really understand what’s happening here.

ggplot(keeley,
       aes(x = firesev, y = rich)) +
  geom_point() +
  facet_grid(cut_interval(age, 4) ~ cut_interval(cover, 4),
             labeller = labeller(.rows = label_both, .cols = label_both)) +
  geom_line(data = aic_preds)  +
  geom_ribbon(data = aic_preds, mapping = aes(ymin = lower.CL, ymax = upper.CL),
              alpha = 0.1)

So, we can see pretty clearly here that cover… doesn’t matter! Again, this would support our intuition to go with the firesev*age only model.

4.6 All models

Note, we could have worked with all possible models using MuMIn, but I generally recommend against it.

library(MuMIn)

options(na.action = "na.fail")
allmods <- dredge(keeley_three_ints)
options(na.action = "na.omit")

This object can be used for a multitude of things, such as coefficients -

ensemble_model <- model.avg(allmods, fit = TRUE)
summary(ensemble_model)
## 
## Call:
## model.avg(object = get.models(object = allmods, subset = NA))
## 
## Component model call: 
## lm(formula = rich ~ <19 unique rhs>, data = keeley)
## 
## Component models: 
##         df  logLik   AICc delta weight
## 135      5 -357.52 725.75  0.00   0.41
## 1235     6 -356.78 726.57  0.82   0.27
## 12345    7 -356.45 728.26  2.52   0.12
## 12356    7 -356.78 728.92  3.17   0.08
## 123456   8 -356.32 730.41  4.66   0.04
## 1234     6 -359.45 731.92  6.17   0.02
## 1234567  9 -356.03 732.32  6.57   0.02
## 124      5 -361.51 733.73  7.98   0.01
## 12346    7 -359.42 734.21  8.46   0.01
## 236      5 -361.79 734.29  8.54   0.01
## 23       4 -363.16 734.78  9.03   0.00
## 1236     6 -361.07 735.15  9.40   0.00
## 123      5 -362.39 735.49  9.74   0.00
## 13       4 -363.80 736.08 10.33   0.00
## 3        3 -365.02 736.31 10.56   0.00
## 12       4 -364.35 737.16 11.41   0.00
## 2        3 -366.40 739.08 13.33   0.00
## 1        3 -367.25 740.78 15.03   0.00
## (Null)   2 -371.56 747.25 21.50   0.00
## 
## Term codes: 
##               age             cover           firesev         age:cover 
##                 1                 2                 3                 4 
##       age:firesev     cover:firesev age:cover:firesev 
##                 5                 6                 7 
## 
## Model-averaged coefficients:  
## (full average) 
##                    Estimate Std. Error Adjusted SE z value Pr(>|z|)   
## (Intercept)       42.655533  12.835277   12.967892   3.289   0.0010 **
## age                0.677043   0.489664    0.493695   1.371   0.1703   
## firesev            2.770071   2.600026    2.627424   1.054   0.2917   
## age:firesev       -0.204642   0.086827    0.087603   2.336   0.0195 * 
## cover              1.590026   9.333162    9.435542   0.169   0.8662   
## age:cover          0.093783   0.305187    0.307432   0.305   0.7603   
## cover:firesev     -0.012407   1.567992    1.587464   0.008   0.9938   
## age:cover:firesev -0.002378   0.032584    0.032907   0.072   0.9424   
##  
## (conditional average) 
##                   Estimate Std. Error Adjusted SE z value Pr(>|z|)   
## (Intercept)       42.65553   12.83528    12.96789   3.289  0.00100 **
## age                0.68590    0.48665     0.49076   1.398  0.16222   
## firesev            2.79738    2.59815     2.62584   1.065  0.28673   
## age:firesev       -0.21686    0.07308     0.07405   2.928  0.00341 **
## cover              2.72936   12.10023    12.23576   0.223  0.82349   
## age:cover          0.45636    0.53644     0.54264   0.841  0.40035   
## cover:firesev     -0.07963    3.97159     4.02093   0.020  0.98420   
## age:cover:firesev -0.15356    0.21293     0.21611   0.711  0.47737   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Relative variable importance: 
##                      firesev age  age:firesev cover age:cover
## Importance:          0.99    0.99 0.94        0.58  0.21     
## N containing models:   14      14    6          14     6     
##                      cover:firesev age:cover:firesev
## Importance:          0.16          0.02             
## N containing models:    6             1

Ensemble prediction, and more…

predict(ensemble_model, newdata = keeley, se.fit = TRUE)
## $fit
##  [1] 55.94025 51.82056 54.75745 55.35670 54.38951 54.52023 49.07104
##  [8] 52.80317 27.07656 40.42988 24.59934 35.39336 35.74320 42.10585
## [15] 53.63905 44.38464 53.07259 52.55435 48.89172 49.44536 54.49227
## [22] 52.66096 52.39500 53.67973 48.25867 51.04438 26.23036 32.89635
## [29] 27.27638 17.70247 41.91399 49.11634 55.09672 54.17400 46.17010
## [36] 44.99241 48.90847 42.54092 46.51661 54.11613 56.67886 53.90774
## [43] 55.30103 55.82946 54.24276 53.65008 51.26428 53.07287 44.82727
## [50] 50.36650 48.34256 51.93611 56.86923 48.51114 52.30049 56.26858
## [57] 52.56040 52.77886 55.75125 53.07282 49.32055 50.50299 54.47587
## [64] 51.44560 51.80181 52.50160 50.33155 54.42453 53.50820 52.05141
## [71] 52.72063 50.53759 45.74539 37.75844 46.87192 52.30524 51.74679
## [78] 52.60328 55.60156 53.76973 52.73553 53.14333 50.67973 50.26351
## [85] 53.21702 56.05900 54.43245 49.10692 50.03613 47.59231
## 
## $se.fit
##  [1] 3.848471 2.145000 2.381958 2.857262 3.111395 2.584599 2.277701
##  [8] 2.897538 4.603265 2.363902 7.312039 3.608073 3.115718 4.446242
## [15] 1.916508 5.205473 1.667000 3.797781 3.075767 1.892694 3.634058
## [22] 1.793240 1.758975 2.884631 6.570208 3.538713 4.938348 3.799940
## [29] 5.614239 6.769176 2.128169 2.681771 4.127411 2.782341 2.134876
## [36] 2.109807 1.609281 3.903918 4.439608 2.048748 2.634068 2.004455
## [43] 3.630051 2.444019 2.036847 2.201299 2.984301 2.146869 3.744039
## [50] 1.801258 2.228370 4.417619 2.806461 2.299537 3.916103 2.701289
## [57] 3.408298 1.801591 2.958364 2.201928 5.816436 4.954496 1.900818
## [64] 1.604661 3.195837 1.928137 1.666128 3.530190 2.637017 3.597457
## [71] 3.308711 1.798562 3.160948 5.159222 2.209312 4.094805 2.661071
## [78] 2.197053 4.755136 2.871134 2.568999 2.664298 2.346199 2.085816
## [85] 2.520114 2.638233 4.974604 2.050819 5.456887 2.609242
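If you want to eyeball those, one quick (and hypothetical) check is to plot model-averaged predictions against the observed richness - the +/- 2 SE band is approximate:

#observed vs. model-averaged predicted richness
ens_preds <- predict(ensemble_model, newdata = keeley, se.fit = TRUE)
keeley_ens <- keeley %>%
  mutate(fit = ens_preds$fit, se.fit = ens_preds$se.fit)

ggplot(keeley_ens, aes(x = rich, y = fit)) +
  geom_point() +
  geom_errorbar(aes(ymin = fit - 2*se.fit, ymax = fit + 2*se.fit), width = 0) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "observed richness", y = "model-averaged prediction")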

4.7 Examples

Go back to the previous datasets and try multiple models! They don’t have to contain interactions (really!). To remind you, they are

  1. planktonSummary.csv showing plankton from Lake Baikal (thanks, Stephanie Hampton). Evaluate how Chlorophyll (CHLFa) is affected by other predictors.

  2. SwaddleWestNile2002NCEAS_shortnames.csv is about the prevalence of West Nile virus in birds around Sacramento County in California. What predicts human WNV?

But now your workflow is different (depending also on whether you use AICcmodavg or MuMIn)

# Fit many models (or a master one and then dredge)

# Put into a list or modavg object

# Evaluate the AIC table - draw conclusions if possible

# Look at model averaged quantities

# Make ensemble forecasts and visualize