Combining categorical and continuous variables is not that different from multiway ANOVA. To start with, let’s look at the Neanderthal data.
neand <- read.csv("./data/18q09NeanderthalBrainSize.csv")
head(neand)
## lnmass lnbrain species
## 1 4.00 7.19 recent
## 2 3.96 7.23 recent
## 3 3.95 7.26 recent
## 4 4.04 7.23 recent
## 5 4.11 7.22 recent
## 6 4.10 7.24 recent
We can clearly see both the categorical variable we’re interested in and the covariate.
To get a preliminary sense of what’s going on here, do some exploratory visualization with ggplot2, why doncha!
library(ggplot2)
qplot(lnmass, lnbrain, color=species, data=neand) +
stat_smooth(method="lm")
Now, the CIs are going to be off as this wasn’t all tested in the same model, but you begin to get a sense of whether things are parallel or not, and whether this covariate is important.
What other plots might you produce?
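One possibility, as a quick sketch: check how much the covariate overlaps between groups, since covariate overlap matters for comparing adjusted means.
#does lnmass overlap between species?
ggplot(neand, aes(x = species, y = lnmass)) +
  geom_boxplot()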
As this is a general linear model, good olde lm() is still there for us.
neand_lm <- lm(lnbrain ~ species + lnmass, data=neand)
In addition to the usual tests, we need to make sure that the slopes really are parallel. We do that by fitting a model with an interaction and testing it (and, well, if there were an interaction, might that not be interesting in its own right?).
First, the usual
par(mfrow=c(2,2))
plot(neand_lm, which=c(1,2,5))
par(mfrow=c(1,1))
#And now look at residuals by group/predictors
library(car)
residualPlots(neand_lm, tests=FALSE)
To test the parallel slopes assumption
neand_int <- lm(lnbrain ~ species*lnmass, data=neand)
Anova(neand_int)
## Anova Table (Type II tests)
##
## Response: lnbrain
## Sum Sq Df F value Pr(>F)
## species 0.027553 1 6.2203 0.0175 *
## lnmass 0.130018 1 29.3527 4.523e-06 ***
## species:lnmass 0.004845 1 1.0938 0.3028
## Residuals 0.155033 35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that there is no interaction, and hence we are a-ok to proceed with the parallel lines model.
So, first we have our F-tests using type II sums of squares.
Anova(neand_lm)
## Anova Table (Type II tests)
##
## Response: lnbrain
## Sum Sq Df F value Pr(>F)
## species 0.027553 1 6.2041 0.01749 *
## lnmass 0.130018 1 29.2764 4.262e-06 ***
## Residuals 0.159878 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the treatment and covariate matter.
Second, we might want to compare covariate adjusted groups and/or look at covariate adjusted means.
library(emmeans)
adj_means <- emmeans(neand_lm, ~species)
#adjusted means
adj_means
## species emmean SE df lower.CL upper.CL
## neanderthal 7.271581 0.02418546 36 7.222530 7.320631
## recent 7.341859 0.01250077 36 7.316506 7.367212
##
## Confidence level used: 0.95
Huh - those numbers seem low. That’s because they are the intercepts, not comparable adjusted means. To get those, we need to say that we are using lnmass as a reference grid - i.e., we’re marginalizing over it.
avg_sp_means <- emmeans(neand_lm, ~species|lnmass)
avg_sp_means
## lnmass = 4.197949:
## species emmean SE df lower.CL upper.CL
## neanderthal 7.271581 0.02418546 36 7.222530 7.320631
## recent 7.341859 0.01250077 36 7.316506 7.367212
##
## Confidence level used: 0.95
We can then use this for Tukey tests, etc.
#comparisons
contrast(avg_sp_means, method="tukey", adjust="none")
## lnmass = 4.197949:
## contrast estimate SE df t.ratio p.value
## neanderthal - recent -0.0702784 0.02821518 36 -2.491 0.0175
If you have an interaction, this method is no longer valid. Instead, you’ll need to specify the levels of lnmass. For example
low_lnmass_comp <- emmeans(neand_int, ~species|lnmass, at =list(lnmass = 4.1))
high_lnmass_comp <- emmeans(neand_int, ~species|lnmass, at =list(lnmass = 4.4))
contrast(low_lnmass_comp, method="tukey")
## lnmass = 4.1:
## contrast estimate SE df t.ratio p.value
## neanderthal - recent -0.1170245 0.05283693 35 -2.215 0.0334
contrast(high_lnmass_comp, method="tukey")
## lnmass = 4.4:
## contrast estimate SE df t.ratio p.value
## neanderthal - recent -0.03917795 0.0409668 35 -0.956 0.3455
Which levels of your covariate you choose to test is up to you and the questions you are asking.
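For example, a common and defensible choice is the quartiles of the covariate - a quick sketch:
#compare species at the quartiles of lnmass
lnmass_quarts <- unname(quantile(neand$lnmass, probs = c(0.25, 0.5, 0.75)))
emmeans(neand_int, ~species|lnmass, at = list(lnmass = lnmass_quarts))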
You can also see the estimates of the slopes for each species (more useful when there is an interaction - summary() is just fine without one).
#no interaction
emtrends(neand_lm, ~species, var = "lnmass")
## species lnmass.trend SE df lower.CL upper.CL
## neanderthal 0.4963161 0.09172755 36 0.310284 0.6823482
## recent 0.4963161 0.09172755 36 0.310284 0.6823482
##
## Confidence level used: 0.95
#interaction
emtrends(neand_int, ~species, var = "lnmass")
## species lnmass.trend SE df lower.CL upper.CL
## neanderthal 0.7135471 0.2270081 35 0.2526962 1.1743980
## recent 0.4540585 0.1001227 35 0.2507986 0.6573185
##
## Confidence level used: 0.95
Visualization is funny, as you want to make parallel lines and also get the CIs right. Rather than rely on ggplot2 to do this natively, we need to futz around a bit with generating predictions.
neand_predict <- predict(neand_lm, interval="confidence")
head(neand_predict)
## fit lwr upr
## 1 7.243614 7.203988 7.283240
## 2 7.223761 7.178077 7.269445
## 3 7.218798 7.171538 7.266059
## 4 7.263467 7.229347 7.297586
## 5 7.298209 7.271375 7.325042
## 6 7.293246 7.265628 7.320863
So, here we have fit values and lower and upper confidence bounds. As we have not fed predict() any new data, these values line up with our neand data frame, so we can cbind it all together and then use these values to make a prediction plot.
neand <- cbind(neand, neand_predict)
ggplot(data=neand) +
geom_point(mapping=aes(x=lnmass, y=lnbrain, color=species)) +
geom_line(mapping = aes(x = lnmass, y=fit, color=species)) +
geom_ribbon(data=neand, aes(x = lnmass,
ymin=lwr,
ymax=upr,
group = species),
fill="lightgrey",
alpha=0.5) +
theme_bw()
And there we have nice parallel lines with model predicted confidence intervals!
I’ve provided two data sets:
1) 18e4MoleRatLayabouts.csv, looking at how caste and mass affect the energy mole rats expend.
2) 18q11ExploitedLarvalFish.csv, looking at how the status of a marine area - protected or not - influences the CV around age of maturity of a number of different fish (so, age is a predictor).
Using the following workflow, analyze these data sets (a starter sketch for the first one follows the workflow).
# Load the data
# Perform a preliminary visualization
# Fit an ANCOVA model
# Test assumptions and modify model if needed
# Evaluate results
# Post-hocs if you can
# Visualize results
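To get you started on the first data set, here’s a minimal opening sketch. I’m assuming the same ./data/ folder as before and guessing at column names (caste, lnmass, lnenergy) - check head() against reality before going further:
molerats <- read.csv("./data/18e4MoleRatLayabouts.csv")
head(molerats)

#assuming columns named caste, lnmass, and lnenergy
qplot(lnmass, lnenergy, color = caste, data = molerats) +
  stat_smooth(method = "lm")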
Multiple linear regression is conceptually very similar to ANCOVA. Let’s use the Keeley fire severity plant richness data to see it in action.
keeley <- read.csv("data/Keeley_rawdata_select4.csv")
head(keeley)
## distance elev abiotic age hetero firesev cover rich
## 1 53.40900 1225 60.67103 40 0.757065 3.50 1.0387974 51
## 2 37.03745 60 40.94291 25 0.491340 4.05 0.4775924 31
## 3 53.69565 200 50.98805 15 0.844485 2.60 0.9489357 71
## 4 53.69565 200 61.15633 15 0.690847 2.90 1.1949002 64
## 5 51.95985 970 46.66807 23 0.545628 4.30 1.2981890 68
## 6 51.95985 970 39.82357 24 0.652895 4.00 1.1734866 34
For our purposes, we’ll focus on fire severity and plant cover as predictors.
I’m not going to lie, visualizing multiple continuous variables is as much of an art as a science. One can use colors and sizes of points, or slice up the data into chunks and facet that. Here are a few examples.
qplot(cover, rich, color=firesev, data=keeley) +
scale_color_gradient(low="yellow", high="red") +
theme_bw() +
facet_wrap(~cut_number(firesev, 4))
We can also look at everything!
pairs(keeley)
What other plots can you make?
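If you have the GGally package installed, its ggpairs() is a souped-up version of pairs() - one option:
library(GGally)
ggpairs(keeley)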
Fitting is straightforward for an additive MLR model. It’s just a linear model!
keeley_mlr <- lm(rich ~ firesev + cover, data=keeley)
As for assumptions, we have the usual
par(mfrow=c(2,2))
plot(keeley_mlr, which=c(1,2,5))
par(mfrow=c(1,1))
But now we also need to think about how the residuals relate to each predictor. Fortunately, there’s still residualPlots().
residualPlots(keeley_mlr, tests=FALSE)
Odd bow shape - but not too bad. Maybe there’s an interaction? Maybe we want to log transform something? Who knows!
We also want to look at multicollinearity. There are two steps for that. First, the vif (variance inflation factor)
vif(keeley_mlr)
## firesev cover
## 1.236226 1.236226
Not bad. We might also want to look at the correlation matrix. dplyr can help us here, as we want to select down to just the relevant columns.
library(dplyr)
keeley %>%
select(firesev, cover) %>%
cor()
## firesev cover
## firesev 1.0000000 -0.4371346
## cover -0.4371346 1.0000000
Also, not too bad! Well within our range of tolerances.
We evaluate the same way as usual
Anova(keeley_mlr)
## Anova Table (Type II tests)
##
## Response: rich
## Sum Sq Df F value Pr(>F)
## firesev 1258.9 1 6.5004 0.01254 *
## cover 711.6 1 3.6744 0.05854 .
## Residuals 16849.1 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And then the coefficients and R^2 (sidenote: brms provides a Bayesian R^2 via brms::bayes_R2()).
summary(keeley_mlr)
##
## Call:
## lm(formula = rich ~ firesev + cover, data = keeley)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.029 -11.583 -1.655 11.759 28.722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.9358 7.0437 7.657 2.45e-11 ***
## firesev -2.5308 0.9926 -2.550 0.0125 *
## cover 9.9105 5.1701 1.917 0.0585 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.92 on 87 degrees of freedom
## Multiple R-squared: 0.1703, Adjusted R-squared: 0.1513
## F-statistic: 8.93 on 2 and 87 DF, p-value: 0.0002968
Not amazing fit, but, the coefficients are clearly different from 0.
This is where things get sticky. We have two main approaches. First, visualizing with component + residual plots.
crPlots(keeley_mlr, smooth=FALSE)
But the values on the y-axis are… odd. We get a sense of what’s going on and the scatter after accounting for our predictor of interest, but we might want to look at, say, the evaluation of one variable at the mean of the other.
For that, we have visreg.
library(visreg)
par(mfrow=c(1,2))
visreg(keeley_mlr)
par(mfrow=c(1,1))
Now the axes make far more sense, and we have a sense of the relationship.
We can actually whip this up on our own using a data frame of new values - the median of one predictor crossed with a range of the other - and predict().
k_med_firesev <- data.frame(firesev = median(keeley$firesev),
cover = seq(0,1.5, length.out = 100))
k_med_firesev <- cbind(k_med_firesev, predict(keeley_mlr,
newdata = k_med_firesev,
interval="confidence"))
ggplot() +
geom_point(data=keeley, mapping = aes(x=cover, y = rich)) +
geom_line(data = k_med_firesev, mapping=aes(x=cover, y=fit), color="blue") +
geom_ribbon(data = k_med_firesev, mapping = aes(x=cover,
ymin = lwr,
ymax = upr),
fill="lightgrey", alpha=0.5)
Or, we can use modelr to explore the model and combine that exploration with the data. Let’s get the curve for cover at four levels of fire severity. We’ll use both modelr::data_grid and modelr::add_predictions for a nice easy workflow.
library(modelr)
k_firesev_explore <- data_grid(keeley,
cover = seq_range(cover, 100),
firesev = seq_range(firesev, 4)) %>%
add_predictions(model = keeley_mlr, var = "rich")
Nice, no? Sadly, add_predictions doesn’t give us variance (but see my code in https://github.com/tidyverse/modelr/issues/38 for a possible solution if you want it). If you want those, you’ll have to go back to using predict, although you can still use it with data_grid. Actually, let’s try that.
k_firesev_explore <- data_grid(keeley,
cover = seq_range(cover, 100),
firesev = seq_range(firesev, 4)) %>%
cbind(predict(keeley_mlr, newdata = ., interval = "confidence")) %>%
rename(rich = fit)
Huh. Not bad.
Anywho, then we can visualize with the data!
ggplot(data=keeley,
mapping = aes(x=cover, y = rich, color = firesev)) +
geom_point() +
geom_line(data = k_firesev_explore, aes(group = firesev)) +
scale_color_gradient(low="yellow", high="red") +
facet_wrap(~cut_interval(firesev, 4)) +
theme_bw(base_size = 14)
OK, here are two datasets to work with:
1) planktonSummary.csv, showing plankton from Lake Baikal (thanks, Stephanie Hampton). Evaluate how chlorophyll (CHLFa) is affected by other predictors.
2) SwaddleWestNile2002NCEAS_shortnames.csv, about the prevalence of West Nile virus in birds around Sacramento County in California. What predicts human WNV?
Using the following workflow, analyze these data sets (a starter sketch for the plankton data follows the workflow).
# Load the data
# Perform a preliminary visualization. Play with this and choose two predictors
# Fit a MLR model
# Test assumptions and modify model if needed
# Evaluate results
# Visualize results
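As a starting sketch for the plankton data (assuming the file sits in the same ./data/ folder used above, and a recent dplyr for where()):
plankton <- read.csv("./data/planktonSummary.csv")
str(plankton)

#scatterplot matrix of just the numeric columns
plankton %>%
  select(where(is.numeric)) %>%
  pairs()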
What if there was an interaction here? One reasonable hypothesis is that fires affect stands of different ages differently. So, let’s look at richness as a function of age and fire severity. First, we’ll visualize, then fit a model.
ggplot(keeley,
aes(x = age, y = rich, color = firesev)) +
geom_point(size = 2) +
scale_color_viridis_c(option = "B") +
facet_wrap(~cut_number(firesev, 4))
Huh. Kinda looks like age doesn’t have any effect until high fire severity.
keeley_mlr_int <- lm(rich ~ age * firesev, data=keeley)
The assumptions here are the same as for MLR - normality of residuals, homoscedasticity, the relationship should not be driven by outliers alone, and, of course, lack of collinearity. Let’s explore the first few.
par(mfrow=c(2,2))
plot(keeley_mlr_int, which = c(1,2,5))
par(mfrow=c(1,1))
That looks good! If we look at the partial residual plots…
residualPlots(keeley_mlr_int)
## Test stat Pr(>|Test stat|)
## age -0.6982 0.4869
## firesev 0.6176 0.5385
## Tukey test 1.3802 0.1675
Also, excellent. Now, the vif…
vif(keeley_mlr_int)
## age firesev age:firesev
## 7.967244 4.879394 15.800768
So… that 15 could be troublesome. Often it’s not an issue with interactions and other nonlinearities, but how can we be certain? First, let’s look at the correlation between the predictors.
keeley %>% select(firesev, age) %>%
mutate(int = firesev*age) %>%
cor()
## firesev age int
## firesev 1.0000000 0.4538654 0.7743624
## age 0.4538654 1.0000000 0.8687952
## int 0.7743624 0.8687952 1.0000000
OK, the interaction is pretty highly correlated with age. Enough that, well, it could be troublesome to pull out the true effect. So, what’s the answer?
Well, one way to break that correlation is to center our predictors before using them in the relationship. It changes the meaning of the coefficients, but the model results are the same - particularly with respect to F-tests, etc. Let’s try that here.
keeley <- keeley %>%
mutate(firesev_c = firesev - mean(firesev),
age_c = age - mean(age))
keeley_mlr_int_c <- lm(rich ~ age_c * firesev_c, data=keeley)
vif(keeley_mlr_int_c)
## age_c firesev_c age_c:firesev_c
## 1.261883 1.261126 1.005983
We see the vif is now small. Moreover…
keeley %>% select(firesev_c, age_c) %>%
mutate(int = firesev_c*age_c) %>%
cor()
## firesev_c age_c int
## firesev_c 1.00000000 0.45386536 -0.06336909
## age_c 0.45386536 1.00000000 -0.06792114
## int -0.06336909 -0.06792114 1.00000000
Look at that correlation melt away. And notice that our F table from Anova(keeley_mlr_int_c) is identical to the one before. We’re still explaining variation in the same way.
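You can verify this directly - both calls use Anova() from car, which is already loaded:
Anova(keeley_mlr_int)
Anova(keeley_mlr_int_c)
Let’s see how the coefficients differ.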
coef(keeley_mlr_int)
## (Intercept) age firesev age:firesev
## 42.9352358 0.8268119 3.1056379 -0.2302444
coef(keeley_mlr_int_c)
## (Intercept) age_c firesev_c age_c:firesev_c
## 51.3790453 -0.2242540 -2.7809451 -0.2302444
So, in the uncentered model, the intercept is the richness when age and firesev are both 0. The age and firesev coefficients are each predictor’s effect when the other is 0. And the interaction is how they modify one another. Note that in the centered model, the interaction is the same. But now the intercept is the richness at the average level of both predictors, and the additive coefficients are the effects of each predictor at the average level of the other.
Given the lack of difference between the two models, we’re going to work with the original one. We’ve seen the coefficients and talked about their interpretation. The rest of the output can be interpreted as usual for a linear model.
summary(keeley_mlr_int)
##
## Call:
## lm(formula = rich ~ age * firesev, data = keeley)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.7559 -10.1586 0.0662 10.7169 30.1410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.93524 7.85054 5.469 4.37e-07 ***
## age 0.82681 0.31303 2.641 0.009809 **
## firesev 3.10564 1.86304 1.667 0.099157 .
## age:firesev -0.23024 0.06412 -3.591 0.000548 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.15 on 86 degrees of freedom
## Multiple R-squared: 0.268, Adjusted R-squared: 0.2425
## F-statistic: 10.5 on 3 and 86 DF, p-value: 5.934e-06
Note that it doesn’t look like fire severity has an effect when age is 0. But what does age = 0 really mean? This is one reason why the centered model is often more interpretable.
So, how do we understand the interaction?
We can do things like calculate out the effect of one predictor at different levels of the other…
tibble(age = 0:10) %>%
mutate(fire_effect = coef(keeley_mlr_int)[3] + coef(keeley_mlr_int)[4]*age)
## # A tibble: 11 x 2
## age fire_effect
## <int> <dbl>
## 1 0 3.11
## 2 1 2.88
## 3 2 2.65
## 4 3 2.41
## 5 4 2.18
## 6 5 1.95
## 7 6 1.72
## 8 7 1.49
## 9 8 1.26
## 10 9 1.03
## 11 10 0.803
But this can be unsatisfying - we have to propagate error by hand and - ECH. Why not try and create something more intuitive…
(Sidenote: this is trivial to do, with all of the attendant uncertainty, in Bayes. Just sayin’.)
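If you’d rather not go full Bayes but still want the error propagated for you, emtrends() from emmeans (loaded earlier) will evaluate the slope of fire severity at chosen stand ages, standard errors included. A minimal sketch:
#slope of firesev at a few stand ages, with SEs handled for us
emtrends(keeley_mlr_int, ~age, var = "firesev",
         at = list(age = c(0, 5, 10)))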
Fundamentally, we have to create visualizations that look at different levels of both predictors somehow. There are a number of strategies.
We can begin by looking at the effect of one variable at different levels of the other.
visreg(keeley_mlr_int, "age", "firesev", gg=TRUE)
Here we see a plot where the points are adjusted for different levels of fire severity - an even split between three levels - and then we see the resulting fit lines. We can of course reverse this, if we’re more interested in the other effect.
visreg(keeley_mlr_int, "firesev", "age", gg=TRUE)
Both tell a compelling story of changing slopes that are easily understood. For example, in the latter, fire has a stronger effect on older stands.
We might want to roll our own, however, and see the plot with different slices of the data. Fortunately, we can use the identical strategy as our MLR above - only now the slope changes. So, using the same strategy as above for making predicted data frames:
k_int_explore <- data_grid(keeley,
firesev = seq_range(firesev, 100),
age = seq_range(age, 4)) %>%
cbind(predict(keeley_mlr_int, newdata = ., interval = "confidence")) %>%
rename(rich = fit)
Notice the lack of difference. We can now play with how we want to plot this. Here’s one, but you can come up with others easily.
ggplot(keeley,
aes(x = firesev, y = rich, color = age)) +
geom_point() +
geom_line(data = k_int_explore, aes(group = age)) +
geom_ribbon(data = k_int_explore, aes(ymin = lwr, ymax = upr), alpha = 0.2, color = NA) +
facet_wrap(~cut_interval(age,4)) +
scale_color_viridis_c(option = "D")
Finally, a surface plot can be very helpful - particularly under a condition of many interactions. To make a surface plot, we can look at the model and extract values at the intersection of multiple predictors and then visualize a grid where color or some other property reflects predicted values. It lets us see more nuance about an interaction. Let’s start with getting our surface values for a 100x100 grid of predictors.
keeley_surface <- data_grid(keeley,
age = seq_range(age, 100),
firesev = seq_range(firesev, 100)) %>%
add_predictions(keeley_mlr_int, var = "rich")
Now, we can plot this grid using geom_raster
or geom_tile
(raster will smooth things a bit).
ggplot(keeley_surface,
aes(x = age, y = firesev, fill = rich)) +
geom_raster() +
scale_fill_viridis_c(option = "B")
Now we can see the landscape more clearly. Old stands with high fire severity have low richness. However, either youth or low fire severity means higher richness. This kind of technique can be great with higher-order interactions. For example, you can facet by up to two variables (or more) to examine higher-order interactions, etc.
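If you prefer isolines to a heatmap, the same prediction grid works with filled contours - a sketch (geom_contour_filled() needs ggplot2 3.3 or later):
ggplot(keeley_surface,
       aes(x = age, y = firesev, z = rich)) +
  geom_contour_filled()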
Also note, if you want to be lazy, you can do this with visreg for two variables.
visreg2d(keeley_mlr_int, "age", "firesev")
OK, here are two datasets to work with. Choose one, and play around with possible interactions using ggplot. Then fit and evaluate a model!
1) planktonSummary.csv, showing plankton from Lake Baikal (thanks, Stephanie Hampton). Evaluate how chlorophyll (CHLFa) is affected by other predictors.
2) SwaddleWestNile2002NCEAS_shortnames.csv, about the prevalence of West Nile virus in birds around Sacramento County in California. What predicts human WNV?
Using the following workflow, analyze these data sets.
# Load the data
# Perform a preliminary visualization. Play with this and choose two predictors
# Fit a MLR model with an interaction
# Test assumptions and modify model if needed
# Evaluate results
# Visualize results
So, what if you have a number of models and want to evaluate them against one another using AIC? If you only have two or three and just want to look at their AIC scores, R has a simple function - AIC!
AIC(keeley_mlr)
## [1] 734.3108
AIC(keeley_mlr_int)
## [1] 725.0355
Hey, notice that, as a predictor of richness, the interaction model does better. But there are a lot of possible models here with age, firesev, and cover as predictors - and lots of possible interactions! Let’s keep it simple at first.
We’ll compare five models. The first two are the ones we’ve already fit. The last three are as follows:
keeley_no_ints <- lm(rich ~ age + firesev + cover, data = keeley)
keeley_two_ints <- lm(rich ~ age * firesev + cover*firesev + age*cover, data = keeley)
keeley_three_ints <- lm(rich ~ age * firesev * cover, data = keeley)
Should be fun!
So, with all of these models, we want to build an AIC table with weights, delta AICs, etc, before we go forward. To do that, we use the ‘AICcmodavg’ package, which is brilliant and constantly expanding. It works with lists of models.
library(AICcmodavg)
mod_list <- list(keeley_mlr, keeley_mlr_int, keeley_no_ints, keeley_two_ints, keeley_three_ints)
names(mod_list) <- c("mlr", "int", "noint", "twoint", "threeint")
aictab(mod_list)
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## int 5 725.75 0.00 0.87 0.87 -357.52
## twoint 8 730.41 4.66 0.08 0.95 -356.32
## threeint 9 732.32 6.57 0.03 0.98 -356.03
## mlr 4 734.78 9.03 0.01 0.99 -363.16
## noint 5 735.49 9.74 0.01 1.00 -362.39
Notice that by default it does the AICc. Let’s see how this compares to the AIC.
aictab(mod_list, second.ord = FALSE)
##
## Model selection based on AIC:
##
## K AIC Delta_AIC AICWt Cum.Wt LL
## int 5 725.04 0.00 0.79 0.79 -357.52
## twoint 8 728.63 3.60 0.13 0.92 -356.32
## threeint 9 730.07 5.03 0.06 0.99 -356.03
## mlr 4 734.31 9.28 0.01 0.99 -363.16
## noint 5 734.78 9.74 0.01 1.00 -362.39
No huge differences here, save maybe in weighting. But qualitatively the same.
All signs point to int being the ‘best’, but the two- and three-interaction models are comparable. But are they really different? Let’s look using broom. This is where broom starts to shine.
library(broom)
all_coefs <- bind_rows(tidy(keeley_mlr_int) %>% mutate(model = "int"),
tidy(keeley_two_ints) %>% mutate(model = "two"),
tidy(keeley_three_ints) %>% mutate(model = "three")) %>%
filter(term != "(Intercept)")
Let’s look at the three-way interaction.
tail(all_coefs)
## # A tibble: 6 x 6
## term estimate std.error statistic p.value model
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 firesev 1.27 4.94 0.257 0.798 three
## 2 cover -13.1 27.5 -0.478 0.634 three
## 3 age:firesev -0.107 0.149 -0.718 0.475 three
## 4 age:cover 1.10 1.07 1.03 0.306 three
## 5 firesev:cover 2.24 6.46 0.347 0.730 three
## 6 age:firesev:cover -0.154 0.213 -0.721 0.473 three
Well, not different from 0. Nor are any of the two-way or other terms in that model. It’s not great.
What about the two-way interactions?
all_coefs %>%
filter(term %in% c("age:cover", "age:firesev", "firesev:cover"))
## # A tibble: 7 x 6
## term estimate std.error statistic p.value model
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 age:firesev -0.230 0.0641 -3.59 0.000548 int
## 2 age:firesev -0.197 0.0810 -2.43 0.0170 two
## 3 firesev:cover -1.70 3.44 -0.494 0.622 two
## 4 age:cover 0.394 0.428 0.922 0.359 two
## 5 age:firesev -0.107 0.149 -0.718 0.475 three
## 6 age:cover 1.10 1.07 1.03 0.306 three
## 7 firesev:cover 2.24 6.46 0.347 0.730 three
We can see that, across models, age:firesev is different from 0, but the others never are. What about cover?
all_coefs %>%
filter(term == "cover")
## # A tibble: 2 x 6
## term estimate std.error statistic p.value model
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 cover 3.44 15.0 0.229 0.819 two
## 2 cover -13.1 27.5 -0.478 0.634 three
Nope - never different from 0.
So, functionally, keeley_mlr_int is where it’s at. You could justify choosing it as a ‘best’ model. But… why would you?
We can look at averaged coefficients
modavg(mod_list, parm="firesev")
Except this package doesn’t play well with interactions. With good theoretical reason. If we’re including an interaction, in some ways, we’re double-counting it by having it in two places. Better to go straight to prediction!
Let’s start by creating a sample data frame to work with.
aic_newdata <- data_grid(keeley,
firesev = seq_range(firesev, 100),
age = seq_range(age, 4),
cover = seq_range(cover, 4))
OK, so, we’re looking at how fire severity’s effect is modified by age and cover. Note, this is how we’d do it for a 3-way interaction, regardless. Now, let’s get some model-averaged predictions. This can take a bit, as we’re doing a lot of predictions.
aic_preds <- bind_cols(aic_newdata,
modavgPred(mod_list, newdata = aic_newdata) %>% as.data.frame) %>%
rename(rich = mod.avg.pred)
And now, we plot! Just like any other MLR, but, now we can really understand what’s happening here.
ggplot(keeley,
aes(x = firesev, y = rich)) +
geom_point() +
facet_grid(cut_interval(age, 4) ~ cut_interval(cover, 4),
labeller = labeller(.rows = label_both, .cols = label_both)) +
geom_line(data = aic_preds) +
geom_ribbon(data = aic_preds, mapping = aes(ymin = lower.CL, ymax = upper.CL),
alpha = 0.1)
So, we can see pretty clearly here that cover… doesn’t matter! Again, this would support our intuition to go with the firesev*age only model.
Note, we could have worked with all possible models using MuMIn, but I generally recommend against it.
library(MuMIn)
options(na.action = "na.fail")
allmods <- dredge(keeley_three_ints)
options(na.action = "na.omit")
This object can be used for a multitude of things, such as coefficients -
ensemble_model <- model.avg(allmods, fit = TRUE)
summary(ensemble_model)
##
## Call:
## model.avg(object = get.models(object = allmods, subset = NA))
##
## Component model call:
## lm(formula = rich ~ <19 unique rhs>, data = keeley)
##
## Component models:
## df logLik AICc delta weight
## 135 5 -357.52 725.75 0.00 0.41
## 1235 6 -356.78 726.57 0.82 0.27
## 12345 7 -356.45 728.26 2.52 0.12
## 12356 7 -356.78 728.92 3.17 0.08
## 123456 8 -356.32 730.41 4.66 0.04
## 1234 6 -359.45 731.92 6.17 0.02
## 1234567 9 -356.03 732.32 6.57 0.02
## 124 5 -361.51 733.73 7.98 0.01
## 12346 7 -359.42 734.21 8.46 0.01
## 236 5 -361.79 734.29 8.54 0.01
## 23 4 -363.16 734.78 9.03 0.00
## 1236 6 -361.07 735.15 9.40 0.00
## 123 5 -362.39 735.49 9.74 0.00
## 13 4 -363.80 736.08 10.33 0.00
## 3 3 -365.02 736.31 10.56 0.00
## 12 4 -364.35 737.16 11.41 0.00
## 2 3 -366.40 739.08 13.33 0.00
## 1 3 -367.25 740.78 15.03 0.00
## (Null) 2 -371.56 747.25 21.50 0.00
##
## Term codes:
## age cover firesev age:cover
## 1 2 3 4
## age:firesev cover:firesev age:cover:firesev
## 5 6 7
##
## Model-averaged coefficients:
## (full average)
## Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept) 42.655533 12.835277 12.967892 3.289 0.0010 **
## age 0.677043 0.489664 0.493695 1.371 0.1703
## firesev 2.770071 2.600026 2.627424 1.054 0.2917
## age:firesev -0.204642 0.086827 0.087603 2.336 0.0195 *
## cover 1.590026 9.333162 9.435542 0.169 0.8662
## age:cover 0.093783 0.305187 0.307432 0.305 0.7603
## cover:firesev -0.012407 1.567992 1.587464 0.008 0.9938
## age:cover:firesev -0.002378 0.032584 0.032907 0.072 0.9424
##
## (conditional average)
## Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept) 42.65553 12.83528 12.96789 3.289 0.00100 **
## age 0.68590 0.48665 0.49076 1.398 0.16222
## firesev 2.79738 2.59815 2.62584 1.065 0.28673
## age:firesev -0.21686 0.07308 0.07405 2.928 0.00341 **
## cover 2.72936 12.10023 12.23576 0.223 0.82349
## age:cover 0.45636 0.53644 0.54264 0.841 0.40035
## cover:firesev -0.07963 3.97159 4.02093 0.020 0.98420
## age:cover:firesev -0.15356 0.21293 0.21611 0.711 0.47737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Relative variable importance:
## firesev age age:firesev cover age:cover
## Importance: 0.99 0.99 0.94 0.58 0.21
## N containing models: 14 14 6 14 6
## cover:firesev age:cover:firesev
## Importance: 0.16 0.02
## N containing models: 6 1
Ensemble prediction, and more…
predict(ensemble_model, newdata = keeley, se.fit = TRUE)
## $fit
## [1] 55.94025 51.82056 54.75745 55.35670 54.38951 54.52023 49.07104
## [8] 52.80317 27.07656 40.42988 24.59934 35.39336 35.74320 42.10585
## [15] 53.63905 44.38464 53.07259 52.55435 48.89172 49.44536 54.49227
## [22] 52.66096 52.39500 53.67973 48.25867 51.04438 26.23036 32.89635
## [29] 27.27638 17.70247 41.91399 49.11634 55.09672 54.17400 46.17010
## [36] 44.99241 48.90847 42.54092 46.51661 54.11613 56.67886 53.90774
## [43] 55.30103 55.82946 54.24276 53.65008 51.26428 53.07287 44.82727
## [50] 50.36650 48.34256 51.93611 56.86923 48.51114 52.30049 56.26858
## [57] 52.56040 52.77886 55.75125 53.07282 49.32055 50.50299 54.47587
## [64] 51.44560 51.80181 52.50160 50.33155 54.42453 53.50820 52.05141
## [71] 52.72063 50.53759 45.74539 37.75844 46.87192 52.30524 51.74679
## [78] 52.60328 55.60156 53.76973 52.73553 53.14333 50.67973 50.26351
## [85] 53.21702 56.05900 54.43245 49.10692 50.03613 47.59231
##
## $se.fit
## [1] 3.848471 2.145000 2.381958 2.857262 3.111395 2.584599 2.277701
## [8] 2.897538 4.603265 2.363902 7.312039 3.608073 3.115718 4.446242
## [15] 1.916508 5.205473 1.667000 3.797781 3.075767 1.892694 3.634058
## [22] 1.793240 1.758975 2.884631 6.570208 3.538713 4.938348 3.799940
## [29] 5.614239 6.769176 2.128169 2.681771 4.127411 2.782341 2.134876
## [36] 2.109807 1.609281 3.903918 4.439608 2.048748 2.634068 2.004455
## [43] 3.630051 2.444019 2.036847 2.201299 2.984301 2.146869 3.744039
## [50] 1.801258 2.228370 4.417619 2.806461 2.299537 3.916103 2.701289
## [57] 3.408298 1.801591 2.958364 2.201928 5.816436 4.954496 1.900818
## [64] 1.604661 3.195837 1.928137 1.666128 3.530190 2.637017 3.597457
## [71] 3.308711 1.798562 3.160948 5.159222 2.209312 4.094805 2.661071
## [78] 2.197053 4.755136 2.871134 2.568999 2.664298 2.346199 2.085816
## [85] 2.520114 2.638233 4.974604 2.050819 5.456887 2.609242
Go back to the previous datasets and try multiple models! They don’t have to contain interactions (really!). To remind you, they are:
1) planktonSummary.csv, showing plankton from Lake Baikal (thanks, Stephanie Hampton). Evaluate how chlorophyll (CHLFa) is affected by other predictors.
2) SwaddleWestNile2002NCEAS_shortnames.csv, about the prevalence of West Nile virus in birds around Sacramento County in California. What predicts human WNV?
But now your workflow is different (depending also on whether you use AICcmodavg or MuMIn); a skeletal sketch follows the workflow below.
# Fit many models (or a master one and then dredge)
# Put into a list or modavg object
# Evaluate the AIC table - draw conclusions if possible
# Look at model averaged quantities
# Make ensemble forecasts and visualize
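As that skeletal sketch, here’s the AICcmodavg version of the workflow, using keeley as a stand-in so the code actually runs - swap in your own data, response, and predictors:
#fit a small candidate set (formulas here are placeholders)
cand_mods <- list(
  additive = lm(rich ~ firesev + cover, data = keeley),
  interact = lm(rich ~ firesev * cover, data = keeley)
)

#evaluate the AICc table and draw what conclusions you can
aictab(cand_mods)

#ensemble forecasts over a grid of the focal predictor
new_dat <- data_grid(keeley,
                     firesev = seq_range(firesev, 100),
                     cover = seq_range(cover, 3))
head(as.data.frame(modavgPred(cand_mods, newdata = new_dat)))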