Let’s start with the roach data, which looks at the activity of prothoracic ganglia under different temperature regimes. Here’s the data and a plot of a linear model fit.
library(ggplot2)
roaches <- read.csv("http://biol607.github.io/lab/data/chap17f5_4CockroachNeurons.csv")
roach_lm <- lm(rate ~ temperature, data = roaches)
roach_lm_plot <- ggplot(roaches,
                        aes(x = temperature, y = rate)) +
  geom_point() +
  stat_smooth(method = "lm", color = "blue")
roach_lm_plot
We can easily pluck out any random point and see how out-of-sample error works. For that, we use the {modelr} library and its mse() function to get the mean square error.
out_of_sample <- roaches[5,]
roaches_no_5 <- roaches[-5, ]
Now, let’s refit the model and use mse() to see the out-of-sample mean square error.
library(modelr)
roach_lm_no_5 <- lm(rate ~ temperature,
                    data = roaches_no_5)
mse(model = roach_lm_no_5,
    data = out_of_sample)
## [1] 1453.541
We can fit an intercept-only model and compare the MSE.
roach_int_no_5 <- lm(rate ~ 1,
                     data = roaches_no_5)
mse(model = roach_int_no_5,
    data = out_of_sample)
## [1] 1828.838
Much higher.
OK, let’s look at a more general solution here. One problem in doing cross-validation is that making folds in the data can balloon the size of the data in memory. For that, there’s the {rsample} library, which plays beautifully with {dplyr} and {purrr}. It uses some underlying tricks to avoid ballooning the data. Let’s look at an rsample object for 5-fold cross-validation. Note, the function for this is vfold_cv(), not kfold() - no, I’m not sure why.
library(rsample)
roach_five_fold <- vfold_cv(roaches, v = 5)
roach_five_fold
## # 5-fold cross-validation
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [45/12]> Fold1
## 2 <split [45/12]> Fold2
## 3 <split [46/11]> Fold3
## 4 <split [46/11]> Fold4
## 5 <split [46/11]> Fold5
OK, neat - we have a tibble with a column of splits. We can use the function analysis() to get the training set from a split and assessment() to get the held-out test set.
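For instance, a quick peek at the first split - just a sketch to show the accessors, with row counts matching the split sizes printed above:
first_split <- roach_five_fold$splits[[1]]
nrow(analysis(first_split))   # the training set for fold 1
nrow(assessment(first_split)) # the held-out set for fold 1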
We can use purrr::map() to generate a new column of models fit on each fold.
library(dplyr)
library(purrr)
roach_five_fold <- roach_five_fold %>%
  mutate(mod = map(splits,
                   ~lm(rate ~ temperature,
                       data = analysis(.))))
roach_five_fold
## # 5-fold cross-validation
## # A tibble: 5 × 3
## splits id mod
## <list> <chr> <list>
## 1 <split [45/12]> Fold1 <lm>
## 2 <split [45/12]> Fold2 <lm>
## 3 <split [46/11]> Fold3 <lm>
## 4 <split [46/11]> Fold4 <lm>
## 5 <split [46/11]> Fold5 <lm>
Cool! We have a list of fit linear models! Now, how do we get the MSE? purrr has a nice function called map2(), which works just like map() except that, instead of iterating over one list or vector, we iterate over two, and use .x and .y instead of just . in the function. Let’s see how this works.
roach_five_fold <- roach_five_fold %>%
  mutate(mse = map2_dbl(mod, splits,
                        ~mse(model = .x, data = assessment(.y))))
roach_five_fold
## # 5-fold cross-validation
## # A tibble: 5 × 4
## splits id mod mse
## <list> <chr> <list> <dbl>
## 1 <split [45/12]> Fold1 <lm> 539.
## 2 <split [45/12]> Fold2 <lm> 648.
## 3 <split [46/11]> Fold3 <lm> 293.
## 4 <split [46/11]> Fold4 <lm> 749.
## 5 <split [46/11]> Fold5 <lm> 778.
We can then get the average MSE by taking the mean of that last column. We could do the same for an intercept-only model and compare.
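Here’s a minimal sketch of that comparison, reusing the workflow above with an intercept-only model:
mean(roach_five_fold$mse) # average out-of-sample MSE across folds

# same workflow, but with an intercept-only model
roach_five_fold_int <- roach_five_fold %>%
  mutate(mod = map(splits,
                   ~lm(rate ~ 1, data = analysis(.))),
         mse = map2_dbl(mod, splits,
                        ~mse(model = .x, data = assessment(.y))))
mean(roach_five_fold_int$mse)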
We can do the same for leave-one-out using rsample::loo_cv() with pretty much the same workflow.
roach_loo <- roaches %>%
  loo_cv() %>%
  mutate(tempmod = map(splits,
                       ~lm(rate ~ temperature,
                           data = analysis(.))),
         intmod = map(splits,
                      ~lm(rate ~ 1,
                          data = analysis(.))))
What’s cool here is that we can use tidyr::pivot_longer() and then get the mean square error for all of it.
roach_loo <- roach_loo %>%
  tidyr::pivot_longer(cols = c(tempmod, intmod),
                      names_to = "mod_name",
                      values_to = "mod") %>%
  mutate(mse = map2_dbl(mod, splits,
                        ~mse(model = .x, data = assessment(.y))))
Now we can look at the LOO MSE for each point.
ggplot(roach_loo,
       aes(x = id, y = sqrt(mse), color = mod_name)) +
  geom_point() +
  scale_x_discrete(labels = NULL)
OK, can’t see a pattern there, but…
roach_loo %>%
  group_by(mod_name) %>%
  summarise(mse = mean(mse))
## # A tibble: 2 × 2
## mod_name mse
## <chr> <dbl>
## 1 intmod 819.
## 2 tempmod 570.
This is great. But it’s kind of annoying, as we have to do all of this work. It’s wonderful for arbitrary out-of-sample error functions. But… wouldn’t it be nice if we could do a lot of this in one line?
Enter the {boot} package, which also has some nice functions for bootstrapping. Its cv.glm() function works with glm() fits to extract deviances and calculate an out-of-sample deviance for either a k-fold or LOO CV. The result has an element called delta whose first entry is the average out-of-sample deviance.
Let’s see it in action for a LOO analysis.
library(boot)
roach_glm <- glm(rate ~ temperature, data = roaches)
loo_roach <- cv.glm(roaches, roach_glm, K = nrow(roaches))
loo_roach$delta[1]
## [1] 569.5849
Hey, it’s the same as above… but less work! Note the K argument. We can lower it to do a more traditional k-fold analysis. Now, granted, it’s a bit ungainly, but we can wrap it up in a function to make it easier to use.
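For example, here’s a minimal sketch of such a wrapper - cv_deviance is a hypothetical name, not part of {boot}:
# wrap boot::cv.glm() so we only have to remember one call
# (cv_deviance is a made-up helper name)
cv_deviance <- function(fit, data, k = nrow(data)) {
  cv.glm(data, fit, K = k)$delta[1]
}

cv_deviance(roach_glm, roaches)        # LOO, as above
cv_deviance(roach_glm, roaches, k = 5) # a 5-fold CV instead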
Last, AIC! Any fit model can give us an AIC score.
AIC(roach_lm)
## [1] 525.9157
Fantastic! This uses all of the data. And, hey, it’s not TOO far off from the LOO CV. Let’s compare it to the intercept-only model.
roach_int <- lm(rate ~ 1, data = roaches)
AIC(roach_int)
## [1] 546.08
Nice - we can see the difference. But wouldn’t it be nice if we got an easy-to-use table?
Enter {AICcmodavg}! Its function aictab() takes a list of models and a vector of model names, and gives you an easy-to-read output.
library(AICcmodavg)
aictab(list(roach_lm, roach_int),
       c("Temperature Model", "Intercept Only"))
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## Temperature Model 3 526.37 0.00 1 1 -259.96
## Intercept Only 2 546.30 19.93 0 1 -271.04
Note that by default it gives the AICc. You can, however, set second.ord = FALSE to get the AIC.
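For example, re-running the table above with AIC instead of AICc:
aictab(list(roach_lm, roach_int),
       c("Temperature Model", "Intercept Only"),
       second.ord = FALSE)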
This is somewhat boring (yes, temperature matters), so let’s make it more interesting!
roach_lm_2 <- lm(rate ~ poly(temperature,2), data = roaches)
roach_lm_3 <- lm(rate ~ poly(temperature,3), data = roaches)
roach_lm_4 <- lm(rate ~ poly(temperature,4), data = roaches)
modlist <- list(roach_lm, roach_int, roach_lm_2, roach_lm_3, roach_lm_4)
modnames <- 0:4
aictab(modlist, modnames)
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## 3 5 523.54 0.00 0.54 0.54 -256.18
## 4 6 525.06 1.52 0.25 0.80 -255.69
## 0 3 526.37 2.83 0.13 0.93 -259.96
## 2 4 527.66 4.12 0.07 1.00 -259.44
## 1 2 546.30 22.76 0.00 1.00 -271.04
We can get the model-averaged coefficients, as it were, with these models using modavg().
modavg(modlist, modnames, parm = "(Intercept)")
##
## Multimodel inference on "(Intercept)" based on AICc
##
## AICc table used to obtain model-averaged estimate:
##
## K AICc Delta_AICc AICcWt Estimate SE
## 0 3 526.37 2.83 0.13 133.95 8.69
## 1 2 546.30 22.76 0.00 92.48 3.76
## 2 4 527.66 4.12 0.07 92.48 3.12
## 3 5 523.54 0.00 0.54 92.48 2.98
## 4 6 525.06 1.52 0.25 92.48 2.98
##
## Model-averaged estimate: 97.96
## Unconditional SE: 14.67
## 95% Unconditional confidence interval: 69.21, 126.71
There’s a lot of controversy, however, as to whether this is a good idea. Rather, we should focus on model predictions! We can get these and their fit intervals using modavgPred().
mod_pred <- tibble(temperature = seq(11, 34, length.out = 200)) |>
  modavgPred(modlist, modnames, newdata = _) |>
  as_tibble()
mod_pred
## # A tibble: 200 × 7
## type mod.avg.pred uncond.se conf.level lower.CL upper.CL matrix…¹ [,"un…²
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 response 91.9 13.7 0.95 65.0 119. 91.9 13.7
## 2 response 93.0 12.9 0.95 67.8 118. 93.0 12.9
## 3 response 94.0 12.0 0.95 70.5 118. 94.0 12.0
## 4 response 95.0 11.2 0.95 73.1 117. 95.0 11.2
## 5 response 96.0 10.4 0.95 75.6 116. 96.0 10.4
## 6 response 96.9 9.69 0.95 77.9 116. 96.9 9.69
## 7 response 97.8 9.00 0.95 80.2 115. 97.8 9.00
## 8 response 98.7 8.35 0.95 82.3 115. 98.7 8.35
## 9 response 99.5 7.75 0.95 84.3 115. 99.5 7.75
## 10 response 100. 7.19 0.95 86.2 114. 100. 7.19
## # … with 190 more rows, 1 more variable: matrix.output[3:4] <dbl>, and
## # abbreviated variable names ¹​matrix.output[,"mod.avg.pred"], ²​[,"uncond.se"]
Prediction intervals are a bit trickier. They’re not provided by the function. Fortunately, uh, I figured this out a few years ago. And then forgot I figured it out. But past me had been smart and made https://rpubs.com/jebyrnes/aic_prediction to look at. It all comes down to getting prediction intervals from each model, and then calculating a weighted average.
pred_list <- modlist |>
  map(predict,
      interval = "prediction",
      newdata = tibble(temperature = seq(11, 34, length.out = 200))) |>
  map(as_tibble)
Every element of that list corresponds to a model. We can get the weighted value of each by multiplying it against that model’s AICc weight.
aicc_weights <- aictab(modlist, modnames, sort = FALSE)$AICcWt
weighted_pred_list <- pred_list |>
  map2(aicc_weights, `*`)
Great! Now we just need to sum up across the list. We can use purrr::reduce() for this, which iterates through a list and applies a function to the cumulative result.
reduce(1:5, sum) #for example
## [1] 15
pred_intervals <- reduce(weighted_pred_list, `+`) |>
  select(-fit)
Now we can bind this to our model predictions and plot!
mod_pred <- mod_pred |>
  mutate(temperature = seq(11, 34, length.out = 200)) |>
  rename(rate = mod.avg.pred) |>
  select(temperature, rate, lower.CL, upper.CL) |>
  bind_cols(pred_intervals)
ggplot(mod_pred,
       aes(x = temperature, y = rate)) +
  geom_ribbon(fill = "lightblue",
              aes(ymin = lwr, ymax = upr)) +
  geom_ribbon(fill = "darkgrey",
              aes(ymin = lower.CL, ymax = upper.CL)) +
  geom_line() +
  geom_point(data = roaches)
Load up the keeley data with data("keeley", package = "piecewiseSEM"). Fit the following five models.
data("keeley", package = "piecewiseSEM")
keeley_firecover <- lm(rich ~ firesev + cover, data = keeley)
keeley_agefiresev_int <- lm(rich ~ age * firesev, data = keeley)
keeley_agefirecover <- lm(rich ~ age + firesev + cover, data = keeley)
keeley_two_ints <- lm(rich ~ age * firesev + cover * firesev + age * cover, data = keeley)
keeley_three_ints <- lm(rich ~ age * firesev * cover, data = keeley)
Which model is the best? Would you use just that one, or model-averaged predictions? Plot the model-averaged predictions for a range of age, firesev, or cover, holding all other predictors at their mean.