Let’s start with the cockroach data, looking at the activity of prothoracic ganglia under different temperature regimes. Here’s the data and a plot of a linear model fit.
library(ggplot2)
roaches <- read.csv("data/chap17f5_4CockroachNeurons.csv")
roach_lm <- lm(rate ~ temperature, data = roaches)
roach_lm_plot <- ggplot(roaches,
                        aes(x = temperature, y = rate)) +
  geom_point() +
  stat_smooth(method = "lm", color = "blue")
roach_lm_plot
We can easily pluck out any point and see how out-of-sample error works. For that, we use the {modelr} library and its mse() function to get the mean squared error.
# hold out the fifth row, and keep the rest for fitting
out_of_sample <- roaches[5, ]
roaches_no_5 <- roaches[-5, ]
Now, let’s refit the model and use mse() to see the out-of-sample mean squared error.
library(modelr)
roach_lm_no_5 <- lm(rate ~ temperature,
                    data = roaches_no_5)
mse(model = roach_lm_no_5,
    data = out_of_sample)
## [1] 1453.541
We can fit an intercept-only model and compare the MSE.
roach_int_no_5 <- lm(rate ~ 1,
                     data = roaches_no_5)
mse(model = roach_int_no_5,
    data = out_of_sample)
## [1] 1828.838
Much higher.
OK, let’s look at a more general solution. One problem in doing cross-validation is that making folds in the data can balloon the size of the data in memory. For that, there’s the {rsample} library, which plays beautifully with {dplyr} and {purrr} and uses some underlying tricks to avoid ballooning the data. Let’s look at an rsample object for 5-fold cross-validation. Note, the function for this is vfold_cv(), not kfold() - no, I’m not sure why.
library(rsample)
roach_five_fold <- vfold_cv(roaches, v = 5)
roach_five_fold
## # 5-fold cross-validation
## # A tibble: 5 x 2
## splits id
## <list> <chr>
## 1 <split [45/12]> Fold1
## 2 <split [45/12]> Fold2
## 3 <split [46/11]> Fold3
## 4 <split [46/11]> Fold4
## 5 <split [46/11]> Fold5
OK, neat - we have a tibble with a column of splits. We can use the function analysis() to get the training set for a split and assessment() to get the held-out set.
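As a quick sanity check - a minimal sketch of my own, assuming the first split is the [45/12] Fold1 shown above - we can confirm what each function returns:

# analysis() returns the rows used for fitting,
# assessment() the held-out rows
first_split <- roach_five_fold$splits[[1]]
nrow(analysis(first_split))   # 45 training rows
nrow(assessment(first_split)) # 12 held-out rows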
We can use purrr::map() to generate a new column of models fit on each fold.
library(dplyr)
library(purrr)
roach_five_fold <- roach_five_fold %>%
  mutate(mod = map(splits,
                   ~lm(rate ~ temperature,
                       data = analysis(.))))
roach_five_fold
## # 5-fold cross-validation
## # A tibble: 5 x 3
## splits id mod
## <list> <chr> <list>
## 1 <split [45/12]> Fold1 <lm>
## 2 <split [45/12]> Fold2 <lm>
## 3 <split [46/11]> Fold3 <lm>
## 4 <split [46/11]> Fold4 <lm>
## 5 <split [46/11]> Fold5 <lm>
Cool! We have a list of fit linear models! Now, how do we get the MSE? purrr has a nice function called map2(), which works just like map() except that, instead of iterating over one list or vector, it iterates over two, using .x and .y instead of just . in the function.
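As a toy illustration (my own example, not part of the roach analysis):

# .x walks the first vector, .y the second, in parallel
map2_dbl(1:3, 4:6, ~ .x + .y)
## [1] 5 7 9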
Let’s see how this works with our folds.
roach_five_fold <- roach_five_fold %>%
  mutate(mse = map2_dbl(mod, splits,
                        ~mse(model = .x, data = assessment(.y))))
roach_five_fold
## # 5-fold cross-validation
## # A tibble: 5 x 4
## splits id mod mse
## <list> <chr> <list> <dbl>
## 1 <split [45/12]> Fold1 <lm> 1138.
## 2 <split [45/12]> Fold2 <lm> 378.
## 3 <split [46/11]> Fold3 <lm> 211.
## 4 <split [46/11]> Fold4 <lm> 568.
## 5 <split [46/11]> Fold5 <lm> 523.
We can then get the average MSE by averaging that last column. We could do the same for an intercept-only model and compare.
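Here’s a minimal sketch of that comparison (my own code, reusing the workflow above):

# Average the fold-wise MSEs to get the 5-fold CV estimate
roach_five_fold %>%
  summarise(avg_mse = mean(mse))

# Refit an intercept-only model on each fold and compare averages
roach_five_fold %>%
  mutate(int_mod = map(splits, ~lm(rate ~ 1, data = analysis(.))),
         int_mse = map2_dbl(int_mod, splits,
                            ~mse(model = .x, data = assessment(.y)))) %>%
  summarise(avg_mse = mean(mse), avg_int_mse = mean(int_mse))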
We can do the same for leave-one-out using rsample::loo_cv() with pretty much the same workflow.
roach_loo <- roaches %>%
  loo_cv() %>%
  mutate(tempmod = map(splits,
                       ~lm(rate ~ temperature,
                           data = analysis(.))),
         intmod = map(splits,
                      ~lm(rate ~ 1,
                          data = analysis(.))))
What’s cool here is that we can use tidyr::pivot_longer() to stack the two model columns and then get the mean squared error for all of them at once.
roach_loo <- roach_loo %>%
  tidyr::pivot_longer(cols = c(tempmod, intmod),
                      names_to = "mod_name",
                      values_to = "mod") %>%
  mutate(mse = map2_dbl(mod, splits,
                        ~mse(model = .x, data = assessment(.y))))
Now we can look at the LOO error (plotted as root MSE) for each point.
ggplot(roach_loo,
       aes(x = id, y = sqrt(mse), color = mod_name)) +
  geom_point() +
  scale_x_discrete(labels = NULL)
OK, can’t see a pattern there, but…
roach_loo %>%
  group_by(mod_name) %>%
  summarise(mse = mean(mse))
## # A tibble: 2 x 2
## mod_name mse
## <chr> <dbl>
## 1 intmod 819.
## 2 tempmod 570.
This is great, but kind of annoying, as we have to do all of this work. It’s wonderful for arbitrary out-of-sample error functions. But… wouldn’t it be nice if we could do a lot of this in one line?
Enter the {boot} package, which also has some nice functions for bootstrapping. Its cv.glm() function works with glm() fits to calculate an out-of-sample prediction error (mean squared error, by default) for either a k-fold CV or LOO CV. The result has an element called delta whose first entry is the average out-of-sample error.
Let’s see it in action for a LOO analysis.
library(boot)
roach_glm <- glm(rate ~ temperature, data = roaches)
loo_roach <- cv.glm(roaches, roach_glm, K = nrow(roaches))
loo_roach$delta[1]
## [1] 569.5849
Hey, it’s the same as above… but less work! Note the K argument. We can lower it for a more traditional k-fold analysis. Now, granted, cv.glm() is a bit ungainly to use, but we can wrap it up in a function and make it easier to use.
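For example, here’s a minimal wrapper sketch (cv_error() is a name I’m inventing for illustration):

# Wrap cv.glm() so we get back just the average out-of-sample error
cv_error <- function(fit, data, k = 5) {
  cv.glm(data, fit, K = k)$delta[1]
}

# 5-fold CV error for the temperature model
cv_error(roach_glm, roaches, k = 5)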
Last, AIC! Any fitted model can give us an AIC score.
AIC(roach_lm)
## [1] 525.9157
Fantastic! This uses all of the data. And, hey, it’s not TOO far off of the LOOCV. Let’s compare it to the intercept-only model.
roach_int <- lm(rate ~ 1, data = roaches)
AIC(roach_int)
## [1] 546.08
Nice - we can see the difference. But wouldn’t it be nice if we got an easy-to-read table?
Enter {AICcmodavg}! Its function aictab() takes a list of models and a vector of model names, and gives you an easy-to-read output.
library(AICcmodavg)
aictab(list(roach_lm, roach_int),
       c("Temperature Model", "Intercept Only"))
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## Temperature Model 3 526.37 0.00 1 1 -259.96
## Intercept Only 2 546.30 19.93 0 1 -271.04
Note that by default it gives the AICc. You can, however, use second.ord = FALSE to get the AIC.
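For instance, a quick sketch with the same models:

# second.ord = FALSE switches the table from AICc to plain AIC
aictab(list(roach_lm, roach_int),
       c("Temperature Model", "Intercept Only"),
       second.ord = FALSE)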