Let’s start with the roach data, which looks at the activity of prothoracic ganglia under different temperature regimes. Here’s the data and a plot of a linear model fit.
library(ggplot2)
roaches <- read.csv("http://biol607.github.io/lab/data/chap17f5_4CockroachNeurons.csv")
roach_lm <- lm(rate ~ temperature, data = roaches)
roach_lm_plot <- ggplot(roaches,
                        aes(x = temperature, y = rate)) +
  geom_point() +
  stat_smooth(method = "lm", color = "blue")
roach_lm_plot
We can easily pluck out any random point and see how out-of-sample error works. For that, we use the {modelr} library and its mse() function to get the mean square error.
out_of_sample <- roaches[5,]
roaches_no_5 <- roaches[-5, ]
Now, let’s refit the model and use mse() to see the out-of-sample mean square error.
library(modelr)
roach_lm_no_5 <- lm(rate ~ temperature,
                    data = roaches_no_5)
mse(model = roach_lm_no_5,
    data = out_of_sample)
## [1] 1453.541
We can fit an intercept-only model and compare the MSE.
roach_int_no_5 <- lm(rate ~ 1,
                     data = roaches_no_5)
mse(model = roach_int_no_5,
    data = out_of_sample)
## [1] 1828.838
Much higher.
OK, let’s look at a more general solution here. One problem in doing cross-validation is that making folds in the data can balloon the size of the data in memory. For that, there’s the {rsample} library, which plays beautifully with {dplyr} and {purrr}. It uses some underlying tricks to avoid ballooning the data. Let’s look at an rsample object for 5-fold cross-validation. Note, the function for this is vfold_cv(), not kfold() - no, I’m not sure why.
library(rsample)
roach_five_fold <- vfold_cv(roaches, v = 5)
roach_five_fold
## # 5-fold cross-validation
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [45/12]> Fold1
## 2 <split [45/12]> Fold2
## 3 <split [46/11]> Fold3
## 4 <split [46/11]> Fold4
## 5 <split [46/11]> Fold5
OK, neat - we have a tibble with a column of splits. We can use the function analysis() to get the training set from a split and assessment() to get the held-out test set.
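For instance, a quick peek at the first split - just a sketch to show the accessors, with row counts matching the split sizes printed above:
first_split <- roach_five_fold$splits[[1]]
nrow(analysis(first_split))   # the training set for fold 1
nrow(assessment(first_split)) # the held-out set for fold 1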
We can use purrr::map() to generate a new column of models fit on each fold.
library(dplyr)
library(purrr)
roach_five_fold <- roach_five_fold %>%
  mutate(mod = map(splits,
                   ~lm(rate ~ temperature,
                       data = analysis(.))))
roach_five_fold
## # 5-fold cross-validation
## # A tibble: 5 × 3
## splits id mod
## <list> <chr> <list>
## 1 <split [45/12]> Fold1 <lm>
## 2 <split [45/12]> Fold2 <lm>
## 3 <split [46/11]> Fold3 <lm>
## 4 <split [46/11]> Fold4 <lm>
## 5 <split [46/11]> Fold5 <lm>
Cool! We have a list of fit linear models! Now, how do we get the MSE? purrr has a nice function called map2(), which works just like map() except that, instead of iterating over one list or vector, we iterate over two, and use .x and .y instead of just . in the function. Let’s see how this works.
roach_five_fold <- roach_five_fold %>%
  mutate(mse = map2_dbl(mod, splits,
                        ~mse(model = .x, data = assessment(.y))))
roach_five_fold
## # 5-fold cross-validation
## # A tibble: 5 × 4
## splits id mod mse
## <list> <chr> <list> <dbl>
## 1 <split [45/12]> Fold1 <lm> 539.
## 2 <split [45/12]> Fold2 <lm> 648.
## 3 <split [46/11]> Fold3 <lm> 293.
## 4 <split [46/11]> Fold4 <lm> 749.
## 5 <split [46/11]> Fold5 <lm> 778.
We can then get the average MSE by taking the mean of that last column. We could do the same for an intercept-only model and compare.
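Here’s a minimal sketch of that comparison, reusing the workflow above with an intercept-only model:
mean(roach_five_fold$mse) # average out-of-sample MSE across folds

# same workflow, but with an intercept-only model
roach_five_fold_int <- roach_five_fold %>%
  mutate(mod = map(splits,
                   ~lm(rate ~ 1, data = analysis(.))),
         mse = map2_dbl(mod, splits,
                        ~mse(model = .x, data = assessment(.y))))
mean(roach_five_fold_int$mse)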
We can do the same for leave-one-out using rsample::loo_cv() with pretty much the same workflow.
roach_loo <- roaches %>%
  loo_cv() %>%
  mutate(tempmod = map(splits,
                       ~lm(rate ~ temperature,
                           data = analysis(.))),
         intmod = map(splits,
                      ~lm(rate ~ 1,
                          data = analysis(.))))
What’s cool here is that we can use tidyr::pivot_longer() and then get the mean square error for all of it.
roach_loo <- roach_loo %>%
  tidyr::pivot_longer(cols = c(tempmod, intmod),
                      names_to = "mod_name",
                      values_to = "mod") %>%
  mutate(mse = map2_dbl(mod, splits,
                        ~mse(model = .x, data = assessment(.y))))
Now we can look at the LOO MSE for each point.
ggplot(roach_loo,
       aes(x = id, y = sqrt(mse), color = mod_name)) +
  geom_point() +
  scale_x_discrete(labels = NULL)
OK, can’t see a pattern there, but…
roach_loo %>%
  group_by(mod_name) %>%
  summarise(mse = mean(mse))
## # A tibble: 2 × 2
## mod_name mse
## <chr> <dbl>
## 1 intmod 819.
## 2 tempmod 570.
This is great. But it’s kind of annoying, as we have to do all of this work. It’s wonderful for arbitrary out-of-sample error functions. But… wouldn’t it be nice if we could do a lot of this in one line?
Enter the {boot} package, which also has some nice functions for bootstrapping. Its cv.glm() function works with glm() fits to extract deviances and calculate an out-of-sample deviance for either a k-fold or LOO CV. The result has an element called delta whose first entry is the average out-of-sample deviance.
Let’s see it in action for a LOO analysis.
library(boot)
roach_glm <- glm(rate ~ temperature, data = roaches)
loo_roach <- cv.glm(roaches, roach_glm, K = nrow(roaches))
loo_roach$delta[1]
## [1] 569.5849
Hey, it’s the same as above… but less work! Note the K argument. We can lower it to do a more traditional k-fold analysis. Now, granted, it’s a bit ungainly, but we can wrap it up in a function to make it easier to use.
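For example, here’s a minimal sketch of such a wrapper - cv_deviance is a hypothetical name, not part of {boot}:
# wrap boot::cv.glm() so we only have to remember one call
# (cv_deviance is a made-up helper name)
cv_deviance <- function(fit, data, k = nrow(data)) {
  cv.glm(data, fit, K = k)$delta[1]
}

cv_deviance(roach_glm, roaches)        # LOO, as above
cv_deviance(roach_glm, roaches, k = 5) # a 5-fold CV instead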
Last, AIC! Any fit model can give us an AIC score.
AIC(roach_lm)
## [1] 525.9157
Fantastic! This uses all of the data. And, hey, it’s not TOO far off from the LOO CV. Let’s compare it to the intercept-only model.
roach_int <- lm(rate ~ 1, data = roaches)
AIC(roach_int)
## [1] 546.08
Nice - we can see the difference. But wouldn’t it be nice if we got an easy-to-use table?
Enter {AICcmodavg}! Its function aictab() takes a list of models and a vector of model names, and gives you an easy-to-read output.
library(AICcmodavg)
aictab(list(roach_lm, roach_int),
       c("Temperature Model", "Intercept Only"))
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## Temperature Model 3 526.37 0.00 1 1 -259.96
## Intercept Only 2 546.30 19.93 0 1 -271.04
Note that by default it gives the AICc. You can, however, set second.ord = FALSE to get the AIC.
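For example, re-running the table above with AIC instead of AICc:
aictab(list(roach_lm, roach_int),
       c("Temperature Model", "Intercept Only"),
       second.ord = FALSE)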
This is somewhat boring (yes, temperature matters), so let’s make it more interesting!
roach_lm_2 <- lm(rate ~ poly(temperature,2), data = roaches)
roach_lm_3 <- lm(rate ~ poly(temperature,3), data = roaches)
roach_lm_4 <- lm(rate ~ poly(temperature,4), data = roaches)
modlist <- list(roach_lm, roach_int, roach_lm_2, roach_lm_3, roach_lm_4)
modnames <- 0:4
aictab(modlist, modnames)
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## 3 5 523.54 0.00 0.54 0.54 -256.18
## 4 6 525.06 1.52 0.25 0.80 -255.69
## 0 3 526.37 2.83 0.13 0.93 -259.96
## 2 4 527.66 4.12 0.07 1.00 -259.44
## 1 2 546.30 22.76 0.00 1.00 -271.04
We can get the model-averaged coefficients, as it were, with these models using modavg().
modavg(modlist, modnames, parm = "(Intercept)")
##
## Multimodel inference on "(Intercept)" based on AICc
##
## AICc table used to obtain model-averaged estimate:
##
## K AICc Delta_AICc AICcWt Estimate SE
## 0 3 526.37 2.83 0.13 133.95 8.69
## 1 2 546.30 22.76 0.00 92.48 3.76
## 2 4 527.66 4.12 0.07 92.48 3.12
## 3 5 523.54 0.00 0.54 92.48 2.98
## 4 6 525.06 1.52 0.25 92.48 2.98
##
## Model-averaged estimate: 97.96
## Unconditional SE: 14.67
## 95% Unconditional confidence interval: 69.21, 126.71
There’s a lot of controversy, however, as to whether this is a good idea. Rather, we should focus on model predictions! We can get these and their fit intervals using modavgPred().
mod_pred <- tibble(temperature = seq(11, 34, length.out = 200)) |>
  modavgPred(modlist, modnames, newdata = _) |>
  as_tibble()
mod_pred
## # A tibble: 200 × 7
## type mod.avg.pred uncond.se conf.level lower.CL upper.CL matrix…¹ [,"un…²
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 response 91.9 13.7 0.95 65.0 119. 91.9 13.7
## 2 response 93.0 12.9 0.95 67.8 118. 93.0 12.9
## 3 response 94.0 12.0 0.95 70.5 118. 94.0 12.0
## 4 response 95.0 11.2 0.95 73.1 117. 95.0 11.2
## 5 response 96.0 10.4 0.95 75.6 116. 96.0 10.4
## 6 response 96.9 9.69 0.95 77.9 116. 96.9 9.69
## 7 response 97.8 9.00 0.95 80.2 115. 97.8 9.00
## 8 response 98.7 8.35 0.95 82.3 115. 98.7 8.35
## 9 response 99.5 7.75 0.95 84.3 115. 99.5 7.75
## 10 response 100. 7.19 0.95 86.2 114. 100. 7.19
## # … with 190 more rows, 1 more variable: matrix.output[3:4] <dbl>, and
## # abbreviated variable names ¹​matrix.output[,"mod.avg.pred"], ²​[,"uncond.se"]
Prediction intervals are a bit trickier. They’re not provided by the function. Fortunately, uh, I figured this out a few years ago. And then forgot I figured it out. But past me had been smart and made https://rpubs.com/jebyrnes/aic_prediction to look at. It all comes down to getting prediction intervals from each model, and then calculating a weighted average.
pred_list <- modlist |>
  map(predict,
      interval = "prediction",
      newdata = tibble(temperature = seq(11, 34, length.out = 200))) |>
  map(as_tibble)
Every element of that list corresponds to a model. We can get the weighted value of each by multiplying it against that model’s AICc weight.
aicc_weights <- aictab(modlist, modnames, sort = FALSE)$AICcWt
weighted_pred_list <- pred_list |>
  map2(aicc_weights, `*`)
Great! Now we just need to sum up across the list. We can use purrr::reduce() for this, which iterates through a list and applies a function to the cumulative result.
reduce(1:5, sum) #for example
## [1] 15
pred_intervals <- reduce(weighted_pred_list, `+`) |>
  select(-fit)
Now we can bind this to our model predictions and plot!
mod_pred <- mod_pred |>
  mutate(temperature = seq(11, 34, length.out = 200)) |>
  rename(rate = mod.avg.pred) |>
  select(temperature, rate, lower.CL, upper.CL) |>
  bind_cols(pred_intervals)
ggplot(mod_pred,
       aes(x = temperature, y = rate)) +
  geom_ribbon(fill = "lightblue",
              aes(ymin = lwr, ymax = upr)) +
  geom_ribbon(fill = "darkgrey",
              aes(ymin = lower.CL, ymax = upper.CL)) +
  geom_line() +
  geom_point(data = roaches)
Load up the keeley data with data("keeley", package = "piecewiseSEM"). Fit the following five models.
data("keeley", package = "piecewiseSEM")
keeley_firecover <- lm(rich ~ firesev + cover, data = keeley)
keeley_agefiresev_int <- lm(rich ~ age * firesev, data = keeley)
keeley_agefirecover <- lm(rich ~ age + firesev + cover, data = keeley)
keeley_two_ints <- lm(rich ~ age * firesev + cover * firesev + age * cover, data = keeley)
keeley_three_ints <- lm(rich ~ age * firesev * cover, data = keeley)
Which model is the best? Would you use just that one, or model-averaged predictions? Plot the model-averaged predictions for a range of age, firesev, or cover, holding all other predictors at their mean.