For this lab, see the etherpad at https://etherpad.wikimedia.org/p/607-anova-2018
We’ll start today with the dataset 15e1KneesWhoSayNight.csv, from an experiment that tried to resolve jet lag by having people shine light on different parts of their bodies (eyes, knees, or neither) to shift their internal clocks.
library(tidyverse)
knees <- read_csv("./data/10/15e1KneesWhoSayNight.csv")
We can see the outcomes with ggplot2.
library(ggplot2)
ggplot(knees, mapping=aes(x=treatment, y=shift)) +
stat_summary(color="red", size=1.3) +
geom_point(alpha=0.7) +
theme_bw(base_size=17)
As the underlying model of ANOVA is a linear one, we fit ANOVAs using lm()
just as with linear regression.
knees_lm <- lm(shift ~ treatment, data=knees)
Now, there are two things to notice here. One, note that treatment is either a character or a factor. If it is not, then, because we are using lm(), it will be fit as a continuous predictor - just like a linear regression. So, beware!
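If you’re not sure, it’s worth a quick check before fitting. A minimal sketch using base R (read_csv gives us a character here, which is fine; the as.factor() conversion is just illustrative):
#check how treatment came in
str(knees$treatment)
#if it were numeric, we'd want to convert it explicitly
knees$treatment <- as.factor(knees$treatment)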
Two, there is an ANOVA-specific model fitting function - aov().
knees_aov <- aov(shift ~ treatment, data=knees)
It’s ok, I guess, and works with a few functions that lm() objects do not. But, in general, I find it too limiting. You can’t see coefficients, etc. Boooooring.
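For example, calling summary() on an aov object prints the classic ANOVA table rather than coefficients (base R behavior):
summary(knees_aov)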
Because this is an lm, we can check our assumptions as before - with one new one. First, some oldies but goodies.
#The whole par thing lets me make a multi-panel plot
par(mfrow=c(2,2))
plot(knees_lm, which=c(1,2,5))
par(mfrow=c(1,1))
Now, the residuals v. fitted plot lets us see how the residuals are distributed by treatment, but I often find it insufficient, as spacing on the x-axis can get odd. I could roll my own plot of residuals versus treatment, but, there’s a wonderful package called car - which is from the book Companion to Applied Regression by John Fox. I recommend it highly! It has a function in it called residualPlots() which is useful here.
library(car)
residualPlots(knees_lm)
Note how it does both fitted v. residuals and a boxplot by treatment. Handy, no?
OK, so, let’s see the ANOVA table! With the function….anova()!
anova(knees_lm)
## Analysis of Variance Table
##
## Response: shift
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 7.2245 3.6122 7.2894 0.004472 **
## Residuals 19 9.4153 0.4955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now….this is a type I sums of squares test. Which is fine for a 1-way ANOVA. If you want to start getting into the practice of using type II, car provides a function Anova() - note the capital A - which defaults to type II, and which I use instead. In fact, I use it all the time, as it handles a wide set of different models.
Anova(knees_lm)
## Anova Table (Type II tests)
##
## Response: shift
## Sum Sq Df F value Pr(>F)
## treatment 7.2245 2 7.2894 0.004472 **
## Residuals 9.4153 19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here it matters not a whit as you get the same table.
So, there are a lot of things we can do with a fit model.
summary(knees_lm)
##
## Call:
## lm(formula = shift ~ treatment, data = knees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27857 -0.36125 0.03857 0.61147 1.06571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.30875 0.24888 -1.241 0.22988
## treatmenteyes -1.24268 0.36433 -3.411 0.00293 **
## treatmentknee -0.02696 0.36433 -0.074 0.94178
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7039 on 19 degrees of freedom
## Multiple R-squared: 0.4342, Adjusted R-squared: 0.3746
## F-statistic: 7.289 on 2 and 19 DF, p-value: 0.004472
First, notice that we get the same information as a linear regression - including \(R^2\) and an overall model F-test. This is great. We also get coefficients, but, what do they mean?
Well, they are the treatment contrasts. Not super useful. R fits a model where the first treatment is the intercept, and then the other coefficients are deviations from that initial treatment. It’s efficient, but, hard to make sense of. To not get an intercept term, you need to refit the model without the intercept. You can fit a whole new model with -1 in the model formulation. Or, as I like to do to ensure I don’t frak anything up, you can update() your model. Just use . to signify what was there before.
knees_lm_no_int <- update(knees_lm, formula = . ~ . -1)
summary(knees_lm_no_int)
##
## Call:
## lm(formula = shift ~ treatment - 1, data = knees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27857 -0.36125 0.03857 0.61147 1.06571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## treatmentcontrol -0.3087 0.2489 -1.241 0.230
## treatmenteyes -1.5514 0.2661 -5.831 1.29e-05 ***
## treatmentknee -0.3357 0.2661 -1.262 0.222
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7039 on 19 degrees of freedom
## Multiple R-squared: 0.6615, Adjusted R-squared: 0.6081
## F-statistic: 12.38 on 3 and 19 DF, p-value: 0.0001021
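Notice that each coefficient is now just its group mean. A quick dplyr sanity check (dplyr is loaded with the tidyverse):
#group means should match the no-intercept coefficients
knees %>%
  group_by(treatment) %>%
  summarize(mean_shift = mean(shift))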
OK - that makes more sense. For a 1-way ANOVA, we can also see treatment means using the emmeans package - much more on that next week (and later below for contrasts).
library(emmeans)
library(multcompView)
emmeans(knees_lm, ~treatment)
## treatment emmean SE df lower.CL upper.CL
## control -0.3087500 0.2488836 19 -0.8296694 0.2121694
## eyes -1.5514286 0.2660678 19 -2.1083148 -0.9945423
## knee -0.3357143 0.2660678 19 -0.8926006 0.2211720
##
## Confidence level used: 0.95
I also like this because it outputs CIs.
We see means and whether they are different from 0. But….what about post-hoc tests?
If you have a priori contrasts, you can use the contrast package to test them. You give contrast() an a list and a b list, and we get all comparisons of a v. b, in order. It’s not great syntactically, but, it lets you do some pretty creative things.
contrast::contrast(knees_lm,
a = list(treatment = "control"),
b = list(treatment = "eyes"))
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## 1 1.242679 0.3643283 0.4801306 2.005227 3.41 19 0.0029
Meh. 9 times out of 10 we want to do something more like a Tukey Test. There is a TukeyHSD function that works on aov objects, but, if you’re doing anything with an lm, it borks on you. Instead, let’s use emmeans. It is wonderful, as it’s designed to work with ANOVA and ANCOVA models with complicated structures such that, for post-hocs, it adjusts to the mean or median level of all other factors. Very handy.
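For reference, the base-R route on the aov object from earlier would be the following (TukeyHSD() is in the stats package), but emmeans is far more flexible:
TukeyHSD(knees_aov)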
knees_em <- emmeans(knees_lm, ~treatment)
contrast(knees_em,
method = "tukey")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.24267857 0.3643283 19 3.411 0.0079
## control - knee 0.02696429 0.3643283 19 0.074 0.9970
## eyes - knee -1.21571429 0.3762767 19 -3.231 0.0117
##
## P value adjustment: tukey method for comparing a family of 3 estimates
We don’t need to worry about many of the fancier things that emmeans does for the moment - those will become more useful with other models. But, we can look at this test a few different ways. First, we can visualize it.
plot(contrast(knees_em,
method = "tukey")) +
geom_vline(xintercept = 0, color = "red", lty=2)
We can also, using our tukey method of adjustment, get “groups” - i.e., see which groups are statistically the same versus different.
library(multcompView)
cld(knees_em, adjust="tukey")
## treatment emmean SE df lower.CL upper.CL .group
## eyes -1.5514286 0.2660678 19 -2.2477700 -0.8550872 1
## knee -0.3357143 0.2660678 19 -1.0320557 0.3606271 2
## control -0.3087500 0.2488836 19 -0.9601178 0.3426178 2
##
## Confidence level used: 0.95
## Conf-level adjustment: sidak method for 3 estimates
## P value adjustment: tukey method for comparing a family of 3 estimates
## significance level used: alpha = 0.05
This can be very useful in plotting. For example, we can use that output as a data frame for a ggplot in a few different ways.
cld(knees_em, adjust="tukey") %>%
ggplot(aes(x = treatment, y = emmean,
ymin = lower.CL, ymax = upper.CL,
color = factor(.group))) +
geom_pointrange()
cld(knees_em, adjust="tukey") %>%
mutate(.group = letters[as.numeric(.group)]) %>%
ggplot(aes(x = treatment, y = emmean,
ymin = lower.CL, ymax = upper.CL)) +
geom_pointrange() +
geom_text(mapping = aes(label = .group), y = rep(1, 3)) +
ylim(c(-2.5, 2))
knees_expanded <- left_join(knees, cld(knees_em, adjust="tukey"))
ggplot(knees_expanded,
aes(x = treatment, y = shift, color = .group)) +
geom_point()
We can similarly use this to look at a Dunnett’s test, which compares against the control.
contrast(knees_em,
method = "dunnett")
## contrast estimate SE df t.ratio p.value
## eyes - control -1.24267857 0.3643283 19 -3.411 0.0057
## knee - control -0.02696429 0.3643283 19 -0.074 0.9927
##
## P value adjustment: dunnettx method for 2 tests
Note, if the “control” had not been the first treatment, you can either re-order the factor using forcats or just specify which of the levels is the control. For example, eyes is the second treatment. Let’s make it our new reference.
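The forcats route would look something like this sketch (fct_relevel() makes eyes the first level, so you’d then refit; knees_releveled and knees_lm_eyes are illustrative names):
#make eyes the reference level and refit
knees_releveled <- knees %>%
  mutate(treatment = fct_relevel(treatment, "eyes"))
knees_lm_eyes <- lm(shift ~ treatment, data = knees_releveled)
Here, though, let’s just use the ref argument: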
contrast(knees_em,
method = "dunnett", ref=2)
## contrast estimate SE df t.ratio p.value
## control - eyes 1.242679 0.3643283 19 3.411 0.0057
## knee - eyes 1.215714 0.3762767 19 3.231 0.0085
##
## P value adjustment: dunnettx method for 2 tests
You can even plot these results.
plot(contrast(knees_em,
method = "dunnett", ref=2)) +
geom_vline(xintercept = 0, color = "red", lty=2)
Let’s say you wanted to do all pairwise tests, but compare using a Bonferroni correction or FDR. Or none! No problem! There’s an adjust argument.
contrast(knees_em,
method = "tukey", adjust="bonferroni")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.24267857 0.3643283 19 3.411 0.0088
## control - knee 0.02696429 0.3643283 19 0.074 1.0000
## eyes - knee -1.21571429 0.3762767 19 -3.231 0.0132
##
## P value adjustment: bonferroni method for 3 tests
contrast(knees_em,
method = "tukey", adjust="fdr")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.24267857 0.3643283 19 3.411 0.0066
## control - knee 0.02696429 0.3643283 19 0.074 0.9418
## eyes - knee -1.21571429 0.3762767 19 -3.231 0.0066
##
## P value adjustment: fdr method for 3 tests
contrast(knees_em,
method = "tukey", adjust="none")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.24267857 0.3643283 19 3.411 0.0029
## control - knee 0.02696429 0.3643283 19 0.074 0.9418
## eyes - knee -1.21571429 0.3762767 19 -3.231 0.0044
Let’s try three ANOVAs! First - do landscape characteristics affect the number of generations plant species can exist before local extinction?
plants <- read.csv("./data/10/15q01PlantPopulationPersistence.csv")
#Visualize
qplot(treatment, generations, data=plants, geom="boxplot")
#fit
plant_lm <- lm(generations ~ treatment, data=plants)
#assumptions
plot(plant_lm, which=c(1,2,4,5))
#ANOVA
anova(plant_lm)
#Tukey's HSD
contrast(emmeans(plant_lm, ~treatment), method = "tukey")
Second, how do different host types affect nematode longevity?
worms <- read.csv("./data/10/15q19NematodeLifespan.csv")
#Visualize
qplot(treatment, lifespan, data=____, geom="____")
#fit
worm_lm <- lm(______ ~ ______, data=worms)
#assumptions
plot(______, which=c(1,2,4,5))
#ANOVA
anova(______)
#Tukey's HSD
contrast(emmeans(______, ~________), method = "tukey")
And last, how does the number of genotypes affect eelgrass productivity? Note, THERE IS A TRAP HERE. Look at your dataset before you do ANYTHING.
eelgrass <- read.csv("./data/10/15q05EelgrassGenotypes.csv")
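A quick structural look before anything else (str() is base R) might save you here:
str(eelgrass)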
#Visualize
________(treatment.genotypes, shoots, data=____, geom="____")
#fit
eelgrass_lm <- __(______ ~ ______, data=________)
#assumptions
________(______, which=c(1,2,4,5))
#ANOVA
________(______)
#Tukey's HSD
contrast(________(______, ~________), method = "________")
If you had done this using likelihood, you could have done all of this with a LR Chisq, also using Anova() from the car package.
knees_glm <- glm(shift ~ treatment, data=knees,
family=gaussian())
Anova(knees_glm)
## Analysis of Deviance Table (Type II tests)
##
## Response: shift
## LR Chisq Df Pr(>Chisq)
## treatment 14.579 2 0.0006827 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Further, emmeans works here as well.
knees_glm_em <- emmeans(knees_glm, ~treatment)
cld(knees_glm_em)
## treatment emmean SE df asymp.LCL asymp.UCL .group
## eyes -1.5514286 0.2660678 Inf -2.0729118 -1.0299453 1
## knee -0.3357143 0.2660678 Inf -0.8571976 0.1857690 2
## control -0.3087500 0.2488836 Inf -0.7965529 0.1790529 2
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 3 estimates
## significance level used: alpha = 0.05
And we can look at post-hocs as usual.
contrast(knees_glm_em, method = "dunnett")
## contrast estimate SE df z.ratio p.value
## eyes - control -1.24267857 0.3643283 Inf -3.411 0.0013
## knee - control -0.02696429 0.3643283 Inf -0.074 0.9927
##
## P value adjustment: dunnettx method for 2 tests
Yes, we can fit this using Bayesian methods as well. Here I’ll use default priors (flat on the means).
library(brms)
library(tidybayes)
knees_banova <- brm(shift ~ treatment,
data = knees,
family=gaussian(),
file = "knees_banova.rds")
We can actually examine this in a few different ways. First, there’s a default method for looking at ANOVA-like models in brms using marginal_effects().
marginal_effects(knees_banova)
Neat, right? We can also use emmeans as before.
knees_b_em <- emmeans(knees_banova, ~treatment)
knees_b_em
## treatment emmean lower.HPD upper.HPD
## control -0.3105277 -0.8300968 0.2090493
## eyes -1.5522395 -2.1497296 -0.9876234
## knee -0.3246139 -0.9303752 0.1956249
##
## HPD interval probability: 0.95
Note, now we’re looking at posterior intervals (and you define that interval!).
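For example, you can ask for an 80% interval instead (level is a standard argument to emmeans’s summary method):
summary(knees_b_em, level = 0.8)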
An analogue to F-tests, although philosophically different, is to look at how the finite population variance - i.e., the variation of just your dataset rather than of the entire super-population - is explained by different elements of the model. We’ll explore this more next week, but, for now, let’s get a feel for it. We want to visualize the variability in the model due to each source - i.e., treatment and residual in this case. Then, let’s look at the posterior of this variance component, both in terms of raw numbers and percentages.
emmeans helps us out here again, as it provides ready access to the treatment levels.
#for each posterior draw, get the SD among the three treatment means
sd_treatment <- gather_emmeans_draws(knees_b_em) %>%
  group_by(.draw) %>%
  summarize(sd_treatment = sd(.value))
Extracting residuals is a bit trickier, as tidybayes does not yet have a nice residual extractor. I wrote some code for that - https://gist.github.com/jebyrnes/c28d1f5523be392e4666da2f06110c10 - and submitted an issue, but it might be a while until they get to it. For now, we can use the residuals function with summary=FALSE, which returns a giant matrix with one row per posterior draw and one column per data point. We can manipulate the matrix a bit and summarise it to get the SD of the residuals for each draw of the coefficients. Then, we can make a nice big tibble for plotting. I admit, this is a bit of a PITA, but, it will also help you get into the guts of BANOVA.
sd_residuals <- residuals(knees_banova, summary=FALSE) %>%
  t() %>% #transpose so that each column is one posterior draw
  as.data.frame() %>%
  summarise_all(sd) %>% #SD of the residuals within each draw
  as.numeric
#4000 = the default number of posterior samples in brms (4 chains x 1000 post-warmup)
sd_groups <- tibble(type = c(rep("treatment", 4000),
                             rep("residual", 4000)),
                    value = c(sd_treatment$sd_treatment, sd_residuals))
Now we can plot this and look at it in a variety of ways.
ggplot(sd_groups,
aes(x = value, y = type)) +
geom_halfeyeh()
We can also make a table using a tidier from broom. broom has a wonderful function for summarizing MCMC results called tidyMCMC - but we need each posterior to have its own column, so we can make a new tibble.
sd_bycol <- tibble(treatment_sd = sd_treatment$sd_treatment,
residuals_sd = sd_residuals)
broom::tidyMCMC(sd_bycol, conf.int = TRUE, conf.method = "HPDinterval")
## # A tibble: 2 x 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 treatment_sd 0.738 0.197 0.350 1.13
## 2 residuals_sd 0.709 0.0413 0.670 0.793
Last, we can also look at this as the % of variance explained, by transforming sd_bycol to percentages.
sd_percent_bycol <- sd_bycol/rowSums(sd_bycol) * 100
broom::tidyMCMC(sd_percent_bycol, estimate.method = "median",
conf.int = TRUE, conf.method = "HPDinterval")
## # A tibble: 2 x 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 treatment_sd 51.7 7.32 35.7 60.9
## 2 residuals_sd 48.3 7.32 39.1 64.3
We can use emmeans for contrasts, too. Here’s a Tukey test.
contrast(knees_b_em, method = "tukey")
## contrast estimate lower.HPD upper.HPD
## control - eyes 1.23715115 0.3824904 1.9957889
## control - knee 0.01778766 -0.7885430 0.7810583
## eyes - knee -1.22103969 -2.0206449 -0.4301801
##
## HPD interval probability: 0.95
We can visualize this using tidybayes::gather_emmeans_draws to see the results of the contrast.
contrast(knees_b_em, method = "tukey") %>%
gather_emmeans_draws() %>%
ggplot(aes(x = .value, y = contrast)) +
geom_halfeyeh() +
geom_vline(xintercept = 0, color = "red", lty = 2)
Or maybe something that matches the earlier visualizations.
contrast(knees_b_em, method = "tukey") %>%
gather_emmeans_draws() %>%
ggplot(aes(y = .value, x = contrast)) +
geom_jitter(alpha = 0.05)+
geom_hline(yintercept = 0, color = "red", lty = 2)
As you can see, gather_emmeans_draws lets us do a lot with categorical variables in a BANOVA context very easily. We can even use it to generate some interesting and useful visualizations of the means themselves with some additional geoms.
gather_emmeans_draws(knees_b_em) %>%
ggplot(aes(x = treatment, y = .value)) +
stat_lineribbon(alpha = 0.25, fill = "gray25") +
stat_pointinterval()
gather_emmeans_draws(knees_b_em) %>%
ggplot(aes(x = treatment, y = .value)) +
geom_line(aes(group = .draw), alpha = 0.01) +
stat_pointinterval(color = "darkred")
We can even add this back to the original data.
ggplot(knees,
aes(x = treatment, y = shift)) +
geom_point() +
stat_pointinterval(data = gather_emmeans_draws(knees_b_em),
mapping = aes(y = .value), color = "red", alpha = 0.4)
Take two of the examples from the faded examples section. Execute one analysis using likelihood. Execute the other using Bayesian methods. Be sure to do all of the usual checks!
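As a starting skeleton, here’s what the two routes could look like if you chose the plant data we already worked through above (swap in whichever dataset you pick; the object names are illustrative):
#likelihood: refit as a glm and use a LR chisq via car
plant_glm <- glm(generations ~ treatment, data = plants, family = gaussian())
Anova(plant_glm)
contrast(emmeans(plant_glm, ~treatment), method = "tukey")
#Bayes: the same model with brms default priors
plant_banova <- brm(generations ~ treatment, data = plants, family = gaussian())
marginal_effects(plant_banova)
contrast(emmeans(plant_banova, ~treatment), method = "tukey")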