For this lab, see the etherpad at https://etherpad.wikimedia.org/p/607-anova-2018
We’ll start today with the dataset 15e1KneesWhoSayNight.csv, from an experiment that tried to resolve jet lag by having people shine light on different parts of their bodies (eyes, knees, or neither) to shift their internal clocks.
library(tidyverse)
knees <- read_csv("./data/10/15e1KneesWhoSayNight.csv")
We can see the outcomes with ggplot2.
library(ggplot2)
ggplot(knees, mapping=aes(x=treatment, y=shift)) +
stat_summary(color="red", size=1.3) +
geom_point(alpha=0.7) +
theme_bw(base_size=17)
As the underlying model of ANOVA is a linear one, we fit ANOVAs using lm()
just as with linear regression.
knees_lm <- lm(shift ~ treatment, data=knees)
Now, there are two things to notice here. One, note that treatment is either a character or a factor. If it is not, then, because we are using lm(), it will be fit as a continuous predictor - just like a linear regression. So, beware!
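If you’re not sure, it’s worth a quick check before fitting. A minimal sketch using base R (read_csv gives us a character here, which is fine; the as.factor() conversion is just illustrative):
#check how treatment came in
str(knees$treatment)
#if it were numeric, we'd want to convert it explicitly
knees$treatment <- as.factor(knees$treatment)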
Two, there is an ANOVA-specific model fitting function - aov().
knees_aov <- aov(shift ~ treatment, data=knees)
It’s ok, I guess, and works with a few functions that lm() objects do not. But, in general, I find it too limiting. You can’t see coefficients, etc. Boooooring.
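For example, calling summary() on an aov object prints the classic ANOVA table rather than coefficients (base R behavior):
summary(knees_aov)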
Because this is an lm, we can check our assumptions as before - with one new one. First, some oldies but goodies.
#The whole par thing lets me make a multi-panel plot
par(mfrow=c(2,2))
plot(knees_lm, which=c(1,2,5))
par(mfrow=c(1,1))
Now, the residuals v. fitted plot lets us see how the residuals are distributed by treatment, but I often find it insufficient, as spacing on the x-axis can get odd. I could roll my own plot of residuals versus treatment, but, there’s a wonderful package called car - which is from the book Companion to Applied Regression by John Fox. I recommend it highly! It has a function in it called residualPlots() which is useful here.
library(car)
residualPlots(knees_lm)
Note how it does both fitted v. residuals and a boxplot by treatment. Handy, no?
OK, so, let’s see the ANOVA table! With the function….anova()!
anova(knees_lm)
## Analysis of Variance Table
##
## Response: shift
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 7.2245 3.6122 7.2894 0.004472 **
## Residuals 19 9.4153 0.4955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now….this is a type I sums of squares test. Which is fine for a 1-way ANOVA. If you want to start getting into the practice of using type II, car provides a function Anova() - note the capital A - which defaults to type II, and which I use instead. In fact, I use it all the time, as it handles a wide set of different models.
Anova(knees_lm)
## Anova Table (Type II tests)
##
## Response: shift
## Sum Sq Df F value Pr(>F)
## treatment 7.2245 2 7.2894 0.004472 **
## Residuals 9.4153 19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here it matters not a whit as you get the same table.
So, there are a lot of things we can do with a fit model.
summary(knees_lm)
##
## Call:
## lm(formula = shift ~ treatment, data = knees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27857 -0.36125 0.03857 0.61147 1.06571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.30875 0.24888 -1.241 0.22988
## treatmenteyes -1.24268 0.36433 -3.411 0.00293 **
## treatmentknee -0.02696 0.36433 -0.074 0.94178
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7039 on 19 degrees of freedom
## Multiple R-squared: 0.4342, Adjusted R-squared: 0.3746
## F-statistic: 7.289 on 2 and 19 DF, p-value: 0.004472
First, notice that we get the same information as a linear regression - including \(R^2\) and an overall model F-test. This is great. We also get coefficients, but, what do they mean?
Well, they are the treatment contrasts. Not super useful. R fits a model where the first treatment is the intercept, and then the other coefficients are deviations from that initial treatment. It’s efficient, but, hard to make sense of. To not get an intercept term, you need to refit the model without the intercept. You can fit a whole new model with -1 in the model formulation. Or, as I like to do to ensure I don’t frak anything up, you can update() your model. Just use . to signify what was there before.
knees_lm_no_int <- update(knees_lm, formula = . ~ . -1)
summary(knees_lm_no_int)
##
## Call:
## lm(formula = shift ~ treatment - 1, data = knees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27857 -0.36125 0.03857 0.61147 1.06571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## treatmentcontrol -0.3087 0.2489 -1.241 0.230
## treatmenteyes -1.5514 0.2661 -5.831 1.29e-05 ***
## treatmentknee -0.3357 0.2661 -1.262 0.222
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7039 on 19 degrees of freedom
## Multiple R-squared: 0.6615, Adjusted R-squared: 0.6081
## F-statistic: 12.38 on 3 and 19 DF, p-value: 0.0001021
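Notice that each coefficient is now just its group mean. A quick dplyr sanity check (dplyr is loaded with the tidyverse):
#group means should match the no-intercept coefficients
knees %>%
  group_by(treatment) %>%
  summarize(mean_shift = mean(shift))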
OK - that makes more sense. For a 1-way ANOVA, we can also see treatment means using the emmeans package - much more on that next week (and later below for contrasts).
library(emmeans)
library(multcompView)
emmeans(knees_lm, ~treatment)
## treatment emmean SE df lower.CL upper.CL
## control -0.3087500 0.2488836 19 -0.8296694 0.2121694
## eyes -1.5514286 0.2660678 19 -2.1083148 -0.9945423
## knee -0.3357143 0.2660678 19 -0.8926006 0.2211720
##
## Confidence level used: 0.95
I also like this because it outputs CIs.
We see means and whether they are different from 0. But….what about post-hoc tests?
If you have a priori contrasts, you can use the contrast package to test them. You give contrast() an a list and a b list, and we get all comparisons of a v. b, in order. It’s not great syntactically, but, it lets you do some pretty creative things.
contrast::contrast(knees_lm,
a = list(treatment = "control"),
b = list(treatment = "eyes"))
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## 1 1.242679 0.3643283 0.4801306 2.005227 3.41 19 0.0029
Meh. 9 times out of 10 we want to do something more like a Tukey Test. There is a TukeyHSD function that works on aov objects, but, if you’re doing anything with an lm, it borks on you. Instead, let’s use emmeans. It is wonderful, as it’s designed to work with ANOVA and ANCOVA models with complicated structures such that, for post-hocs, it adjusts to the mean or median level of all other factors. Very handy.
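For reference, the base-R route on the aov object from earlier would be the following (TukeyHSD() is in the stats package), but emmeans is far more flexible:
TukeyHSD(knees_aov)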
knees_em <- emmeans(knees_lm, ~treatment)
contrast(knees_em,
method = "tukey")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.24267857 0.3643283 19 3.411 0.0079
## control - knee 0.02696429 0.3643283 19 0.074 0.9970
## eyes - knee -1.21571429 0.3762767 19 -3.231 0.0117
##
## P value adjustment: tukey method for comparing a family of 3 estimates
We don’t need to worry about many of the fancier things that emmeans does for the moment - those will become more useful with other models. But, we can look at this test a few different ways. First, we can visualize it.
plot(contrast(knees_em,
method = "tukey")) +
geom_vline(xintercept = 0, color = "red", lty=2)
We can also, using our tukey method of adjustment, get “groups” - i.e., see which groups are statistically the same versus different.
library(multcompView)
cld(knees_em, adjust="tukey")
## treatment emmean SE df lower.CL upper.CL .group
## eyes -1.5514286 0.2660678 19 -2.2477700 -0.8550872 1
## knee -0.3357143 0.2660678 19 -1.0320557 0.3606271 2
## control -0.3087500 0.2488836 19 -0.9601178 0.3426178 2
##
## Confidence level used: 0.95
## Conf-level adjustment: sidak method for 3 estimates
## P value adjustment: tukey method for comparing a family of 3 estimates
## significance level used: alpha = 0.05
This can be very useful in plotting. For example, we can use that output as a data frame for a ggplot in a few different ways.
cld(knees_em, adjust="tukey") %>%
ggplot(aes(x = treatment, y = emmean,
ymin = lower.CL, ymax = upper.CL,
color = factor(.group))) +
geom_pointrange()
cld(knees_em, adjust="tukey") %>%
mutate(.group = letters[as.numeric(.group)]) %>%
ggplot(aes(x = treatment, y = emmean,
ymin = lower.CL, ymax = upper.CL)) +
geom_pointrange() +
geom_text(mapping = aes(label = .group), y = rep(1, 3)) +
ylim(c(-2.5, 2))
knees_expanded <- left_join(knees, cld(knees_em, adjust="tukey"))
ggplot(knees_expanded,
aes(x = treatment, y = shift, color = .group)) +
geom_point()
We can similarly use this to look at a Dunnett’s test, which compares against the control.
contrast(knees_em,
method = "dunnett")
## contrast estimate SE df t.ratio p.value
## eyes - control -1.24267857 0.3643283 19 -3.411 0.0057
## knee - control -0.02696429 0.3643283 19 -0.074 0.9927
##
## P value adjustment: dunnettx method for 2 tests
Note, if the “control” had not been the first treatment, you can either re-order the factor using forcats or just specify which of the levels is the control. For example, eyes is the second treatment. Let’s make it our new reference.
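The forcats route would look something like this sketch (fct_relevel() makes eyes the first level, so you’d then refit; knees_releveled and knees_lm_eyes are illustrative names):
#make eyes the reference level and refit
knees_releveled <- knees %>%
  mutate(treatment = fct_relevel(treatment, "eyes"))
knees_lm_eyes <- lm(shift ~ treatment, data = knees_releveled)
Here, though, let’s just use the ref argument: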
contrast(knees_em,
method = "dunnett", ref=2)
## contrast estimate SE df t.ratio p.value
## control - eyes 1.242679 0.3643283 19 3.411 0.0057
## knee - eyes 1.215714 0.3762767 19 3.231 0.0085
##
## P value adjustment: dunnettx method for 2 tests
You can even plot these results.
plot(contrast(knees_em,
method = "dunnett", ref=2)) +
geom_vline(xintercept = 0, color = "red", lty=2)
Let’s say you wanted to do all pairwise tests, but compare using a Bonferroni correction or FDR. Or none! No problem! There’s an adjust argument.
contrast(knees_em,
method = "tukey", adjust="bonferroni")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.24267857 0.3643283 19 3.411 0.0088
## control - knee 0.02696429 0.3643283 19 0.074 1.0000
## eyes - knee -1.21571429 0.3762767 19 -3.231 0.0132
##
## P value adjustment: bonferroni method for 3 tests
contrast(knees_em,
method = "tukey", adjust="fdr")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.24267857 0.3643283 19 3.411 0.0066
## control - knee 0.02696429 0.3643283 19 0.074 0.9418
## eyes - knee -1.21571429 0.3762767 19 -3.231 0.0066
##
## P value adjustment: fdr method for 3 tests
contrast(knees_em,
method = "tukey", adjust="none")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.24267857 0.3643283 19 3.411 0.0029
## control - knee 0.02696429 0.3643283 19 0.074 0.9418
## eyes - knee -1.21571429 0.3762767 19 -3.231 0.0044
Let’s try three ANOVAs! First - do landscape characteristics affect the number of generations plant species can exist before local extinction?
plants <- read.csv("./data/10/15q01PlantPopulationPersistence.csv")
#Visualize
qplot(treatment, generations, data=plants, geom="boxplot")
#fit
plant_lm <- lm(generations ~ treatment, data=plants)
#assumptions
plot(plant_lm, which=c(1,2,4,5))
#ANOVA
anova(plant_lm)
#Tukey's HSD
contrast(emmeans(plant_lm, ~treatment), method = "tukey")
Second, how do different host types affect nematode longevity?
worms <- read.csv("./data/10/15q19NematodeLifespan.csv")
#Visualize
qplot(treatment, lifespan, data=____, geom="____")
#fit
worm_lm <- lm(______ ~ ______, data=worms)
#assumptions
plot(______, which=c(1,2,4,5))
#ANOVA
anova(______)
#Tukey's HSD
contrast(emmeans(______, ~________), method = "tukey")
And last, how does the number of genotypes affect eelgrass productivity? Note, THERE IS A TRAP HERE. Look at your dataset before you do ANYTHING.
eelgrass <- read.csv("./data/10/15q05EelgrassGenotypes.csv")
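A quick structural look before anything else (str() is base R) might save you here:
str(eelgrass)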
#Visualize
________(treatment.genotypes, shoots, data=____, geom="____")
#fit
eelgrass_lm <- __(______ ~ ______, data=________)
#assumptions
________(______, which=c(1,2,4,5))
#ANOVA
________(______)
#Tukey's HSD
contrast(________(______, ~________), method = "________")
If you had done this using likelihood, you could have done all of this with a LR Chisq, also using Anova() from the car package.
knees_glm <- glm(shift ~ treatment, data=knees,
family=gaussian())
Anova(knees_glm)
## Analysis of Deviance Table (Type II tests)
##
## Response: shift
## LR Chisq Df Pr(>Chisq)
## treatment 14.579 2 0.0006827 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Further, emmeans works here as well.
knees_glm_em <- emmeans(knees_glm, ~treatment)
cld(knees_glm_em)
## treatment emmean SE df asymp.LCL asymp.UCL .group
## eyes -1.5514286 0.2660678 Inf -2.0729118 -1.0299453 1
## knee -0.3357143 0.2660678 Inf -0.8571976 0.1857690 2
## control -0.3087500 0.2488836 Inf -0.7965529 0.1790529 2
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 3 estimates
## significance level used: alpha = 0.05
And we can look at post-hocs as usual.
contrast(knees_glm_em, method = "dunnett")
## contrast estimate SE df z.ratio p.value
## eyes - control -1.24267857 0.3643283 Inf -3.411 0.0013
## knee - control -0.02696429 0.3643283 Inf -0.074 0.9927
##
## P value adjustment: dunnettx method for 2 tests
Yes, we can fit this using Bayesian methods as well. Here I’ll use default priors (flat on the means).
library(brms)
library(tidybayes)
knees_banova <- brm(shift ~ treatment,
data = knees,
family=gaussian(),
file = "knees_banova.rds")
We can actually examine this in a few different ways. First, there’s a default method for looking at ANOVA-like models in brms using marginal_effects().
marginal_effects(knees_banova)
Neat, right? We can also use emmeans as before.
knees_b_em <- emmeans(knees_banova, ~treatment)
knees_b_em
## treatment emmean lower.HPD upper.HPD
## control -0.3105277 -0.8300968 0.2090493
## eyes -1.5522395 -2.1497296 -0.9876234
## knee -0.3246139 -0.9303752 0.1956249
##
## HPD interval probability: 0.95
Note, now we’re looking at posterior intervals (and you define that interval!).
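For example, you can ask for an 80% interval instead (level is a standard argument to emmeans’s summary method):
summary(knees_b_em, level = 0.8)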
An analogue to F-tests, although philosophically different, is to look at how the finite population variance - i.e., the variation of just your dataset rather than of the entire super-population - is explained by different elements of the model. We’ll explore this more next week, but, for now, let’s get a feel for it. We want to visualize the variability in the model due to each source - i.e., treatment and residual in this case. Then, let’s look at the posterior of this variance component, both in terms of raw numbers and percentages.
emmeans helps us out here again, as it provides ready access to the treatment levels.
#for each posterior draw, get the SD among the three treatment means
sd_treatment <- gather_emmeans_draws(knees_b_em) %>%
  group_by(.draw) %>%
  summarize(sd_treatment = sd(.value))
Extracting residuals is a bit trickier, as tidybayes does not yet have a nice residual extractor. I wrote some code for that - https://gist.github.com/jebyrnes/c28d1f5523be392e4666da2f06110c10 - and submitted an issue, but it might be a while until they get to it. For now, we can use the residuals function with summary=FALSE, which returns a giant matrix with one row per posterior draw and one column per data point. We can manipulate the matrix a bit and summarise it to get the SD of the residuals for each draw of the coefficients. Then, we can make a nice big tibble for plotting. I admit, this is a bit of a PITA, but, it will also help you get into the guts of BANOVA.
sd_residuals <- residuals(knees_banova, summary=FALSE) %>%
  t() %>% #transpose so that each column is one posterior draw
  as.data.frame() %>%
  summarise_all(sd) %>% #SD of the residuals within each draw
  as.numeric
#4000 = the default number of posterior samples in brms (4 chains x 1000 post-warmup)
sd_groups <- tibble(type = c(rep("treatment", 4000),
                             rep("residual", 4000)),
                    value = c(sd_treatment$sd_treatment, sd_residuals))
Now we can plot this and look at it in a variety of ways.
ggplot(sd_groups,
aes(x = value, y = type)) +
geom_halfeyeh()
We can also make a table using a tidier from broom. broom has a wonderful function for summarizing MCMC results called tidyMCMC - but we need each posterior to have its own column, so we can make a new tibble.
sd_bycol <- tibble(treatment_sd = sd_treatment$sd_treatment,
residuals_sd = sd_residuals)
broom::tidyMCMC(sd_bycol, conf.int = TRUE, conf.method = "HPDinterval")
## # A tibble: 2 x 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 treatment_sd 0.738 0.197 0.350 1.13
## 2 residuals_sd 0.709 0.0413 0.670 0.793
Last, we can also look at this as the % of variance explained, by transforming sd_bycol to percentages.
sd_percent_bycol <- sd_bycol/rowSums(sd_bycol) * 100
broom::tidyMCMC(sd_percent_bycol, estimate.method = "median",
conf.int = TRUE, conf.method = "HPDinterval")
## # A tibble: 2 x 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 treatment_sd 51.7 7.32 35.7 60.9
## 2 residuals_sd 48.3 7.32 39.1 64.3
We can use emmeans for contrasts, too. Here’s a Tukey test.
contrast(knees_b_em, method = "tukey")
## contrast estimate lower.HPD upper.HPD
## control - eyes 1.23715115 0.3824904 1.9957889
## control - knee 0.01778766 -0.7885430 0.7810583
## eyes - knee -1.22103969 -2.0206449 -0.4301801
##
## HPD interval probability: 0.95
We can visualize this using tidybayes::gather_emmeans_draws to see the results of the contrast.
contrast(knees_b_em, method = "tukey") %>%
gather_emmeans_draws() %>%
ggplot(aes(x = .value, y = contrast)) +
geom_halfeyeh() +
geom_vline(xintercept = 0, color = "red", lty = 2)
Or maybe something that matches the earlier visualizations.
contrast(knees_b_em, method = "tukey") %>%
gather_emmeans_draws() %>%
ggplot(aes(y = .value, x = contrast)) +
geom_jitter(alpha = 0.05)+
geom_hline(yintercept = 0, color = "red", lty = 2)
As you can see, gather_emmeans_draws lets us do a lot with categorical variables in a BANOVA context very easily. We can even use it to generate some interesting and useful visualizations of the means themselves with some additional geoms.
gather_emmeans_draws(knees_b_em) %>%
ggplot(aes(x = treatment, y = .value)) +
stat_lineribbon(alpha = 0.25, fill = "gray25") +
stat_pointinterval()
gather_emmeans_draws(knees_b_em) %>%
ggplot(aes(x = treatment, y = .value)) +
geom_line(aes(group = .draw), alpha = 0.01) +
stat_pointinterval(color = "darkred")
We can even add this back to the original data.
ggplot(knees,
aes(x = treatment, y = shift)) +
geom_point() +
stat_pointinterval(data = gather_emmeans_draws(knees_b_em),
mapping = aes(y = .value), color = "red", alpha = 0.4)
Take two of the examples from the faded examples section. Execute one analysis using likelihood. Execute the other using Bayesian methods. Be sure to do all of the usual checks!
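As a starting skeleton, here’s what the two routes could look like if you chose the plant data we already worked through above (swap in whichever dataset you pick; the object names are illustrative):
#likelihood: refit as a glm and use a LR chisq via car
plant_glm <- glm(generations ~ treatment, data = plants, family = gaussian())
Anova(plant_glm)
contrast(emmeans(plant_glm, ~treatment), method = "tukey")
#Bayes: the same model with brms default priors
plant_banova <- brm(generations ~ treatment, data = plants, family = gaussian())
marginal_effects(plant_banova)
contrast(emmeans(plant_banova, ~treatment), method = "tukey")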