For this lab, see the etherpad at https://etherpad.wikimedia.org/p/607-anova-2018
Let’s start with some libraries from last time!
library(car)
library(tidyverse)
library(ggplot2)
library(emmeans)
We’ll work with the zooplankton depredation dataset for two-way ANOVA. This is a blocked experiment, so, each treatment is in each block just once.
zooplankton <- read.csv("./data/18e2ZooplanktonDepredation.csv")
qplot(treatment, zooplankton, data=zooplankton, geom="boxplot")
qplot(block, zooplankton, data=zooplankton, geom="boxplot")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
Oh. That’s odd. What is up with block? AH HA! It’s continuous. We need to make it discrete to work with it.
zooplankton$block <- factor(zooplankton$block)
qplot(block, zooplankton, data=zooplankton, geom="boxplot")
There we go. Always check!
Fit is quite easy. We just add one more factor to an lm model!
zooplankton_lm <- lm(zooplankton ~ treatment + block,
data = zooplankton)
We then evaluate residuals almost as usual…
par(mfrow=c(2,2))
plot(zooplankton_lm, which=c(1,2,5))
par(mfrow=c(1,1))
We want to look more deeply by treatment and block. For which we use car
’s residualPlots()
residualPlots(zooplankton_lm)
## Test stat Pr(>|Test stat|)
## treatment
## block
## Tukey test 0.4742 0.6354
Notice that this pops out a Tukey test, and we are looking…GOOD!
Given that we now have multiple factors, in case of unbalance, we should use type II sums of squares.
Anova(zooplankton_lm)
## Anova Table (Type II tests)
##
## Response: zooplankton
## Sum Sq Df F value Pr(>F)
## treatment 6.8573 2 16.3660 0.001488 **
## block 2.3400 4 2.7924 0.101031
## Residuals 1.6760 8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we want to look at coefficients, we have to make means contrasts.
summary(update(zooplankton_lm, . ~ . -1))
##
## Call:
## lm(formula = zooplankton ~ treatment + block - 1, data = zooplankton)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62 -0.20 -0.08 0.22 0.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## treatmentcontrol 3.420e+00 3.127e-01 10.938 4.33e-06 ***
## treatmenthigh 1.780e+00 3.127e-01 5.693 0.000458 ***
## treatmentlow 2.400e+00 3.127e-01 7.676 5.88e-05 ***
## block2 7.166e-16 3.737e-01 0.000 1.000000
## block3 -7.000e-01 3.737e-01 -1.873 0.097945 .
## block4 -1.000e+00 3.737e-01 -2.676 0.028108 *
## block5 -3.000e-01 3.737e-01 -0.803 0.445316
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4577 on 8 degrees of freedom
## Multiple R-squared: 0.9788, Adjusted R-squared: 0.9603
## F-statistic: 52.82 on 7 and 8 DF, p-value: 4.522e-06
This is still a little odd, though, as our treatments are evaluated in block 1. To truly get just the treatment effect, we need to look at the estimated marginal means - the emmeans
! The big thing about emmeans
is that it creates a reference grid based on the blocks. It then calculates the treatment effect averaged over all blocks, rather than just in block 1.
zoop_em <- emmeans(zooplankton_lm, ~treatment)
zoop_em
## treatment emmean SE df lower.CL upper.CL
## control 3.02 0.2046949 8 2.5479727 3.492027
## high 1.38 0.2046949 8 0.9079727 1.852027
## low 2.00 0.2046949 8 1.5279727 2.472027
##
## Results are averaged over the levels of: block
## Confidence level used: 0.95
Here, emmeans
gets interesting.
contrast(zoop_em, method = "tukey")
## contrast estimate SE df t.ratio p.value
## control - high 1.64 0.2894823 8 5.665 0.0012
## control - low 1.02 0.2894823 8 3.524 0.0190
## high - low -0.62 0.2894823 8 -2.142 0.1424
##
## Results are averaged over the levels of: block
## P value adjustment: tukey method for comparing a family of 3 estimates
Note the message that we’ve averaged over the levels of block. You can do any sort of posthoc as you wanted before. And, you could have done the same workflow for block as well.
Given then similarity with 1-way ANOVA, let’s just jump right into two examples, noting a key difference or two here and there.
To start with, let’s look at gene expression by different types of bees.
bees <- read.csv("./data/18q07BeeGeneExpression.csv")
#Visualize
________(type, Expression, data=____, geom="____")
#fit
bee_lm <- __(______ ~ ______ + _____, data=________)
#assumptions
________(______, which=c(1,2,4,5))
residualPlots(bee_lm)
#ANOVA
________(______)
#Tukey's HSD
contrast(________(______, ____, method = "________")
Wow, not that different, save adding one more term and the residualPlots.
OK, one more…. repeating an experiment in the intertidal?
intertidal <- read.csv("./data/18e3IntertidalAlgae.csv")
#Visualize
________(herbivores, sqrtarea, data=____, geom="____")
#fit
intertidal_lm <- __(______ ~ ______ + _____, data=________)
#assumptions
________(______, which=c(1,2,4,5))
residualPlots(______)
#ANOVA
________(______)
#Tukey's HSD
contrast(________(______, ____, method = "________")
Did that last one pass the test of non-additivity?
Going with that last intertidal example, if you really looked, it was a factorial design, with multiple treatments and conditions.
qplot(herbivores, sqrtarea, data=intertidal, fill=height, geom="boxplot")
We fit factorial models using one of two different notations - both expand to the same thing
intertidal_lm <- lm(sqrtarea ~ herbivores + height + herbivores:height, data=intertidal)
intertidal_lm <- lm(sqrtarea ~ herbivores*height, data=intertidal)
Both mean the same thing as :
is the interaction. *
just means, expand all the interactions.
But, after that’s done…all of the assumption tests are the same. Try them out.
Now, we can choose type II or III SS once we have >n=1 for simple effects. Let’s see the difference. Both are from Anova()
from the car package.
Anova(intertidal_lm)
## Anova Table (Type II tests)
##
## Response: sqrtarea
## Sum Sq Df F value Pr(>F)
## herbivores 1512.2 1 6.3579 0.014360 *
## height 89.0 1 0.3741 0.543096
## herbivores:height 2617.0 1 11.0029 0.001549 **
## Residuals 14270.5 60
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(intertidal_lm, method="III")
## Anova Table (Type II tests)
##
## Response: sqrtarea
## Sum Sq Df F value Pr(>F)
## herbivores 1512.2 1 6.3579 0.014360 *
## height 89.0 1 0.3741 0.543096
## herbivores:height 2617.0 1 11.0029 0.001549 **
## Residuals 14270.5 60
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Post-hocs are a bit funnier. But not by much. As we have an interaction, let’s look at the simple effects. There are a few ways we can do this
emmeans(intertidal_lm, ~ herbivores + height)
## herbivores height emmean SE df lower.CL upper.CL
## minus low 32.91450 3.855532 60 25.202290 40.62672
## plus low 10.40376 3.855532 60 2.691543 18.11597
## minus mid 22.48360 3.855532 60 14.771385 30.19581
## plus mid 25.55094 3.855532 60 17.838732 33.26316
##
## Confidence level used: 0.95
emmeans(intertidal_lm, ~ herbivores | height)
## height = low:
## herbivores emmean SE df lower.CL upper.CL
## minus 32.91450 3.855532 60 25.202290 40.62672
## plus 10.40376 3.855532 60 2.691543 18.11597
##
## height = mid:
## herbivores emmean SE df lower.CL upper.CL
## minus 22.48360 3.855532 60 14.771385 30.19581
## plus 25.55094 3.855532 60 17.838732 33.26316
##
## Confidence level used: 0.95
emmeans(intertidal_lm, ~ height | herbivores)
## herbivores = minus:
## height emmean SE df lower.CL upper.CL
## low 32.91450 3.855532 60 25.202290 40.62672
## mid 22.48360 3.855532 60 14.771385 30.19581
##
## herbivores = plus:
## height emmean SE df lower.CL upper.CL
## low 10.40376 3.855532 60 2.691543 18.11597
## mid 25.55094 3.855532 60 17.838732 33.26316
##
## Confidence level used: 0.95
Notice how each presents the information in a different way. The numbers are not different, they just show you information in different ways. The contrasts each reference grid implies do make s difference, though, in how p-value corrections for FWER is handled. Consider the first case.
contrast(emmeans(intertidal_lm, ~ herbivores + height), method = "tukey")
## contrast estimate SE df t.ratio p.value
## minus,low - plus,low 22.510748 5.452546 60 4.128 0.0006
## minus,low - minus,mid 10.430905 5.452546 60 1.913 0.2336
## minus,low - plus,mid 7.363559 5.452546 60 1.350 0.5350
## plus,low - minus,mid -12.079843 5.452546 60 -2.215 0.1307
## plus,low - plus,mid -15.147189 5.452546 60 -2.778 0.0357
## minus,mid - plus,mid -3.067346 5.452546 60 -0.563 0.9427
##
## P value adjustment: tukey method for comparing a family of 4 estimates
OK, cool. Every comparison is made. But what if we don’t want to do that? What if we just want to see if the herbivore effect differs by height?
contrast(emmeans(intertidal_lm, ~ herbivores |height), method = "tukey")
## height = low:
## contrast estimate SE df t.ratio p.value
## minus - plus 22.510748 5.452546 60 4.128 0.0001
##
## height = mid:
## contrast estimate SE df t.ratio p.value
## minus - plus -3.067346 5.452546 60 -0.563 0.5758
Notice that, because we’re only doing two contrasts, the correction is not as extreme. This method of contrasts might be more what you are interested in given your question. We can also see how this works visuall.
cont <- contrast(emmeans(intertidal_lm, ~ herbivores |height), method = "tukey")
plot(cont) +
geom_vline(xintercept = 0, color = "red", lty = 2)
You can then use CLD
or your own annotations to complete the visualization
Let’s just jump right in with an example, as you should have all of this well in your bones by now. This was from a kelp, predator-diversity experiment I ran ages ago. Note, some things that you want to be factors might be loaded as
kelp <- read.csv("./data/kelp_pred_div_byrnesetal2006.csv")
## Check and correct for non-factors
____________
_________
#Visualize
________(Treatment, Porp_Change, data=____, geom="____", fill=Trial)
#fit
kelp_lm <- __(______ ~ ______ * _____, data=________)
#assumptions
________(______, which=c(1,2,4,5))
residualPlots(_________)
#ANOVA
________(______)
#Tukey's HSD for simple effects
contrast(________(______, ____, method = "________")
So, the kelp example is an interesting one, as this standard workflow is not what I wanted when I ran this experiment. I was not interested in a Tukey test of all possible treatments. Run it with no adjustement - what do you see?
#Pariwise Comparison without P-Value adjustment - The LSD test
contrast(________(______, ____, method = "________", adjust = "_______")
Instead, I was interested in asking whether predator diversity - having a mixture versus only one species of predator - led to less kelp loss than any of the other treatments. There are a few ways to assess the answer to that question.
First, a Dunnet’s test with the Predator Mixture as the control. Try that out. Note, the default “control” is Dungeness crabs, so, you might want to revisit that.
#Dunnet's Test
contrast(________(______, ____, method = "________", ref = _____)
What did you learn?
So…. this was actually a replicated regression design. There are a few ways to deal with this. Note the column Predator_Diversity
Try this whole thing as a regression. What do you see?
Make a new column that is Predator_Diversity
as a factor. Refit the factorial ANOVA with this as your treatment. NOW try a Tukey test. What do you see?
OK, one more way to look at this. What we’re actually asking in comparing monocultures and polycultures is, do we explain more variation with a monoculture v. poyculture split than if not?
kelp_contr <- lm(Change_g ~ C(factor(Predator_Diversity), c(0,1,-1))*Trial, data=kelp)
Anova(kelp_contr)
## Anova Table (Type II tests)
##
## Response: Change_g
## Sum Sq Df F value
## C(factor(Predator_Diversity), c(0, 1, -1)) 87424 2 5.0689
## Trial 109084 1 12.6494
## C(factor(Predator_Diversity), c(0, 1, -1)):Trial 4440 2 0.2574
## Residuals 353570 41
## Pr(>F)
## C(factor(Predator_Diversity), c(0, 1, -1)) 0.0107863 *
## Trial 0.0009646 ***
## C(factor(Predator_Diversity), c(0, 1, -1)):Trial 0.7742620
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we see that, yes, we explain variation when we partition things into monoculture versus polyculture than when we do not.
Setting up a priori ways of partitioning your sums of squares (that must be orthogonal) is a powerful way to test grouping hypotheses and worth keeping in your back pocket for future explorations.