Comparing Many Means: ANOVA

class: center, middle

# Comparing Many Levels of One Category with the Linear Model

![:scale 70%](./images/anova/cat-egorical.jpg)

---
class: center, middle

# Etherpad
<br><br>
<center><h3>https://etherpad.wikimedia.org/p/607-anova-2020</h3></center>

---
class: center, middle

## Is this what you think of when you hear categorical variables?
![:scale 75%](./images/19/comparison-of-means-anova-you-must-do.jpg)

---
class: center, middle

### Wait, wait wait. What is ANOVA, what is this terminology? New words! AH! (or for some of you, old words - but - you must unlearn what you have learned)

---
class: center, middle

# THROW IT ALL OUT!

### We are talking about linear regression with categorical variables

---
class: center, middle

## What Else Did You Expect?
![:scale 75%](./images/anova/spiderman_anova_regression.png)

---

# Categorical Variables with Many Levels

`$$mass_i =  \beta_1 adelie_i + \beta_2 chinstrap_i + \beta_3 gentoo_i + \epsilon_i$$`
---
class:center

![](images/categorical_lm/corporate_categorical_lm.jpg)

---
# Many Means, One Category

1. What are models with categorical predictors?

2. Assumptions of models with categorical predictors

3. Evaluating fit models

4. Query: How different are groups?

---
# Categorical Predictors: Gene Expression and Mental Disorders

.pull-left[
![](./images/19/neuron_label.jpeg)
]

.pull-right[
![](./images/19/myelin.jpeg)
]

---
# The data
<img src="anova_1_files/figure-html/boxplot-1.png" style="display: block; margin: auto;" />

---
# Traditional Way to Think About Categories
<img src="anova_1_files/figure-html/meansplot-1.png" style="display: block; margin: auto;" />

What is the variance between groups v. within groups?

---
# But What is the Underlying Model  ?

---
# But What is the Underlying Model?

Underlying linear model with control = intercept, dummy variable for bipolar

---

# But What is the Underlying Model?

Underlying linear model with control = intercept, dummy variable for bipolar

---
# But What is the Underlying Model  ?

Underlying linear model with control = intercept, dummy variable for schizo

---

# But What is the Underlying Model?
<img src="anova_1_files/figure-html/ctl_schizo-1.png" style="display: block; margin: auto;" />

Underlying linear model with control = intercept, dummy variable for schizo

---
# Linear Dummy Variable (Fixed Effect) Model
`$$\large y_{ij} = \beta_{0} + \sum \beta_{j}x_{ij} + \epsilon_{ij}, \qquad x_{i} = 0,1$$`  
`$$\epsilon_{ij} \sim N(0, \sigma^{2})$$`

- i = replicate, j = group

- `$x_{ij}$` inidicates presence/abscence (1/0) of level j for individual i  
     - This coding is called a **Dummy variable**

- Note similarities to a linear regression

- One category set to `$\beta_{0}$` for ease of fitting, and other `$\beta$`s are different from it

- Or `$\beta_{0}$` = 0

---
# A Simpler Way to Write: The Means Model
`$$\large y_{ij} = \alpha_{j} + \epsilon_{ij}$$`  
`$$\epsilon_{ij} \sim N(0, \sigma^{2} )$$`

- i = replicate, j = group

- Different mean for each group

- Focus is on specificity of a categorical predictor

---
# Partioning Model to See What Varies

`$$\large y_{ij} = \bar{y} + (\bar{y}_{j} - \bar{y}) + ({y}_{ij} - \bar{y}_{j})$$`

- i = replicate, j = group

- Shows partitioning of variation  
     - Between group v. within group variation

- Consider `$\bar{y}$` an intercept, deviations from intercept by treatment, and residuals

- Can Calculate this with a fit model to answer questions - it's a relic of a bygone era
     - That bygone era has some good papers, so, you should recognize this
     
     
---
# Let's Fit that Model

**Using Least Squares**

```r
brain_lm <- lm(PLP1.expression ~ group, data=brainGene)

tidy(brain_lm) |> 
  select(-c(4:5)) |>
  knitr::kable(digits = 3) |>
  kableExtra::kable_styling()
```

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> -0.004 </td>
   <td style="text-align:right;"> 0.048 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> groupschizo </td>
   <td style="text-align:right;"> -0.191 </td>
   <td style="text-align:right;"> 0.068 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> groupbipolar </td>
   <td style="text-align:right;"> -0.259 </td>
   <td style="text-align:right;"> 0.068 </td>
  </tr>
</tbody>
</table>

---
# Many Means, One Category

1. What are models with categorical predictors?

2. .red[Assumptions of models with categorical predictors]

3. Evaluating fit models

4. Query: How different are groups?

---
# Assumptions of Linear Models with Categorical Variables - Same as Linear Regression!

-   Independence of data points

-   No relationship between fitted and residual values

-   Homoscedasticity (homogeneity of variance) of groups  
       - This is just an extension of `$\epsilon_i \sim N(0, \sigma)$` where `$\sigma$` is constant across all groups
  
-   Normality within groups (of residuals)

- No excess leverage, etc....

---
# Fitted v. Residuals for Linearity
<img src="anova_1_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

Are **residual** variances consistent across groups?

---
# Testing HOV with Standardized Residuals

We are all good!

---
# Normality of Residuals

---
# Normality of Residuals: QQ

---
# What do I do if I Violate Assumptions?

-   Nonparametric Kruskal-Wallace (rank transform)

-   log(x+1), asinh(x), or otherwise transform

-   Model the variance or a GLM (two weeks!)
     - likely not a huge difference here

---
# Many Means, One Category

1. What are models with categorical predictors?

2. Assumptions of models with categorical predictors

3. .red[Evaluating fit models]

4. Query: How different are groups?

---
# R Fits with Treatment Contrasts
`$$y_{ij} = \beta_{0} + \sum \beta_{j}x_{ij} + \epsilon_{ij}$$`

What does this mean?

- Intercept ($\beta_{0}$) = the average value associated with being in the control group

- Others = the average difference between control and each other group

- Note: Order is alphabetical

---
# Actual Group Means

`$$y_{ij} = \alpha_{j} + \epsilon_{ij}$$`

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> group </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> control </td>
   <td style="text-align:right;"> -0.0040000 </td>
   <td style="text-align:right;"> 0.0479786 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> schizo </td>
   <td style="text-align:right;"> -0.1953333 </td>
   <td style="text-align:right;"> 0.0479786 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bipolar </td>
   <td style="text-align:right;"> -0.2626667 </td>
   <td style="text-align:right;"> 0.0479786 </td>
  </tr>
</tbody>
</table>

What does this mean?

Being in group j is associated with an average outcome of y.

---

# What's the best way to see this?
.center[ ![:scale 75%](./images/anova/barplot_viz.jpg)]

---
# Many Ways to Visualize

---
# Many Ways to Visualize

---
# Many Ways to Visualize

---
# How Well Do Groups Explain Variation in Response Data?

We can look at fit to data - even in categorical data!

```
# R2 for Linear Regression
       R2: 0.271
  adj. R2: 0.237
```

But, remember, this is based on the sample at hand.

--
Adjusted R<sup>2</sup>: adjusts for sample size and model complexity (k = # params = # groups)

`$$R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$$`

---
# Many Means, One Category

1. What are models with categorical predictors?

2. Assumptions of models with categorical predictors

3. Evaluating fit models

4. .red[Query: How different are groups?]

---
# Which groups are different from each other?
<img src="anova_1_files/figure-html/meansplot-1.png" style="display: block; margin: auto;" />

Many mini-linear models with two means....multiple comparisons!

---
# Post-Hoc Means Comparisons: Which groups are different from one another?

- Each group has a mean and SE

- We can calculate a comparison for each

- BUT, we lose precision as we keep resampling the model  
  
- Remember, for every time we look at a system, we have some % of our CI not overlapping the true value

- Each time we compare means, we have a chance of our CI not covering the true value

- To minimize this possibility, we correct (widen) our CIs for this **Family-Wise Error Rate**

---
# Solutions to Multiple Comparisons and Family-wise Error Rate?

1. Ignore it - 
     + Just a bunch of independent linear models

--
2. Increase your CI given m = # of comparisons
     + If 1 - CI of interest = `$\alpha$`  
     + Bonferroni Correction `$\alpha/ = \alpha/m$`
     + False Discovery Rate `$\alpha/ = k\alpha/m$` where k is rank of test

--
3. Other multiple comparison corrections
    + Tukey's Honestly Significant Difference  
    + Dunnett's Compare to Control

---

# No Correction: Least Square Differences
<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> contrast </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> control - schizo </td>
   <td style="text-align:right;"> 0.1913333 </td>
   <td style="text-align:right;"> 0.0544024 </td>
   <td style="text-align:right;"> 0.3282642 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> control - bipolar </td>
   <td style="text-align:right;"> 0.2586667 </td>
   <td style="text-align:right;"> 0.1217358 </td>
   <td style="text-align:right;"> 0.3955976 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> schizo - bipolar </td>
   <td style="text-align:right;"> 0.0673333 </td>
   <td style="text-align:right;"> -0.0695976 </td>
   <td style="text-align:right;"> 0.2042642 </td>
  </tr>
</tbody>
</table>

---
# Bonferroni Corrections
<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> contrast </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> control - schizo </td>
   <td style="text-align:right;"> 0.1913333 </td>
   <td style="text-align:right;"> 0.0221330 </td>
   <td style="text-align:right;"> 0.3605337 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> control - bipolar </td>
   <td style="text-align:right;"> 0.2586667 </td>
   <td style="text-align:right;"> 0.0894663 </td>
   <td style="text-align:right;"> 0.4278670 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> schizo - bipolar </td>
   <td style="text-align:right;"> 0.0673333 </td>
   <td style="text-align:right;"> -0.1018670 </td>
   <td style="text-align:right;"> 0.2365337 </td>
  </tr>
</tbody>
</table>

---
# Tukey's Honestly Significant Difference
<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> contrast </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> control - schizo </td>
   <td style="text-align:right;"> 0.1913333 </td>
   <td style="text-align:right;"> 0.0264873 </td>
   <td style="text-align:right;"> 0.3561793 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> control - bipolar </td>
   <td style="text-align:right;"> 0.2586667 </td>
   <td style="text-align:right;"> 0.0938207 </td>
   <td style="text-align:right;"> 0.4235127 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> schizo - bipolar </td>
   <td style="text-align:right;"> 0.0673333 </td>
   <td style="text-align:right;"> -0.0975127 </td>
   <td style="text-align:right;"> 0.2321793 </td>
  </tr>
</tbody>
</table>

---
# Visualizing Comparisons (Tukey)
<img src="anova_1_files/figure-html/tukey-viz-1.png" style="display: block; margin: auto;" />

---
# Dunnett's Comparison to Controls
<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> contrast </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> schizo - control </td>
   <td style="text-align:right;"> -0.1913333 </td>
   <td style="text-align:right;"> -0.3474491 </td>
   <td style="text-align:right;"> -0.0352176 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bipolar - control </td>
   <td style="text-align:right;"> -0.2586667 </td>
   <td style="text-align:right;"> -0.4147824 </td>
   <td style="text-align:right;"> -0.1025509 </td>
  </tr>
</tbody>
</table>

---

# So, Many Levels of a category

- At the end of the dat, it's just another linear model  
  
- We can understand a lot about groups, though

- We can also query the model to compare groups

- To do more, we need an inferential framework