class: center, middle # Comparing Many Levels of One Category with the Linear Model  --- class: center, middle # Etherpad <br><br> <center><h3>https://etherpad.wikimedia.org/p/607-anova-2020</h3></center> --- class: center, middle ## Is this what you think of when you hear categorical variables?  --- class: center, middle ### Wait, wait wait. What is ANOVA, what is this terminology? New words! AH! (or for some of you, old words - but - you must unlearn what you have learned) --- class: center, middle # THROW IT ALL OUT! -- ### We are talking about linear regression with categorical variables --- class: center, middle ## What Else Did You Expect?  --- # Categorical Variables with Many Levels <img src="anova_1_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" /> `$$mass_i = \beta_1 adelie_i + \beta_2 chinstrap_i + \beta_3 gentoo_i + \epsilon_i$$` --- class:center  --- # Many Means, One Category 1. What are models with categorical predictors? 2. Assumptions of models with categorical predictors 3. Evaluating fit models 4. Query: How different are groups? --- # Categorical Predictors: Gene Expression and Mental Disorders .pull-left[  ] .pull-right[  ] --- # The data <img src="anova_1_files/figure-html/boxplot-1.png" style="display: block; margin: auto;" /> --- # Traditional Way to Think About Categories <img src="anova_1_files/figure-html/meansplot-1.png" style="display: block; margin: auto;" /> What is the variance between groups v. within groups? --- # But What is the Underlying Model ? <img src="anova_1_files/figure-html/brainGene_points-1.png" style="display: block; margin: auto;" /> --- # But What is the Underlying Model? <img src="anova_1_files/figure-html/brainGene_points_fit-1.png" style="display: block; margin: auto;" /> -- Underlying linear model with control = intercept, dummy variable for bipolar --- # But What is the Underlying Model? <img src="anova_1_files/figure-html/brainGene_points_fit1-1.png" style="display: block; margin: auto;" /> Underlying linear model with control = intercept, dummy variable for bipolar --- # But What is the Underlying Model ? <img src="anova_1_files/figure-html/brainGene_points_fit_2-1.png" style="display: block; margin: auto;" /> Underlying linear model with control = intercept, dummy variable for schizo --- # But What is the Underlying Model? <img src="anova_1_files/figure-html/ctl_schizo-1.png" style="display: block; margin: auto;" /> Underlying linear model with control = intercept, dummy variable for schizo --- # Linear Dummy Variable (Fixed Effect) Model `$$\large y_{ij} = \beta_{0} + \sum \beta_{j}x_{ij} + \epsilon_{ij}, \qquad x_{i} = 0,1$$` `$$\epsilon_{ij} \sim N(0, \sigma^{2})$$` - i = replicate, j = group - `\(x_{ij}\)` inidicates presence/abscence (1/0) of level j for individual i - This coding is called a **Dummy variable** - Note similarities to a linear regression - One category set to `\(\beta_{0}\)` for ease of fitting, and other `\(\beta\)`s are different from it - Or `\(\beta_{0}\)` = 0 --- # A Simpler Way to Write: The Means Model `$$\large y_{ij} = \alpha_{j} + \epsilon_{ij}$$` `$$\epsilon_{ij} \sim N(0, \sigma^{2} )$$` - i = replicate, j = group - Different mean for each group - Focus is on specificity of a categorical predictor --- # Partioning Model to See What Varies `$$\large y_{ij} = \bar{y} + (\bar{y}_{j} - \bar{y}) + ({y}_{ij} - \bar{y}_{j})$$` - i = replicate, j = group - Shows partitioning of variation - Between group v. within group variation - Consider `\(\bar{y}\)` an intercept, deviations from intercept by treatment, and residuals - Can Calculate this with a fit model to answer questions - it's a relic of a bygone era - That bygone era has some good papers, so, you should recognize this --- # Let's Fit that Model **Using Least Squares** ```r brain_lm <- lm(PLP1.expression ~ group, data=brainGene) tidy(brain_lm) |> select(-c(4:5)) |> knitr::kable(digits = 3) |> kableExtra::kable_styling() ``` <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.004 </td> <td style="text-align:right;"> 0.048 </td> </tr> <tr> <td style="text-align:left;"> groupschizo </td> <td style="text-align:right;"> -0.191 </td> <td style="text-align:right;"> 0.068 </td> </tr> <tr> <td style="text-align:left;"> groupbipolar </td> <td style="text-align:right;"> -0.259 </td> <td style="text-align:right;"> 0.068 </td> </tr> </tbody> </table> --- # Many Means, One Category 1. What are models with categorical predictors? 2. .red[Assumptions of models with categorical predictors] 3. Evaluating fit models 4. Query: How different are groups? --- # Assumptions of Linear Models with Categorical Variables - Same as Linear Regression! - Independence of data points - No relationship between fitted and residual values - Homoscedasticity (homogeneity of variance) of groups - This is just an extension of `\(\epsilon_i \sim N(0, \sigma)\)` where `\(\sigma\)` is constant across all groups - Normality within groups (of residuals) - No excess leverage, etc.... --- # Fitted v. Residuals for Linearity <img src="anova_1_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> Are **residual** variances consistent across groups? --- # Testing HOV with Standardized Residuals <img src="anova_1_files/figure-html/brainGene_levene-1.png" style="display: block; margin: auto;" /> We are all good! --- # Normality of Residuals <img src="anova_1_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- # Normality of Residuals: QQ <img src="anova_1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> --- # What do I do if I Violate Assumptions? - Nonparametric Kruskal-Wallace (rank transform) - log(x+1), asinh(x), or otherwise transform - Model the variance or a GLM (two weeks!) - likely not a huge difference here --- # Many Means, One Category 1. What are models with categorical predictors? 2. Assumptions of models with categorical predictors 3. .red[Evaluating fit models] 4. Query: How different are groups? --- # R Fits with Treatment Contrasts `$$y_{ij} = \beta_{0} + \sum \beta_{j}x_{ij} + \epsilon_{ij}$$` <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.004 </td> <td style="text-align:right;"> 0.048 </td> </tr> <tr> <td style="text-align:left;"> groupschizo </td> <td style="text-align:right;"> -0.191 </td> <td style="text-align:right;"> 0.068 </td> </tr> <tr> <td style="text-align:left;"> groupbipolar </td> <td style="text-align:right;"> -0.259 </td> <td style="text-align:right;"> 0.068 </td> </tr> </tbody> </table> -- What does this mean? -- - Intercept ($\beta_{0}$) = the average value associated with being in the control group - Others = the average difference between control and each other group - Note: Order is alphabetical --- # Actual Group Means `$$y_{ij} = \alpha_{j} + \epsilon_{ij}$$` <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> group </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> control </td> <td style="text-align:right;"> -0.0040000 </td> <td style="text-align:right;"> 0.0479786 </td> </tr> <tr> <td style="text-align:left;"> schizo </td> <td style="text-align:right;"> -0.1953333 </td> <td style="text-align:right;"> 0.0479786 </td> </tr> <tr> <td style="text-align:left;"> bipolar </td> <td style="text-align:right;"> -0.2626667 </td> <td style="text-align:right;"> 0.0479786 </td> </tr> </tbody> </table> -- What does this mean? -- Being in group j is associated with an average outcome of y. --- # What's the best way to see this? .center[ ] --- # Many Ways to Visualize <img src="anova_1_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- # Many Ways to Visualize <img src="anova_1_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- # Many Ways to Visualize <img src="anova_1_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- # How Well Do Groups Explain Variation in Response Data? We can look at fit to data - even in categorical data! ``` # R2 for Linear Regression R2: 0.271 adj. R2: 0.237 ``` -- But, remember, this is based on the sample at hand. -- Adjusted R<sup>2</sup>: adjusts for sample size and model complexity (k = # params = # groups) `$$R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$$` --- # Many Means, One Category 1. What are models with categorical predictors? 2. Assumptions of models with categorical predictors 3. Evaluating fit models 4. .red[Query: How different are groups?] --- # Which groups are different from each other? <img src="anova_1_files/figure-html/meansplot-1.png" style="display: block; margin: auto;" /> -- Many mini-linear models with two means....multiple comparisons! --- # Post-Hoc Means Comparisons: Which groups are different from one another? - Each group has a mean and SE - We can calculate a comparison for each - BUT, we lose precision as we keep resampling the model - Remember, for every time we look at a system, we have some % of our CI not overlapping the true value - Each time we compare means, we have a chance of our CI not covering the true value - To minimize this possibility, we correct (widen) our CIs for this **Family-Wise Error Rate** --- # Solutions to Multiple Comparisons and Family-wise Error Rate? 1. Ignore it - + Just a bunch of independent linear models -- 2. Increase your CI given m = # of comparisons + If 1 - CI of interest = `\(\alpha\)` + Bonferroni Correction `\(\alpha/ = \alpha/m\)` + False Discovery Rate `\(\alpha/ = k\alpha/m\)` where k is rank of test -- 3. Other multiple comparison corrections + Tukey's Honestly Significant Difference + Dunnett's Compare to Control --- # No Correction: Least Square Differences <table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> contrast </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> control - schizo </td> <td style="text-align:right;"> 0.1913333 </td> <td style="text-align:right;"> 0.0544024 </td> <td style="text-align:right;"> 0.3282642 </td> </tr> <tr> <td style="text-align:left;"> control - bipolar </td> <td style="text-align:right;"> 0.2586667 </td> <td style="text-align:right;"> 0.1217358 </td> <td style="text-align:right;"> 0.3955976 </td> </tr> <tr> <td style="text-align:left;"> schizo - bipolar </td> <td style="text-align:right;"> 0.0673333 </td> <td style="text-align:right;"> -0.0695976 </td> <td style="text-align:right;"> 0.2042642 </td> </tr> </tbody> </table> --- # Bonferroni Corrections <table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> contrast </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> control - schizo </td> <td style="text-align:right;"> 0.1913333 </td> <td style="text-align:right;"> 0.0221330 </td> <td style="text-align:right;"> 0.3605337 </td> </tr> <tr> <td style="text-align:left;"> control - bipolar </td> <td style="text-align:right;"> 0.2586667 </td> <td style="text-align:right;"> 0.0894663 </td> <td style="text-align:right;"> 0.4278670 </td> </tr> <tr> <td style="text-align:left;"> schizo - bipolar </td> <td style="text-align:right;"> 0.0673333 </td> <td style="text-align:right;"> -0.1018670 </td> <td style="text-align:right;"> 0.2365337 </td> </tr> </tbody> </table> --- # Tukey's Honestly Significant Difference <table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> contrast </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> control - schizo </td> <td style="text-align:right;"> 0.1913333 </td> <td style="text-align:right;"> 0.0264873 </td> <td style="text-align:right;"> 0.3561793 </td> </tr> <tr> <td style="text-align:left;"> control - bipolar </td> <td style="text-align:right;"> 0.2586667 </td> <td style="text-align:right;"> 0.0938207 </td> <td style="text-align:right;"> 0.4235127 </td> </tr> <tr> <td style="text-align:left;"> schizo - bipolar </td> <td style="text-align:right;"> 0.0673333 </td> <td style="text-align:right;"> -0.0975127 </td> <td style="text-align:right;"> 0.2321793 </td> </tr> </tbody> </table> --- # Visualizing Comparisons (Tukey) <img src="anova_1_files/figure-html/tukey-viz-1.png" style="display: block; margin: auto;" /> --- # Dunnett's Comparison to Controls <table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> contrast </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> schizo - control </td> <td style="text-align:right;"> -0.1913333 </td> <td style="text-align:right;"> -0.3474491 </td> <td style="text-align:right;"> -0.0352176 </td> </tr> <tr> <td style="text-align:left;"> bipolar - control </td> <td style="text-align:right;"> -0.2586667 </td> <td style="text-align:right;"> -0.4147824 </td> <td style="text-align:right;"> -0.1025509 </td> </tr> </tbody> </table> <img src="anova_1_files/figure-html/dunnett-1.png" style="display: block; margin: auto;" /> --- # So, Many Levels of a category - At the end of the dat, it's just another linear model - We can understand a lot about groups, though - We can also query the model to compare groups - To do more, we need an inferential framework