class: center, middle

# Linear Regression
![image](./images/11/correlation_xkcd.jpg)

---
# The Steps of Statistical Modeling

1. What is your question?
2. What model of the world matches your question?
3. Is your model valid?
4. Query your model to answer your question.

---
# Our question of the day: What is the relationship between inbreeding coefficient and litter size in wolves?

.pull-left[
<img src="linear_regression_files/figure-html/wolf_scatterplot-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<br><br>
![](./images/11/CUTE_WOLF_PUPS_by_horsesrock44.jpeg)
]

---
# Roll that beautiful linear regression with 95% CI footage

<img src="linear_regression_files/figure-html/fit-1.png" style="display: block; margin: auto;" />

---
# Regression to Be Mean

1. What is regression?
2. What do regression coefficients mean?
3. What do the error coefficients of a regression mean?
4. Correlation and Regression
5. Transformation and Model Structure for More Sensible Coefficients

---
# What is a regression?

.center[.Large[y = a + bx + error]]

--

This is 90% of the modeling you will ever do because...

--

Everything is a linear model!

- multiple parameters (x1, x2, etc...)
- nonlinear transformations of y or x
- multiplicative terms (b * x1 * x2) are still additive
- generalized linear models with non-normal error
- and so much more....
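
A minimal sketch of fitting `y = a + bx + error` in R. The data here are **simulated**, not the wolf data behind these slides: the data frame name `wolves_sim`, the sample size, and the range of x are assumptions for illustration, and the generating values are borrowed from the coefficient tables shown later in the deck.

```r
# Simulated stand-in data: NOT the real wolf data set.
# n and the range of the inbreeding coefficient are arbitrary choices.
set.seed(42)
n <- 50
wolves_sim <- data.frame(
  inbreeding.coefficient = runif(n, min = 0, max = 0.4)
)

# Generate y = a + b*x + error with normally distributed error
wolves_sim$pups <- 6.567 - 11.447 * wolves_sim$inbreeding.coefficient +
  rnorm(n, mean = 0, sd = 1.523)

# The same model, written as an R formula
fit <- lm(pups ~ inbreeding.coefficient, data = wolves_sim)

coef(fit)     # estimates of a (intercept) and b (slope)
sigma(fit)    # residual SD: the spread of the error term
confint(fit)  # 95% confidence intervals for a and b
```

Real litter sizes are whole numbers; the simulated `pups` values are continuous, which is fine for showing the mechanics of `lm()` but is one more way this sketch differs from the real data.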
---
class:center, middle

# EVERYTHING IS A LINEAR MODEL

---
# Linear Regression
<br>
`\(\Large y_i = \beta_0 + \beta_1 x_i + \epsilon_i\)`
<br>
`\(\Large \epsilon_i \sim^{i.i.d.} N(0, \sigma)\)`
<br><br>

.large[
Then it’s code in the data, give the keyboard a punch
Then cross-correlate and break for some lunch
Correlate, tabulate, process and screen
Program, printout, regress to the mean

-White Collar Holler by Nigel Russell
]

---
# Regressions You Have Seen

.large[
Classic style:

`$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$`
`$$\epsilon_i \sim N(0, \sigma)$$`
]

--

-----
.large[
Prediction as Part of Error:

`$$\hat{y_i} = \beta_0 + \beta_1 x_i$$`
`$$y_i \sim N(\hat{y_i}, \sigma)$$`
]

--

-----
.large[
Matrix Style:

`$$Y = X \beta + \epsilon$$`
]

---
# These All Are Equation-Forms of This Relationship

<img src="linear_regression_files/figure-html/fit-1.png" style="display: block; margin: auto;" />

---
# Regression to Be Mean

1. What is regression?
2. .red[What do regression coefficients mean?]
3. What do the error coefficients of a regression mean?
4. Correlation and Regression
5. Transformation and Model Structure for More Sensible Coefficients

---
# What are we doing with regression?

### Goals:

--

1. Association
    - What is the strength of a relationship between two quantities?
    - Not causal

--

2. Prediction
    - If we have two groups that differ in their X value by 1 unit, what is the average difference in their Y values?
    - Not causal

--

3. Counterfactual
    - What would happen to an individual if their value of X increased by one unit?
    - Causal reasoning!

---
# What Can We Say About This?
<img src="linear_regression_files/figure-html/fit-1.png" style="display: block; margin: auto;" />

---
# Model Coefficients: Slope

<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.567 </td> <td style="text-align:right;"> 0.791 </td> </tr>
  <tr> <td style="text-align:left;"> inbreeding.coefficient </td> <td style="text-align:right;"> -11.447 </td> <td style="text-align:right;"> 3.189 </td> </tr>
</tbody>
</table>

--

1. **Association:** A one unit increase in inbreeding coefficient is associated with ~11 fewer pups, on average.
2. **Prediction:** A new wolf with an inbreeding coefficient 1 unit greater than a second new wolf will have ~11 fewer pups, on average.
3. **Counterfactual:** If an individual wolf had had its inbreeding coefficient 1 unit higher, it would have ~11 fewer pups.

---
# Which of these is the correct thing to say? When?

1. **Association:** A one unit increase in inbreeding coefficient is associated with ~11 fewer pups, on average.
2. **Prediction:** A new wolf with an inbreeding coefficient 1 unit greater than a second new wolf will have ~11 fewer pups, on average.
3. **Counterfactual:** If an individual wolf had had its inbreeding coefficient 1 unit higher, it would have ~11 fewer pups.

---
# 11 Fewer Pups? What would be, then, a Better Way to Talk About this Slope?
<img src="linear_regression_files/figure-html/fit-1.png" style="display: block; margin: auto;" />

---
# Model Coefficients: Intercept

<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.567 </td> <td style="text-align:right;"> 0.791 </td> </tr>
  <tr> <td style="text-align:left;"> inbreeding.coefficient </td> <td style="text-align:right;"> -11.447 </td> <td style="text-align:right;"> 3.189 </td> </tr>
</tbody>
</table>

<br><br>

--

When the inbreeding coefficient is 0, a wolf will have ~6.6 pups, on average.

---
# Intercept Has Direct Interpretation on the Visualization

<img src="linear_regression_files/figure-html/fit-1.png" style="display: block; margin: auto;" />

---
# Regression to Be Mean

1. What is regression?
2. What do regression coefficients mean?
3. .red[What do the error coefficients of a regression mean?]
4. Correlation and Regression
5. Transformation and Model Structure for More Sensible Coefficients

---
# Two kinds of error

1. Fit error
    - error due to lack of precision in estimates
    - Coefficient SE
    - Precision of estimates

2. Residual error
    - error due to variability not explained by X.
    - Residual SD (from `\(\epsilon_i\)`)

---
# Precision: coefficient SEs

<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.567 </td> <td style="text-align:right;"> 0.791 </td> </tr>
  <tr> <td style="text-align:left;"> inbreeding.coefficient </td> <td style="text-align:right;"> -11.447 </td> <td style="text-align:right;"> 3.189 </td> </tr>
</tbody>
</table>

--

<br><br>

- Shows how precisely we can estimate the coefficients
- Gets smaller with bigger sample size!
- Remember, ~2 SE covers the 95% CI
- Comes from likelihood surface...but we'll get there

---
# Visualizing Precision: 95% CI (~2 SE)

<img src="linear_regression_files/figure-html/fit-1.png" style="display: block; margin: auto;" />

---
# Visualizing Precision with Simulation from your Model

<img src="linear_regression_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---
# Residual Error

<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:right;"> r.squared </th> <th style="text-align:right;"> sigma </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 0.369 </td> <td style="text-align:right;"> 1.523 </td> </tr>
</tbody>
</table>

- Sigma is the SD of the residual

`$$\Large \epsilon_i \sim N(0,\sigma)$$`

- How much does # of pups vary beyond the relationship with inbreeding coefficient?
- For any predicted average number of pups, ~68% of the observed # of pups will fall within ~1.5 of that number

---
# Visualizing Residual Error's Implications

<img src="linear_regression_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---
# Residual Error -> Variance Explained

<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:right;"> r.squared </th> <th style="text-align:right;"> sigma </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 0.369 </td> <td style="text-align:right;"> 1.523 </td> </tr>
</tbody>
</table>

- `\(\large{R^2 = 1 - \frac{\sigma^2_{residual}}{\sigma^2_y}}\)`
- Fraction of the variation in Y related to X.
- Here, 36.9% of the variation in pups is related to variation in Inbreeding Coefficient
- Relates to r, the Pearson correlation coefficient

---
# Regression to Be Mean

1. What is regression?
2. What do regression coefficients mean?
3. What do the error coefficients of a regression mean?
4. .red[Correlation and Regression]
5. Transformation and Model Structure for More Sensible Coefficients

---
# What is Correlation?

* The change in standard deviations of variable x per change in 1 SD of variable y
* Clear, right?
* Assesses the degree of association between two variables
* But, unitless (sort of)
* Between -1 and 1

---
# Calculating Correlation: Start with Covariance

Describes the relationship between two variables. Not scaled.

--

`\(\sigma_{xy}\)` = population level covariance, `\(s_{xy}\)` = covariance in your sample

--

.pull-left[
<br><br><br>
`$$s_{xy} = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{n-1}$$`
]

--

.pull-right[
<img src="linear_regression_files/figure-html/rnormPlot_cov-1.png" style="display: block; margin: auto;" />
]

---
# Pearson Correlation

Describes the relationship between two variables. Scaled between -1 and 1.
<br>
`\(\large \rho_{xy}\)` = population level correlation, `\(\large r_{xy}\)` = correlation in your sample

<div id="left" class="fragment">
<br><br><br>
`$$\Large\rho_{xy} = \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}}$$`
</div>

<div id="right" class="fragment">
<img src="linear_regression_files/figure-html/rnormPlot_cor-1.png" style="display: block; margin: auto;" />
</div>

---
# Assumptions of Pearson Correlation

.pull-left[
- Observations are from a **random sample**
- Each observation is **independent**
- X and Y are from a **Normal Distribution**
    - Weaker assumption
]

.pull-right[
<img src="linear_regression_files/figure-html/mvnorm_persp-1.png" style="display: block; margin: auto;" />
]

---
# The meaning of r

Y is perfectly predicted by X if r = -1 or 1.
<br><br>
`\(R^2\)` = the proportion of variation in y explained by x

<img src="linear_regression_files/figure-html/corLevels-1.png" style="display: block; margin: auto;" />

---
# Get r in your bones...

<br><br><br>
<center>.large[.middle[http://guessthecorrelation.com/]]</center>

---
# Example: Wolf Breeding and Litter Size

.pull-left[
<img src="linear_regression_files/figure-html/wolf_scatterplot-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<br><br>
![](./images/11/CUTE_WOLF_PUPS_by_horsesrock44.jpeg)
]

---
# Example: Wolf Inbreeding and Litter Size

Covariance Matrix:
```
                       inbreeding.coefficient  pups
inbreeding.coefficient                   0.01 -0.11
pups                                    -0.11  3.52
```

--

Correlation Matrix:
```
                       inbreeding.coefficient  pups
inbreeding.coefficient                   1.00 -0.61
pups                                    -0.61  1.00
```

--

Yes, you can estimate a SE (`cor.test()` or bootstrapping)

---
# Wait, so, how does Correlation relate to Regression?

Slope versus r...

`\(\LARGE b=\frac{s_{xy}}{s_{x}^2}\)` `\(= \frac{cov(x,y)}{var(x)}\)`

--

<br><br>
`\(\LARGE = r_{xy}\frac{s_{y}}{s_{x}}\)`

---
# Correlation v. Regression Coefficients

<img src="linear_regression_files/figure-html/cor_and_reg-1.png" style="display: block; margin: auto;" />

---
# Or really, r is just the slope of a fit lm with a z-transform of our response and predictor

`$$\Large z_i = \frac{x_i - \bar{x}}{\sigma_x}$$`

.large[
- When we z-transform variables, we put them on *the same scale*
- The covariance between two z-transformed variables is their correlation!
]

---
# Correlation versus Standardized Regression: It's the Same Picture

`$$z(y_i) = \beta_0 + \beta_1 z(x_i) + \epsilon_i$$`

<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 0.166 </td> </tr>
  <tr> <td style="text-align:left;"> inbreeding_std </td> <td style="text-align:right;"> -0.608 </td> <td style="text-align:right;"> 0.169 </td> </tr>
</tbody>
</table>

versus correlation: -0.608

---
class:center, middle

# EVERYTHING IS A LINEAR MODEL

---
# Regression to Be Mean

1. What is regression?
2. What do regression coefficients mean?
3. What do the error coefficients of a regression mean?
4. Correlation and Regression
5. .red[Transformation and Model Structure for More Sensible Coefficients]

---
# Modifying (Transforming) Your Regression: Centering Your X

- Many times X = 0 is silly
- E.g., if you use year, are you going to regress back to 0?
- Centering X allows you to evaluate a meaningful intercept - what is Y at the mean of X

---
# Centering X to generate a meaningful intercept

`$$x_{i \space centered} = x_i - mean(x)$$`

<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 3.958 </td> <td style="text-align:right;"> 0.311 </td> </tr>
  <tr> <td style="text-align:left;"> inbreeding.centered </td> <td style="text-align:right;"> -11.447 </td> <td style="text-align:right;"> 3.189 </td> </tr>
</tbody>
</table>

Intercept implies wolves with the average level of inbreeding in this study have ~4 pups. Wolves with higher inbreeding have fewer pups; wolves with lower inbreeding have more.

---
# Centering X to generate a meaningful intercept

<img src="linear_regression_files/figure-html/centplot-1.png" style="display: block; margin: auto;" />

---
# Modifying (Transforming) Your Regression: Log Transform of Y

- Often, Y cannot be negative
- And/or the process generating Y is *multiplicative*
- Log(Y) can fix this and other sins.
- **VERY** common, but, what do the coefficients mean?
- `\(exp(\beta_1) - 1 \approx\)` proportional change in Y (multiply by 100 for percent) for a 1 unit change in X

---
# Other Ways of Looking at This Relationship: Log Transformation of Y

`$$log(y_i) = \beta_0 + \beta_1 x_i + \epsilon_i$$`

- relationship is now curved
- cannot have negative pups (yay!)
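
Why does exponentiating the slope give a proportional change? A short derivation, setting the error term aside and writing `\(y(x)\)` for the back-transformed prediction at x (this notation is introduced here for illustration, not taken from the original slides):

`$$\frac{y(x+1)}{y(x)} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$$`

`$$\frac{y(x+1) - y(x)}{y(x)} = e^{\beta_1} - 1$$`

So each 1 unit step in x multiplies y by `\(e^{\beta_1}\)`, whatever the starting value of x.
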
<img src="linear_regression_files/figure-html/logplot-1.png" style="display: block; margin: auto;" />

---
# Model Coefficients: Log Slope

<table class="table table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1.944 </td> <td style="text-align:right;"> 0.215 </td> </tr>
  <tr> <td style="text-align:left;"> inbreeding.coefficient </td> <td style="text-align:right;"> -2.994 </td> <td style="text-align:right;"> 0.869 </td> </tr>
</tbody>
</table>

--

To understand the coefficient, remember
`$$y_i = e^{\beta_0 + \beta_1 x_i + \epsilon_i}$$`

exp(-2.994)-1 = -0.95, so, a 1 unit increase in x is associated with y losing 95% of its value, so...

--

**Association:** A one unit increase in inbreeding coefficient is associated with having 95% fewer pups, on average.

---
# You are now a Statistical Wizard. Be Careful. Your Model is a Golem.
(sensu Richard McElreath)

.center[.middle[![:scale 45%](images/09/golem.png)]]