Linear Regression

Believe it or not, despite all of the complexity under the hood, fitting a linear model in R with least squares is quite simple with a straightfoward workflow.

Load the data
Visualize the data - just to detect problems and perform a cursory test of assumptions!
Fit the model.
Use the fit model to test assumptions
Evaluate the model
Visualize the fit model

Let’s go through each step with an example of seals. Are older seals larger?

0. Load and visualize the data

OK, this should be second nature by now. Load it. Make a plot of just the data.

library(dplyr)
library(ggplot2)
theme_set(theme_bw(base_size = 16))

seals <- read.csv("lab/data/17e8ShrinkingSeals Trites 1996.csv")

seal_base <- ggplot(seals, aes(x=age.days, y=length.cm)) +
  geom_point() 

seal_base

Neat data set, no?

Now, looking at this from the get-go, we can see it’s likely nonlinear. Maybe even non-normal! Let’s ignore that for now, as it will make the results great fodder for our diagnostics!

1. Fitting a model

OK, so, you have the data. And in your model, you want to see how age is a predictor of length. How do you fit it?

Nothing could be simpler. R has a function called lm() which stands for linear model. It works just like base plot, with the y ~ x syntax. And we create a fit model object.

seal_lm <- lm(length.cm ~ age.days, data=seals)

That’s it.

Now, if we want to just peak at the fit, before going forward, we can use coef() which is pretty standard across all R fit model functions.

coef(seal_lm)

 (Intercept)     age.days 
1.157668e+02 2.370559e-03

But that’s getting ahead of ourselves…

2. Evaluating Assumptions

There are a number of base R ways to test assumptions. I’ll show some to reference them, but, really, the easystats::performance library is where it is at to check them all. I’ll go over pieces one by one, but, the check_model() function is your one stop shop. Let’s look at it briefly here before digging into details.

library(performance)
check_model(seal_lm)

There’s a corresponding base method, but I like this one better

par(mfrow = c(2,2))
plot(seal_lm)

par(mfrow=c(1,1))

2a. Does our model match our data?

Yes, we can simulate by hand (maybe on the homework)! but, the check_predictions() function will run simulations for us to look at. Let’s do it with 100 iterations.

check_predictions(seal_lm,
                  iterations = 100)

Nice! The model and data match very well.

2b. Linearity

There are a lot of ways we might want to look at the fit and residual. Looking at the raw fitted versus residual value is a great check for linearity. There should be no pattern to the result. We do get this in the main check_model() - but right now there’s not a simple function to get this in performance alone. So, we can do it ourselves with the excellent modelr package!

library(modelr)

seals <- seals |>
  add_predictions(seal_lm) |>
  add_residuals(seal_lm)

This is useful, as we can do a LOT of things on our own with predictions and residuals. For the moment, lets plot them with a spline to aid in visualization!

ggplot(seals,
       mapping = aes(x = pred, y = resid)) +
  geom_point() +
  stat_smooth()

Or, you can do this with a part of check_model()

check_model(seal_lm, check = "ncv") |> plot()

Note anything odd? It is a little football shaped. This could be a nonlinearity, or not quite fitting into a normal distribution with a constant variance. Let’s explore those!

2c. Normality

Time for my favorite set of plots - looking at the density of the residuals and the qq plot. These are both produced by the same function in performance - check_normality() and only differ in how we call the plot method.

check_normality(seal_lm) |> plot()

Well that looks PRETTY good. A but flat at the top, but overall, wow. Let’s check out the qq plot

check_normality(seal_lm) |> plot(type = "qq")

Again, no real problems.

We might suspect heteroskedasticity, and we can dig deeper into that by looking at the fitted v. sart(abs(std. residuals)) relationship. In essence, for a standardized residual (q) of point i

\(q_i = r_i/s^2\)

This scales the residuals by the std dev, so that they should be closer to a standard normal, and therefore easier to see deviations. Square-rooting also reduces the effects of large swings in residuals.

check_heteroscedasticity(seal_lm) |>
  plot()

This - looks pretty good. If we are really worried about our fit-residual plot, it might therefore be more worth investigating nonlinearity. Still, other things might be afoot, such as

2d. Outliers

We can first look at Cook’s D for each individual data point and see if any of those should be flagged. The Cook’s D values

plot(seal_lm, which=4)

All quite small!

check_outliers(seal_lm) |> plot(type = "bar")

These are standardized to 1, but, if anything was bad, performance would flag it. Same pattern of no real pattern, though.

If we want to combine it with where a variable is located to look at the leverage plot

check_outliers(seal_lm) |> plot()

Eh, just fine!

3. What does our model mean?

OK, ok, everything looks fine. Now, what are the implications of the model?

We can get a lot of information from summary() - just like looking at any R object.

summary(seal_lm)


Call:
lm(formula = length.cm ~ age.days, data = seals)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.8700  -3.8279   0.0304   3.7541  21.6874 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.158e+02  1.764e-01  656.37   <2e-16 ***
age.days    2.371e-03  4.467e-05   53.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.681 on 9663 degrees of freedom
Multiple R-squared:  0.2256,    Adjusted R-squared:  0.2256 
F-statistic:  2816 on 1 and 9663 DF,  p-value: < 2.2e-16

This is a lot of information to drink in - function call, distribution of residuals, coefficient estimates and error, and multiple pieces of information about total fit.

We may want to get this information in a more condensed form for use in other contexts - particularly to compare against other models. For that, there’s a wonderful packages called broom that sweeps up your model into easy digestable pieces.

First, the coefficient table - let’s make it pretty.

library(broom)

tidy(seal_lm)

# A tibble: 2 × 5
  term         estimate std.error statistic p.value
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept) 116.      0.176         656.        0
2 age.days      0.00237 0.0000447      53.1       0

Nice. Very tidy, and easy to use. We can also lop off those last two columns if we do not want them.

If we want to get information about fit, there’s glance()

glance(seal_lm)

# A tibble: 1 × 12
  r.squared adj.r.sq…¹ sigma stati…² p.value    df  logLik    AIC    BIC devia…³
      <dbl>      <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
1     0.226      0.226  5.68   2816.       0     1 -30502. 61009. 61031. 311807.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹adj.r.squared, ²statistic, ³deviance

That’s…. a lot. But, it’s all sorted well.

Of course, performance can give us some of this as well, but not in a tidy format.

model_performance(seal_lm)

# Indices of model performance

AIC       |       BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------
61009.052 | 61030.581 | 0.226 |     0.226 | 5.680 | 5.681

This goes well in Markdown. But, you can use lots of things to put a table into markdown, so, meh.

4. Visualization

Lovely! Now, how do we visualize the fit and fit prediction error?

In ggplot2 we can use the smoother, stat_smooth in conjunction with method = "lm" to get the job done.

seal_fit_plot <- ggplot(data=seals) +
  aes(x=age.days, y=length.cm) +
  geom_point() +
  stat_smooth(method="lm")

seal_fit_plot

Note - you cannot see the fit interval because our SE is so small with such a large N.

What about prediction intervals? There are a few different ways to do this. We can kludge something together with predict() from our basic stats package in R. We can use performance::get_predictions. Or we can use broom::augment which gives a wealth of information. I really like broom::augment() because it also works to give a lot of raw info on the fit model itself.

augment(seal_lm) |> head()

# A tibble: 6 × 8
  length.cm age.days .fitted .resid     .hat .sigma     .cooksd .std.resid
      <int>    <int>   <dbl>  <dbl>    <dbl>  <dbl>       <dbl>      <dbl>
1       131     5337    128.  2.58  0.000263   5.68 0.0000272       0.455 
2       123     2081    121.  2.30  0.000272   5.68 0.0000223       0.405 
3       122     2085    121.  1.29  0.000271   5.68 0.00000700      0.227 
4       136     4299    126. 10.0   0.000123   5.68 0.000193        1.77  
5       122     2861    123. -0.549 0.000150   5.68 0.000000701    -0.0966
6       131     5052    128.  3.26  0.000212   5.68 0.0000348       0.573

And it’s nice and tidy!

For new data, though, we need to start by making a new data frame, and go from there. You can make it yourself:

pred_frame <- data.frame(
  age.days = min(seals$age.days):max(seals$age.days))

Or use modelr::data_grid() do it for you based on the properties of the data. This function can be REALLY convenient when it comes to generating predictions over lots of different predictors - we’ll see those in two weeks!

pred_frame <- seals |>
  data_grid(age.days = seq_range(age.days, 100))

This is nice as you can have a more tidy workflow. Now, let’s see broom::augment() in action

seal_pred <- pred_frame |>
  augment(x = seal_lm, 
          data = .,
          interval = "prediction")

head(seal_pred)

# A tibble: 6 × 4
  age.days .fitted .lower .upper
     <dbl>   <dbl>  <dbl>  <dbl>
1     958     118.   107.   129.
2    1033.    118.   107.   129.
3    1107.    118.   107.   130.
4    1182.    119.   107.   130.
5    1257.    119.   108.   130.
6    1331.    119.   108.   130.

Nice! We have upper and lower CIs for prediction!

We can then add this to our plot using the geom_ribbon which takes a ymin and ymax argument to generate a ribbon - like the fit standard error ribbon.

seal_fit_plot +
  geom_ribbon(data=seal_pred, 
              mapping=aes(y = .fitted, 
                          ymin=.lower, 
                          ymax=.upper),
              fill="grey", alpha=0.4) +
  theme_bw()

Now we can better see the prediction 95% interval - and that we do have some points that fall outside of it. Note, we can pass an argument conf.level to see things other than the 95%

5. Faded Examples.

A Fat Model

Fist, the relationship between how lean you are and how quickly you lose fat. Implement this to get a sense ot the general workflow for analysis

library(ggplot2)
fat <- read.csv("lab/data/17q04BodyFatHeatLoss Sloan and Keatinge 1973 replica.csv")

# initial visualization to determine if lm is appropriate
fat_plot <- ggplot(data=fat, aes(x=leanness, y=lossrate)) + 
  geom_point()
fat_plot

# fit the model
fat_mod <- lm(lossrate ~ leanness, data=fat)

# assumptions
check_model(fat_mod)

# look at parameters
tidy(fat_mod)

# plot with line and prediction CI
fat_plot + 
  stat_smooth(method = lm, formula = y ~ x)

An Itchy Followup

For your first faded example, let’s look at the relationship between DEET and mosquito bites.

deet <- read.csv("lab/data/17q24DEETMosquiteBites.csv")

# initial visualization to determine if lm is appropriate
deet_plot <- ggplot(data=___, 
                    mapping = aes(x=dose, y=bites)) + 
  geom_point()

deet_plot

# fit the model
deet_mod <- lm(___ ~ ___, data=deet)

# assumptions
check_model(___)

#look at parameters
tidy(___)

#plot with line
deet_plot + 
  stat_smooth(method = lm, formula = y~x)

6. Transformation for Nonlinearity

One of the most common reasons for a linear model to not fit is a nonlinearity in the data generating process. Often in nature we encounter exponential processes with a log-normal error structure. This is common in count data. Now, really, it’s often a poisson distributed variable, but, to a first approximation, log-transformation can often help fix what ails your model and give you reasonable estimates for our tests. We’ll talk later in the course about why this isn’t the best idea, and why you should start with a nonlinear/non-normal model to begin with.

Let’s practice the workflow of how we handle this log transformation.

Was that relationship linear?

We might suspect that the relationship was nonlinear. Let’s see how a simple log transform works here. Note the modifications to model fitting and stat_smooth.

# fit the model
deet_mod_log <- lm(log(bites) ~ dose, data = deet)

# assumptions
check_model(___)

# look at parameters
tidy(___)

# plot with line
deet_plot + 
  scale_y_continuous(trans="log") +
  stat_smooth(method=lm, formula=y~x)

You might want to plot the nonlinear curve on the original scale. This doesn’t work great in ggplot2 for transforms of the y variable. But, we can have broom::augment() come to our rescue again and do the work of back-transforming the fitted values and CIs.

# make a data frame for the curve, then add y values and CI
log_curve <- data_grid(deet,
          dose = seq_range(dose, n = 100)) |>
  augment(x = deet_mod_log, 
          data = ., 
          interval = "confidence") |>
  mutate(across(c(.fitted:.upper), exp))

# now plot
deet_plot +
  geom_line(data = log_curve,
            mapping = aes(y = .fitted)) +
  geom_ribbon(data = log_curve,
              mapping = aes(y = .fitted,
                  ymin = .lower,
                  ymax = .upper),
              alpha = 0.5)

As I mentioned in class, this doesn’t always work well if we have 0 values. We might want the inverse hyperbolic sine - see Bellemare and Wichman. Remember

\(asinh(x) = log(x + \sqrt{x^2 + 1}\)

Which, hey, if x = 0, asinh(0) = 0. Further, the coefficients have approximately the same meaning - \(exp(\beta_1)-1 \approx\) percent change in y for 1 unit change in x (this is called a semi-elasticity). Let’s look at the difference between the two transformations.

tidy(deet_mod_log)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    1.48     0.147      10.1  1.22e-13
2 dose          -0.184    0.0402     -4.57 3.25e- 5

lm(asinh(bites) ~ dose, data = deet) |>
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    2.15     0.129      16.7  4.47e-22
2 dose          -0.162    0.0353     -4.58 3.14e- 5

Close-ish, no? Take a look at check_model() and let me know what you see.

Long-Lived species and Home Ranges

Do species with larger home ranges have higher mortality rates in zoos? Let’s test this!

zoo <- read.csv("lab/data/17q02ZooMortality Clubb and Mason 2003 replica.csv")

# initial visualization to determine if lm is appropriate
zoo_plot <- ggplot(data=___, aes(x=mortality, y=homerange)) + 
  ___()

___

# fit the model
zoo_mod <- lm(___, data=___)

#assumptions
_____(_______)

#look at of parameters
___(___)

#plot with line - is there a transform you want?
zoo_plot + 
  stat_smooth(method=___, formula=___)

Nonlinear home-ranges

That definitely wasn’t linear. Look at that outlier! Let’s log our y and see how things change.

#fit the model
zoo_mod_log <- lm(___(___) ~ ___, ___=___)

# assumptions
_____(_______)

# look at of parameters
___(___)

#plot with line
zoo_plot + 
  scale_y_continuous(trans="___")+
  ___(method=___, formula=___)

Fantastic. As a final exercise, redo the above, but with the asinh() transformation (whose inverse is sinh()) and then plot the fitted curve and CI on the ORIGINAL scale.