Believe it or not, despite all of the complexity under the hood, fitting a linear model in R with least squares is quite simple, with a straightforward workflow.

  1. Load the data.
  2. Visualize the data - just to detect problems and perform a cursory test of assumptions!
  3. Fit the model.
  4. Use the fit model to test assumptions.
  5. Evaluate the model.
  6. Visualize the fit model.

Let’s go through each step with an example using seals. Are older seals larger?

0. Load and visualize the data

Are Older Seals Bigger?

library(dplyr)
library(ggplot2)
theme_set(theme_bw(base_size = 16))

seals <- read.csv("./data/17e8ShrinkingSeals Trites 1996.csv")

seal_base <- ggplot(seals, aes(x=age.days, y=length.cm)) +
  geom_point() +
  theme_grey(base_size=14)

seal_base

Neat data set, no?

Now, looking at this from the get-go, we can see it’s likely nonlinear. Maybe even non-normal! Let’s ignore that for now, as it will make the results great fodder for our diagnostics!

1. Fitting a model

OK, so, you have the data. And in your model, you want to see how age is a predictor of length. How do you fit it?

Nothing could be simpler. R has a function called lm() which stands for linear model. It works just like base plot, with the y ~ x formula syntax, and it creates a fit model object.

seal_lm <- lm(length.cm ~ age.days, data=seals)

That’s it.

Now, if we want to just peek at the fit before going forward, we can use coef(), which is pretty standard across all R fit model functions.

coef(seal_lm)
 (Intercept)     age.days 
1.157668e+02 2.370559e-03 

But that’s getting ahead of ourselves…

2. Evaluating Assumptions

First, simulating! R provides a method for pulling the fit from the model and simulating new observations with residual error. Now, granted, we really want to also incorporate error in the coefficients - which you can do - but…. gimme time to write a package that does that (I haven’t found one yet, though one rough approach is sketched after the plot below). But, to simulate from our model with residual error:

seal_sims <- simulate(seal_lm, nsim = 100)

So, what’s in this object?

dim(seal_sims)
[1] 9665  100

Ah, so, 100 sims as columns. But to plot, we want them as rows. Oh wait - a job for pivot!

library(tidyr)

seal_sims <- seal_sims %>%
  pivot_longer(
    cols = everything(),
    names_to = "sim",
    values_to = "length.cm"
  )

We can now use this to plot the distribution of length in our data versus our model.

ggplot() +
  geom_density(data = seal_sims,
               aes(x = length.cm, group = sim),
               lwd = 0.02) +
  geom_density(data = seals, 
               aes(x = length.cm),
               color = "blue", lwd = 2)

NOICE!
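As an aside, here is that rough sketch of folding in coefficient error: draw plausible coefficient vectors from the multivariate normal implied by the estimated coefficients and their covariance matrix. This is just one sketch of the idea, not what simulate() does under the hood.

# Sketch: 100 plausible (intercept, slope) pairs drawn from the
# fit's estimated sampling distribution. MASS is called with ::
# so it doesn't mask dplyr::select().
coef_draws <- MASS::mvrnorm(100,
                            mu = coef(seal_lm),
                            Sigma = vcov(seal_lm))

head(coef_draws)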

R provides a one-stop shop for evaluating all other model assumptions: fit model objects can typically be plotted. Now, it uses base plot, so we’ll use the par function to set up a 2x2 plotting area.

par(mfrow = c(2,2)) #2 rows, 2 columns

plot(seal_lm)

par(mfrow=c(1,1)) #reset back to 1 panel

Whoah - that’s a lot! And, there’s no figure with Cook’s D or a histogram of residuals.

OK, breathe.

plot.lm() actually generates even more plots than shown here. You can specify which plot you want with the which argument, but you will need to look at ?plot.lm to know just what to ask for.
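For instance, here is a quick sketch that marches through all six diagnostic plots plot.lm() knows about in one grid:

par(mfrow = c(3, 2)) #3 rows, 2 columns

plot(seal_lm, which = 1:6) #all six diagnostics listed in ?plot.lm

par(mfrow = c(1, 1)) #reset back to 1 panel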

I have five plots I really like to look at - four of which plot.lm() will generate. First of those four is fitted versus residuals:

plot(seal_lm, which=1)

Note the curvature of the line? Troubling, or a high n?

Next, a QQ plot of the residuals:

plot(seal_lm, which=2)

Not bad!

Then, the Cook’s D values:

plot(seal_lm, which=4)

All quite small!

And last, leverage:

plot(seal_lm, which=5)

I also like to look at the histogram of residuals. There is a function called residuals() that will work on nearly any fit model object in R. So we can just…

hist(residuals(seal_lm))

Note, there’s also a library called modelr which can add the appropriate residuals to your data frame using a dplyr-like syntax.

library(modelr)

seals <- seals %>%
  add_residuals(seal_lm)

head(seals)
  age.days length.cm      resid
1     5337       131  2.5815746
2     2081       123  2.3001156
3     2085       122  1.2906334
4     4299       136 10.0422152
5     2861       122 -0.5489206
6     5052       131  3.2571840

Check out that new column. You can now plot your predictor versus residuals, which should show no trend - something a spline fit with stat_smooth makes easy to check.

qplot(age.days, resid, data=seals) +
  stat_smooth()

And you can also add fitted values and look at fitted versus residuals.

seals <- seals %>%
  add_predictions(seal_lm)

qplot(pred, resid, data=seals) +
  stat_smooth(method="lm")

Finally, if you missed ggplot2 in all of this, the package {ggfortify} provides an autoplot method to make these same plots, using the same which argument.

library(ggfortify)
autoplot(seal_lm, ncol = 1, which = 1)

3. Putting it to the test

OK, ok, everything looks fine. Now, how do we test our model?

First, F-tests! R has a base method called anova which - well, it’s for analysis of variance, partitioning variation among sources, but it will really take on a wide variety of forms as we go forward. For now, it will produce F tables for us.

anova(seal_lm)
Analysis of Variance Table

Response: length.cm
            Df Sum Sq Mean Sq F value    Pr(>F)    
age.days     1  90862   90862  2815.8 < 2.2e-16 ***
Residuals 9663 311807      32                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Boom! P values! And they are low. Simple, no?

For more information - t tests, R2, and more - we can use summary(), again a function that is a go-to for nearly any fit model.

summary(seal_lm)

Call:
lm(formula = length.cm ~ age.days, data = seals)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.8700  -3.8279   0.0304   3.7541  21.6874 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.158e+02  1.764e-01  656.37   <2e-16 ***
age.days    2.371e-03  4.467e-05   53.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.681 on 9663 degrees of freedom
Multiple R-squared:  0.2256,    Adjusted R-squared:  0.2256 
F-statistic:  2816 on 1 and 9663 DF,  p-value: < 2.2e-16

This is a lot of information to drink in - function call, distribution of residuals, coefficient t-tests, and multiple pieces of information about total fit.

We may want to get this information in a more condensed form for use in other contexts - particularly to compare against other models. For that, there’s a wonderful package called broom that sweeps up your model into easily digestible pieces.

First, the coefficient table - let’s make it pretty.

library(broom)

tidy(seal_lm)
# A tibble: 2 x 5
  term         estimate std.error statistic p.value
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept) 116.      0.176         656.        0
2 age.days      0.00237 0.0000447      53.1       0

Nice.

We can also do this for the F-test.

tidy(anova(seal_lm))
# A tibble: 2 x 6
  term         df   sumsq  meansq statistic p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>   <dbl>
1 age.days      1  90862. 90862.      2816.       0
2 Residuals  9663 311807.    32.3       NA       NA

If we want to get information about overall fit, there’s glance().

glance(seal_lm)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1     0.226         0.226  5.68     2816.       0     1 -30502. 61009. 61031.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
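Because glance() returns a one-row tibble, comparing models is just a matter of stacking rows. A quick sketch - the second model here is invented purely for illustration:

#hypothetical second model, just to show the comparison pattern
seal_lm_log <- lm(length.cm ~ log(age.days), data = seals)

bind_rows(linear = glance(seal_lm),
          log_x  = glance(seal_lm_log),
          .id = "model")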

4. Visualization

Lovely! Now, how do we visualize the fit and fit prediction error?

In ggplot2 we can use the smoother, stat_smooth in conjunction with method = "lm" to get the job done.

seal_fit_plot <- ggplot(data=seals) +
  aes(x=age.days, y=length.cm) +
  geom_point() +
  stat_smooth(method="lm")

seal_fit_plot

Note - you cannot see the fit interval because our SE is so small with such a large N.
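If you want to convince yourself the ribbon really is there, you can zoom way in with coord_cartesian(). The window below is made up purely for illustration:

#zoom into a small slice of the data - the thin SE ribbon
#around the fit line becomes visible at this scale
seal_fit_plot +
  coord_cartesian(xlim = c(3000, 3100), ylim = c(122, 124))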

What about prediction intervals? That’s a bit harder, and we need to dig into the specific prediction function for lms, predict(). This function is very useful, as you can use it to generate new predicted values, and it will give you error as well. Now, modelr::add_predictions will give you values, but no error yet. YET.
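As another aside, broom’s augment() can also produce intervals, assuming a recent broom version in which augment.lm() accepts an interval argument (check ?augment.lm for your install):

#a sketch assuming broom >= 0.7.0 or so; broom is already loaded
#above, and this returns .fitted plus .lower and .upper columns
augment(seal_lm,
        newdata = data.frame(age.days = c(1000, 3000, 5000)),
        interval = "prediction")

But let’s do it by hand with predict() first.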

So how does predict() work? We begin with a new data frame that has columns matching the predictors in our model, and hand it to predict() as the newdata argument. As the data frame is not the first argument, we can’t use pipes here. Booo.

We then cbind the predictors and predicted values - along with the interval we’ve created. We’ll also do some renaming, simply to make it easier to use this new data in our ggplot coming up. Note, there are also arguments to get intervals other than the 95% default - see the sketch after the output below.

predFrame <- data.frame(age.days = min(seals$age.days):max(seals$age.days))
predVals <- predict(seal_lm, newdata=predFrame, interval="prediction")

predFrame <- cbind(predFrame, predVals) %>%
  rename(length.cm = fit)

head(predFrame)
  age.days length.cm      lwr      upr
1      958  118.0377 106.8996 129.1759
2      959  118.0401 106.9019 129.1783
3      960  118.0425 106.9043 129.1807
4      961  118.0449 106.9067 129.1830
5      962  118.0472 106.9090 129.1854
6      963  118.0496 106.9114 129.1878
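As for those other intervals: predict() takes a level argument, so, for example, a 99% prediction interval is just:

#intervals other than the default 95%: pass level to predict()
predict(seal_lm, newdata = data.frame(age.days = 1000:1005),
        interval = "prediction", level = 0.99)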

We can then add this to our plot using geom_ribbon, which takes ymin and ymax aesthetics to generate a ribbon - like the fit standard error ribbon.

seal_fit_plot +
  stat_smooth(method="lm", color="blue") +
  geom_ribbon(data=predFrame, mapping=aes(x=age.days, ymin=lwr, ymax=upr),
              fill="grey", alpha=0.6) +
  theme_bw()

Now we can better see the prediction 95% interval - and that we do have some points that fall outside of it.

5. Faded Examples

A Fat Model

First, the relationship between how lean you are and how quickly you lose fat. Implement this to get a sense of the general workflow for analysis.

library(ggplot2)
fat <- read.csv("./data/17q04BodyFatHeatLoss Sloan and Keatinge 1973 replica.csv")

#initial visualization to determine if lm is appropriate
fat_plot <- ggplot(data=fat, aes(x=leanness, y=lossrate)) + 
  geom_point()
fat_plot

fat_mod <- lm(lossrate ~ leanness, data=fat)

#assumptions
simulate(fat_mod, nsim = 100) %>%
  pivot_longer(cols = everything(), 
               names_to = "sim", values_to = "lossrate") %>%
  ggplot(aes(x = lossrate)) +
  geom_density(aes(group = sim), lwd = 0.2) +
  geom_density(data = fat, color = "blue", lwd = 2)

plot(fat_mod, which=1)
plot(fat_mod, which=2)

#f-tests of model
anova(fat_mod)

#t-tests of parameters
summary(fat_mod)

#plot with line
fat_plot + 
  stat_smooth(method=lm, formula=y~x)

An Itchy Followup

For your first faded example, let’s look at the relationship between DEET and mosquito bites.

deet <- read.csv("./data/17q24DEETMosquiteBites.csv")

deet_plot <- ggplot(data=___, aes(x=dose, y=bites)) + 
  geom_point()

deet_plot

deet_mod <- lm(bites ~ dose, data=deet)

#assumptions
simulate(___, nsim = 100) %>%
  pivot_longer(cols = everything(), 
               names_to = "sim", values_to = "___") %>%
  ggplot(aes(x = ___)) +
  geom_density(aes(group = sim), lwd = 0.2) +
  geom_density(data = ___, color = "blue", lwd = 2)


plot(___, which=1)
plot(___, which=2)

#f-tests of model
anova(___)

#t-tests of parameters
summary(___)

#plot with line
deet_plot + 
  stat_smooth(method=lm, formula=y~x)

6. Log-Transformation for Nonlinearity

One of the most common reasons for a linear model not to fit is nonlinearity in the data generating process. Often in nature we encounter exponential processes with a log-normal error structure. This is common in count data. Now, really, such a variable is often Poisson distributed, but, to a first approximation, log-transformation can often help fix what ails your model and give you reasonable estimates for your tests. We’ll talk later in the course about why this isn’t the best idea, and why you should start with a nonlinear/non-normal model to begin with.
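To see why the log helps, here is a minimal simulated sketch (all numbers invented): an exponential process with multiplicative log-normal error becomes a straight line with normal error once you log the response.

#y = a * exp(b*x) * lognormal error, so
#log(y) = log(a) + b*x + normal error - a linear model again
set.seed(31415)
x <- runif(200, 0, 10)
y <- 2 * exp(0.3 * x) * exp(rnorm(200, sd = 0.4))

par(mfrow = c(1, 2))
plot(x, y, main = "raw: curved")
plot(x, log(y), main = "logged: linear")
par(mfrow = c(1, 1))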

Let’s practice the workflow of how we handle this log transformation.

Was that relationship linear?

We might suspect that the relationship was nonlinear. Let’s see how a simple log transform works here. Note the modifications to model fitting and stat_smooth.

deet_mod_log <- lm(log(bites) ~ dose, data=deet)

#assumptions
simulate(___, nsim = 100) %>%
  pivot_longer(cols = everything(), 
               names_to = "sim", values_to = "log_bites") %>%
  mutate(bites = exp(log_bites)) %>%
  ggplot(aes(x = ___)) +
  geom_density(aes(group = sim), lwd = 0.2) +
  geom_density(data = ___, color = "blue", lwd = 2)

plot(___, which=1)
plot(___, which=2)

#f-tests of model
anova(___)

#t-tests of parameters
summary(___)

#plot with line
deet_plot + 
  scale_y_continuous(trans="log") +
  stat_smooth(method=lm, formula=y~x)

Long-Lived Species and Home Ranges

Do longer lived species also have larger home ranges? Let’s test this!

zoo <- read.csv("./data/17q02ZooMortality Clubb and Mason 2003 replica.csv")

zoo_plot <- ggplot(data=___, aes(x=mortality, y=homerange)) + 
  ___()

___

zoo_mod <- lm(___, data=___)

#assumptions
simulate(___, nsim = 100) %>%
  ___(cols = ___(), 
               names_to = "___", values_to = "___") %>%
  ggplot(aes(x = ___)) +
  ___(aes(group = ___), lwd = 0.2) +
  ___(data = ___, color = "blue", lwd = 2)

plot(___, which=1)
plot(___, which=2)

#f-tests of model
anova(___)

#t-tests of parameters
summary(___)

#plot with line
zoo_plot + 
  stat_smooth(method=___, formula=___)

Nonlinear Home Ranges

That definitely wasn’t linear. Look at that outlier! Let’s log our y and see how things change.

zoo_mod_log <- lm(log(___) ~ ___, ___=___)

#assumptions
____(___, nsim = 100) %>%
  ___(cols = ___(), 
               names_to = "___", values_to = "___") %>%
  mutate(____ = ___(____)) %>%
  ggplot(aes(x = ___)) +
  ___(aes(____ = ___), lwd = 0.2) +
  ___(data = ___, color = "blue", lwd = 2)
  
___(___)
___(___)

#f-tests of model
___(___)

#t-tests of parameters
___(___)

#plot with line
zoo_plot + 
  scale_y_continuous(trans="___")+
  ___(method=___, formula=___)