Believe it or not, despite all of the complexity under the hood, fitting a linear model in R with least squares is quite simple with a straightfoward workflow.

  1. Load the data
  2. Visualize the data - just to detect problems and perform a cursory test of assumptions!
  3. Fit the model.
  4. Use the fit model to test assumptions
  5. Evaluate the model
  6. Visualize the fit model

Let’s go through each step with an example of seals. Are older seals larger?

0. Load and visualize the data

Are Older Seals Bigger?

## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##     filter, lag
## The following objects are masked from 'package:base':
##     intersect, setdiff, setequal, union

seals <- read.csv("./data/06/17e8ShrinkingSeals Trites 1996.csv")

seal_base <- ggplot(seals, aes(x=age.days, +
  geom_point() +


Neat data set, no?

Now, looking at this from the get-go, we can see it’s likely nonlinear. Maybe even non-normal! Let’s ignore that for now, as it will make the results great fodder for our diagnostics!

1. Fitting a model

OK, so, you have the data. And in your model, you want to see how age is a predictor of length. How do you fit it?

Nothing could be simpler. R has a function called lm() which stands for linear model. It works just like base plot, with the y ~ x syntax. And we create a fit model object.

seal_lm <- lm( ~ age.days, data=seals)

That’s it.

Now, if we want to just peak at the fit, before going forward, we can use coef() which is pretty standard across all R fit model functions.

##  (Intercept)     age.days 
## 1.157668e+02 2.370559e-03

But that’s getting ahead of ourselves…

2. Evaluating Assumptions

R also provides a 1-stop shop for evaluating functions. Fit model objects can typically be plotted. Now, it uses base plot, so, we’ll use the par function to setup a 2x2 plotting area.

par(mfrow = c(2,2)) #2 rows, 2 columns