Believe it or not, despite all of the complexity under the hood, fitting a linear model in R with least squares is quite simple, with a straightforward workflow.
Let’s go through each step with an example - seals. Are older seals larger?
library(dplyr)
library(ggplot2)
theme_set(theme_bw(base_size = 16))
seals <- read.csv("./data/17e8ShrinkingSeals Trites 1996.csv")
seal_base <- ggplot(seals, aes(x=age.days, y=length.cm)) +
geom_point() +
theme_grey(base_size=14)
seal_base
Neat data set, no?
Now, looking at this from the get-go, we can see it’s likely nonlinear. Maybe even non-normal! Let’s ignore that for now, as it will make the results great fodder for our diagnostics!
OK, so, you have the data. And in your model, you want to see how age is a predictor of length. How do you fit it?
Nothing could be simpler. R has a function called lm()
which stands for linear model. It works just like base plot, with the y ~ x
syntax. And we create a fit model object.
seal_lm <- lm(length.cm ~ age.days, data=seals)
That’s it.
Now, if we want to just peek at the fit before going forward, we can use coef()
which is pretty standard across all R fit model functions.
coef(seal_lm)
(Intercept) age.days
1.157668e+02 2.370559e-03
But that’s getting ahead of ourselves…
First, simulating! R provides a method for pulling the fit from the model and simulating different residuals. Now, granted, we really want to also incorporate error in the coefficients - which you can do - but… gimme time to write a package that does that (haven’t found one yet - see here). But, to simulate from our model with residual error:
seal_sims <- simulate(seal_lm, nsim = 100)
So, what’s in this object?
dim(seal_sims)
[1] 9665 100
Ah, so, 100 sims as columns. But to plot, we want them as rows. Oh wait - a job for pivot!
library(tidyr)
seal_sims <- seal_sims %>%
pivot_longer(
cols = everything(),
names_to = "sim",
values_to = "length.cm"
)
We can now use this to plot the distribution of length in our data versus our model.
ggplot() +
geom_density(data = seal_sims,
aes(x = length.cm, group = sim),
lwd = 0.02) +
geom_density(data = seals,
aes(x = length.cm),
color = "blue", lwd = 2)
NOICE!
R provides a one-stop shop for evaluating all other model assumptions: fit model objects can typically be plotted. Now, it uses base plot, so we’ll use the par
function to set up a 2x2 plotting area.
par(mfrow = c(2,2)) #2 rows, 2 columns
plot(seal_lm)
par(mfrow=c(1,1)) #reset back to 1 panel
Whoah - that’s a lot! And, there’s no figure with Cook’s D or a histogram of residuals.
OK, breathe.
plot.lm()
actually generates even more plots than shown here. You can specify what plot you want with the which
argument, but will need to look at ?plot.lm
to know just what to look at.
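As a quick survey, all six diagnostic plots that plot.lm() provides can be drawn at once by passing the full range of which values. A minimal sketch, using the seal_lm object from above:

```r
# draw all six diagnostics from plot.lm() at once (which = 1:6)
par(mfrow = c(2, 3)) # 2 rows, 3 columns
plot(seal_lm, which = 1:6)
par(mfrow = c(1, 1)) # reset back to 1 panel
```

By default, plot() on an lm object shows only which = c(1, 2, 3, 5), so this is how you get at Cook’s D and the rest.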
I have five plots I really like to look at - four of which plot.lm()
will generate. Those four are the fitted versus residuals:
plot(seal_lm, which=1)
Note the curvature of the line? Troubling, or a high n?
A QQ plot of the residuals
plot(seal_lm, which=2)
Not bad!
The Cook’s D values
plot(seal_lm, which=4)
All quite small!
And last, leverage
plot(seal_lm, which=5)
I also like to look at the histogram of residuals. There is a function called residuals
that will work on nearly any fit model object in R. So we can just…
hist(residuals(seal_lm))
Note, there’s also a library called modelr which can add the appropriate residuals to your data frame using a dplyr-like syntax.
library(modelr)
seals <- seals %>%
add_residuals(seal_lm)
head(seals)
age.days length.cm resid
1 5337 131 2.5815746
2 2081 123 2.3001156
3 2085 122 1.2906334
4 4299 136 10.0422152
5 2861 122 -0.5489206
6 5052 131 3.2571840
Check out that new column. You can now plot your predictor versus residuals, which should show no trend; a spline with stat_smooth helps you see any pattern.
qplot(age.days, resid, data=seals) +
stat_smooth()
And you can also add fitted values and look at fitted versus residual.
seals <- seals %>%
add_predictions(seal_lm)
qplot(pred, resid, data=seals) +
stat_smooth(method="lm")
Finally, if you’d rather have these diagnostics in ggplot2, the package {ggfortify} provides an autoplot method to make these plots, using the same which argument.
library(ggfortify)
autoplot(seal_lm, ncol = 1, which = 1)
OK, ok, everything looks fine. Now, how do we test our model?
First, F-tests! R has a base method called anova
which - well, it’s for analysis of variance, partitioning variation into components, but it will take on a wide variety of forms as we go forward. For now, it will produce F tables for us.
anova(seal_lm)
Analysis of Variance Table
Response: length.cm
Df Sum Sq Mean Sq F value Pr(>F)
age.days 1 90862 90862 2815.8 < 2.2e-16 ***
Residuals 9663 311807 32
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Boom! P values! And they are low. Simple, no?
For more information - t-tests of coefficients, R², and more - there’s summary()
- again, a function that is a go-to for nearly any fit model.
summary(seal_lm)
Call:
lm(formula = length.cm ~ age.days, data = seals)
Residuals:
Min 1Q Median 3Q Max
-19.8700 -3.8279 0.0304 3.7541 21.6874
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.158e+02 1.764e-01 656.37 <2e-16 ***
age.days 2.371e-03 4.467e-05 53.06 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.681 on 9663 degrees of freedom
Multiple R-squared: 0.2256, Adjusted R-squared: 0.2256
F-statistic: 2816 on 1 and 9663 DF, p-value: < 2.2e-16
This is a lot of information to drink in - function call, distribution of residuals, coefficient t-tests, and multiple pieces of information about total fit.
We may want to get this information in a more condensed form for use in other contexts - particularly to compare against other models. For that, there’s a wonderful package called broom
that sweeps up your model into easily digestible pieces.
First, the coefficient table - let’s make it pretty.
library(broom)
tidy(seal_lm)
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 116. 0.176 656. 0
2 age.days 0.00237 0.0000447 53.1 0
Nice.
We can also do this for the F-test.
tidy(anova(seal_lm))
# A tibble: 2 x 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 age.days 1 90862. 90862. 2816. 0
2 Residuals 9663 311807. 32.3 NA NA
If we want to get information about fit, there’s glance()
glance(seal_lm)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.226 0.226 5.68 2816. 0 1 -30502. 61009. 61031.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Lovely! Now, how do we visualize the fit and fit prediction error?
In ggplot2
we can use the smoother, stat_smooth
in conjunction with method = "lm"
to get the job done.
seal_fit_plot <- ggplot(data=seals) +
aes(x=age.days, y=length.cm) +
geom_point() +
stat_smooth(method="lm")
seal_fit_plot
Note - you cannot see the fit interval because our SE is so small with such a large N.
What about prediction intervals? That’s a bit harder, and we need to dig into the specific prediction function for lms, predict()
. This function is very useful, as you can use it to generate new predicted values, and it will give you error as well. Now, modelr::add_predictions
will give you values, but no error yet. YET.
So how does predict() work? We begin with a new data frame whose columns match the predictors in our model. As the data frame is not the first argument to predict(), we can’t use pipes here. Booo.
We then cbind the predictors and predicted values - along with the interval we’ve created. We’ll also do some renaming, simply to make it easier to use this new data in our ggplot coming up. Note, there are also arguments to get intervals other than 95%.
predFrame <- data.frame(age.days = min(seals$age.days):max(seals$age.days))
predVals <- predict(seal_lm, newdata=predFrame, interval="prediction")
predFrame <- cbind(predFrame, predVals) %>%
rename(length.cm = fit)
head(predFrame)
age.days length.cm lwr upr
1 958 118.0377 106.8996 129.1759
2 959 118.0401 106.9019 129.1783
3 960 118.0425 106.9043 129.1807
4 961 118.0449 106.9067 129.1830
5 962 118.0472 106.9090 129.1854
6 963 118.0496 106.9114 129.1878
We can then add this to our plot using the geom_ribbon
which takes a ymin
and ymax
argument to generate a ribbon - like the fit standard error ribbon.
seal_fit_plot +
stat_smooth(method="lm", color="blue") +
geom_ribbon(data=predFrame, mapping=aes(x=age.days, ymin=lwr, ymax=upr),
fill="grey", alpha=0.6) +
theme_bw()
Now we can better see the prediction 95% interval - and that we do have some points that fall outside of it.
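As an aside, that 95% is just the default: the level argument to predict() changes the interval width. A quick sketch, reusing the predFrame from above, for a 99% prediction interval:

```r
# widen to a 99% prediction interval via the level argument
predVals_99 <- predict(seal_lm, newdata = predFrame,
                       interval = "prediction", level = 0.99)
head(predVals_99)
```

The same trick works with interval = "confidence" for the fit interval.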
First, the relationship between how lean you are and how quickly you lose fat. Implement this to get a sense of the general workflow for analysis.
library(ggplot2)
fat <- read.csv("./data/17q04BodyFatHeatLoss Sloan and Keatinge 1973 replica.csv")
#initial visualization to determine if lm is appropriate
fat_plot <- ggplot(data=fat, aes(x=leanness, y=lossrate)) +
geom_point()
fat_plot
fat_mod <- lm(lossrate ~ leanness, data=fat)
#assumptions
simulate(fat_mod, nsim = 100) %>%
pivot_longer(cols = everything(),
names_to = "sim", values_to = "lossrate") %>%
ggplot(aes(x = lossrate)) +
geom_density(aes(group = sim), lwd = 0.2) +
geom_density(data = fat, color = "blue", lwd = 2)
plot(fat_mod, which=1)
plot(fat_mod, which=2)
#f-tests of model
anova(fat_mod)
#t-tests of parameters
summary(fat_mod)
#plot with line
fat_plot +
stat_smooth(method=lm, formula=y~x)
For your first faded example, let’s look at the relationship between DEET and mosquito bites.
deet <- read.csv("./data/17q24DEETMosquiteBites.csv")
deet_plot <- ggplot(data=___, aes(x=dose, y=bites)) +
geom_point()
deet_plot
deet_mod <- lm(bites ~ dose, data=deet)
#assumptions
simulate(___, nsim = 100) %>%
pivot_longer(cols = everything(),
names_to = "sim", values_to = "___") %>%
ggplot(aes(x = ___)) +
geom_density(aes(group = sim), lwd = 0.2) +
geom_density(data = ___, color = "blue", lwd = 2)
plot(___, which=1)
plot(___, which=2)
#f-tests of model
anova(___)
#t-tests of parameters
summary(___)
#plot with line
deet_plot +
stat_smooth(method=lm, formula=y~x)
One of the most common reasons for a linear model to not fit is a nonlinearity in the data generating process. Often in nature we encounter exponential processes with a log-normal error structure. This is common in count data. Now, really, it’s often a Poisson-distributed variable, but, to a first approximation, log transformation can often help fix what ails your model and give you reasonable estimates for your tests. We’ll talk later in the course about why this isn’t the best idea, and why you should start with a nonlinear/non-normal model to begin with.
Let’s practice the workflow of how we handle this log transformation.
We might suspect that the relationship was nonlinear. Let’s see how a simple log transform works here. Note the modifications to model fitting and stat_smooth
.
deet_mod_log <- lm(log(bites) ~ dose, data=deet)
#assumptions
simulate(___, nsim = 100) %>%
pivot_longer(cols = everything(),
names_to = "sim", values_to = "log_bites") %>%
mutate(bites = exp(log_bites)) %>%
ggplot(aes(x = ___)) +
geom_density(aes(group = sim), lwd = 0.2) +
geom_density(data = ___, color = "blue", lwd = 2)
plot(___, which=1)
plot(___, which=2)
#f-tests of model
anova(___)
#t-tests of parameters
summary(___)
#plot with line
deet_plot +
scale_y_continuous(trans="log") +
stat_smooth(method=lm, formula=y~x)
Do longer lived species also have larger home ranges? Let’s test this!
zoo <- read.csv("./data/17q02ZooMortality Clubb and Mason 2003 replica.csv")
zoo_plot <- ggplot(data=___, aes(x=mortality, y=homerange)) +
___()
___
zoo_mod <- lm(___, data=___)
#assumptions
simulate(___, nsim = 100) %>%
___(cols = ___(),
names_to = "___", values_to = "___") %>%
ggplot(aes(x = ___)) +
___(aes(group = ___), lwd = 0.2) +
___(data = ___, color = "blue", lwd = 2)
plot(___, which=1)
plot(___, which=2)
#f-tests of model
anova(___)
#t-tests of parameters
summary(___)
#plot with line
zoo_plot +
stat_smooth(method=___, formula=___)
That definitely wasn’t linear. Look at that outlier! Let’s log our y and see how things change.
zoo_mod_log <- lm(log(___) ~ ___, ___=___)
#assumptions
____(___, nsim = 100) %>%
___(cols = ___(),
names_to = "___", values_to = "___") %>%
mutate(____ = ___(____)) %>%
ggplot(aes(x = ___)) +
___(aes(____ = ___), lwd = 0.2) +
___(data = ___, color = "blue", lwd = 2)
___(___)
___(___)
#f-tests of model
___(___)
#t-tests of parameters
___(___)
#plot with line
zoo_plot +
scale_y_continuous(trans="___")+
___(method=___, formula=___)