Believe it or not, despite all of the complexity under the hood, fitting a linear model in R with least squares is quite simple, with a straightforward workflow.
Let’s go through each step with an example of seals. Are older seals larger?
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
seals <- read.csv("./data/06/17e8ShrinkingSeals Trites 1996.csv")
seal_base <- ggplot(seals, aes(x=age.days, y=length.cm)) +
geom_point() +
theme_grey(base_size=14)
seal_base
Neat data set, no?
Now, looking at this from the get-go, we can see it’s likely nonlinear. Maybe even non-normal! Let’s ignore that for now, as it will make the results great fodder for our diagnostics!
OK, so, you have the data. And in your model, you want to see how age is a predictor of length. How do you fit it?
Nothing could be simpler. R has a function called lm()
which stands for linear model. It uses the same y ~ x
formula syntax as base plot. And with it, we create a fit model object.
seal_lm <- lm(length.cm ~ age.days, data=seals)
That’s it.
Now, if we want to just peek at the fit, before going forward, we can use coef()
which is pretty standard across all R fit model functions.
coef(seal_lm)
## (Intercept) age.days
## 1.157668e+02 2.370559e-03
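As an aside - another standard accessor, if you want confidence intervals on those coefficients rather than just point estimates, is confint():
confint(seal_lm)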
But that’s getting ahead of ourselves…
R also provides a one-stop shop for evaluating a model fit. Fit model objects can typically be plotted. Now, this uses base plot, so we'll use the par
function to set up a 2x2 plotting area.
par(mfrow = c(2,2)) #2 rows, 2 columns
plot(seal_lm)
par(mfrow=c(1,1)) #reset back to 1 panel
Whoah - that’s a lot! And, there’s no figure with Cook’s D or a histogram of residuals.
OK, breathe.
plot.lm()
actually generates even more plots than shown here. You can specify which plot you want with the which
argument, but you will need to look at ?plot.lm
to know just what to look at.
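For the curious, here is the mapping of which values (pulled from ?plot.lm):
# plot(seal_lm, which = ...) options
# 1 = residuals vs. fitted      2 = normal Q-Q of residuals
# 3 = scale-location            4 = Cook's distance
# 5 = residuals vs. leverage    6 = Cook's distance vs. leverage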
I have five plots I really like to look at - four of which plot.lm()
will generate. Those four are the fitted versus residuals:
plot(seal_lm, which=1)
Note the curvature of the line? Troubling - or just something we can see because of our high n?
A QQ plot of the residuals
plot(seal_lm, which=2)
Not bad!
The Cook’s D values
plot(seal_lm, which=4)
All quite small!
And last, leverage
plot(seal_lm, which=5)
I also like to look at the histogram of residuals. There is a function called residuals
that will work on nearly any fit model object in R. So we can just…
hist(residuals(seal_lm))
Note, there’s also a library called modelr which can add the appropriate residuals to your data frame using a dplyr-like syntax.
library(modelr)
seals <- seals %>%
add_residuals(seal_lm)
head(seals)
## age.days length.cm resid
## 1 5337 131 2.5815746
## 2 2081 123 2.3001156
## 3 2085 122 1.2906334
## 4 4299 136 10.0422152
## 5 2861 122 -0.5489206
## 6 5052 131 3.2571840
Check out that new column. You can now plot your predictor versus residuals, which should show no trend - and you can add a spline with stat_smooth to check for one.
qplot(age.days, resid, data=seals) +
stat_smooth()
And you can also add fitted values and look at fitted versus residual.
seals <- seals %>%
add_predictions(seal_lm)
qplot(pred, resid, data=seals) +
stat_smooth(method="lm")
OK, OK, everything looks fine. Now, how do we test our model?
First, F-tests! R has a base method called anova
which - well, it's for analysis of variance, partitioning sums of squares, but it will really take on a wide variety of forms as we go forward. For now, it will produce F tables for us.
anova(seal_lm)
## Analysis of Variance Table
##
## Response: length.cm
## Df Sum Sq Mean Sq F value Pr(>F)
## age.days 1 90862 90862 2815.8 < 2.2e-16 ***
## Residuals 9663 311807 32
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Boom! P values! And they are low. Simple, no?
For more information - t-tests, R², and more - there's summary()
- again, a function that is a go-to for nearly any fit model.
summary(seal_lm)
##
## Call:
## lm(formula = length.cm ~ age.days, data = seals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8700 -3.8279 0.0304 3.7541 21.6874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.158e+02 1.764e-01 656.37 <2e-16 ***
## age.days 2.371e-03 4.467e-05 53.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.681 on 9663 degrees of freedom
## Multiple R-squared: 0.2256, Adjusted R-squared: 0.2256
## F-statistic: 2816 on 1 and 9663 DF, p-value: < 2.2e-16
This is a lot of information to drink in - function call, distribution of residuals, coefficient t-tests, and multiple pieces of information about total fit.
We may want to get this information in a more condensed form for use in other contexts - particularly to compare against other models. For that, there's a wonderful package called broom
that sweeps up your model into easily digestible pieces.
First, the coefficient table - let’s make it pretty.
library(broom)
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
tidy(seal_lm)
## term estimate std.error statistic p.value
## 1 (Intercept) 1.157668e+02 1.763751e-01 656.3668 0
## 2 age.days 2.370559e-03 4.467317e-05 53.0645 0
Nice.
We can also do this for the F-test.
tidy(anova(seal_lm))
## term df sumsq meansq statistic p.value
## 1 age.days 1 90861.81 90861.80648 2815.841 0
## 2 Residuals 9663 311806.52 32.26809 NA NA
If we want to get information about fit, there’s glance()
glance(seal_lm)
## r.squared adj.r.squared sigma statistic p.value df logLik AIC
## 1 0.2256493 0.2255691 5.680501 2815.841 0 2 -30501.53 61009.05
## BIC deviance df.residual
## 1 61030.58 311806.5 9663
Lovely! Now, how do we visualize the fit and its prediction error?
In ggplot2
we can use the smoother, stat_smooth
in conjunction with method = "lm"
to get the job done.
seal_fit_plot <- ggplot(data=seals) +
aes(x=age.days, y=length.cm) +
geom_point() +
stat_smooth(method="lm")
seal_fit_plot
Note - you cannot see the fit interval because our SE is so small with such a large N.
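If you want to convince yourself that the fit (confidence) interval really is there, a minimal sketch: ask predict() for the confidence interval on the fit, and note how tightly lwr and upr hug it.
fit_ci <- predict(seal_lm, interval="confidence")
head(fit_ci) # lwr and upr sit a hair's breadth from fit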
What about prediction intervals? That's a bit harder, and we need to dig into the specific prediction function for lms, predict()
. This function is very useful, as you can use it to generate new predicted values - and it will give you error as well. Now, modelr::add_predictions
will give you values, but no error yet. YET.
So how does predict() work? We begin with a new data frame that has columns matching the predictors in our model, then feed it to predict() to get fitted values and an interval. As the data frame is not the first argument to predict(), we can't use pipes here. Booo.
We then cbind the predictors and predicted values - along with the interval we've created. We'll also do some renaming, simply to make it easier to use this new data in our upcoming ggplot. Note, there are also arguments to get intervals other than the default 95% (see the sketch after the output below).
predFrame <- data.frame(age.days = min(seals$age.days):max(seals$age.days))
predVals <- predict(seal_lm, newdata=predFrame, interval="prediction")
predFrame <- cbind(predFrame, predVals) %>%
rename(length.cm = fit)
head(predFrame)
## age.days length.cm lwr upr
## 1 958 118.0377 106.8996 129.1759
## 2 959 118.0401 106.9019 129.1783
## 3 960 118.0425 106.9043 129.1807
## 4 961 118.0449 106.9067 129.1830
## 5 962 118.0472 106.9090 129.1854
## 6 963 118.0496 106.9114 129.1878
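The level argument is how you ask for those other interval widths. A minimal sketch for a 90% prediction interval instead of the default 95%:
predVals90 <- predict(seal_lm, newdata=predFrame, interval="prediction", level=0.90)
head(predVals90)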
We can then add this to our plot using the geom_ribbon
which takes a ymin
and ymax
argument to generate a ribbon - like the fit standard error ribbon.
seal_fit_plot +
stat_smooth(method="lm", color="blue") +
geom_ribbon(data=predFrame, mapping=aes(x=age.days, ymin=lwr, ymax=upr),
fill="grey", alpha=0.6) +
theme_bw()
Now we can better see the 95% prediction interval - and that we do have some points that fall outside of it.
First, the relationship between how lean you are and how quickly you lose heat. Implement this to get a sense of the general workflow for analysis.
library(ggplot2)
fat <- read.csv("./data/06/17q04BodyFatHeatLoss Sloan and Keatinge 1973 replica.csv")
#initial visualization to determine if lm is appropriate
fat_plot <- ggplot(data=fat, aes(x=leanness, y=lossrate)) +
geom_point()
fat_plot
fat_mod <- lm(lossrate ~ leanness, data=fat)
#assumptions
plot(fat_mod, which=1)
plot(fat_mod, which=2)
#f-tests of model
anova(fat_mod)
#t-tests of parameters
summary(fat_mod)
#plot with line
fat_plot +
stat_smooth(method=lm, formula=y~x)
For your first faded example, let’s look at the relationship between DEET and mosquito bites.
deet <- read.csv("./data/06/17q24DEETMosquiteBites.csv")
deet_plot <- ggplot(data=___, aes(x=dose, y=bites)) +
geom_point()
deet_plot
deet_mod <- lm(bites ~ dose, data=deet)
#assumptions
plot(___, which=1)
plot(___, which=2)
#f-tests of model
anova(___)
#t-tests of parameters
summary(___)
#plot with line
deet_plot +
stat_smooth(method=lm, formula=y~x)
One of the most common reasons for a linear model to not fit is a nonlinearity in the data generating process. Often in nature we encounter exponential processes with a log-normal error structure. This is common in count data. Now, really, such data is often a Poisson distributed variable, but, to a first approximation, log-transformation can often help fix what ails your model and give you reasonable estimates for your tests. We'll talk later in the course about why this isn't the best idea, and why you should start with a nonlinear/non-normal model to begin with.
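To see why this works, here's a minimal sketch with made-up numbers (the coefficients 1 and 0.5 are arbitrary): an exponential process with log-normal error becomes a linear process with normal error once you take the log.
# simulate an exponential process with log-normal error
set.seed(31)
x <- runif(100, 0, 5)
y <- exp(1 + 0.5*x) * exp(rnorm(100, 0, 0.3))
# on a log scale, lm() recovers the parameters (~1 and ~0.5)
coef(lm(log(y) ~ x))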
Let’s practice the workflow of how we handle this log transformation.
We might suspect that the relationship was nonlinear. Let’s see how a simple log transform works here. Note the modifications to model fitting and stat_smooth
.
deet_mod_log <- lm(log(bites) ~ dose, data=deet)
#assumptions
plot(___, which=1)
plot(___, which=2)
#f-tests of model
anova(___)
#t-tests of parameters
summary(___)
#plot with line
deet_plot +
scale_y_continuous(trans="log") +
stat_smooth(method=lm, formula=y~x)
Do longer lived species also have larger home ranges? Let’s test this!
zoo <- read.csv("./data/06/17q02ZooMortality Clubb and Mason 2003 replica.csv")
zoo_plot <- ggplot(data=___, aes(x=mortality, y=homerange)) +
___()
___
zoo_mod <- lm(___, data=___)
#assumptions
plot(___, which=1)
plot(___, which=2)
#f-tests of model
anova(___)
#t-tests of parameters
summary(___)
#plot with line
zoo_plot +
stat_smooth(method=___, formula=___)
That definitely wasn’t linear. Look at that outlier! Let’s log our y and see how things change.
zoo_mod_log <- lm(log(___) ~ ___, ___=___)
#assumptions
___(___)
___(___)
#f-tests of model
___(___)
#t-tests of parameters
___(___)
#plot with line
zoo_plot +
scale_y_continuous(trans="___")+
___(method=___, formula=___)
OK, this is great, but what about power? Let’s look at our seal fit again - maybe we want to know what would have happened with a lower sample size?
Note, I’m going to use a knitr function to neaten up the table for markdown. Try it in your own homeworks!
knitr::kable(tidy(seal_lm))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 115.7667505 | 0.1763751 | 656.3668 | 0 |
| age.days | 0.0023706 | 0.0000447 | 53.0645 | 0 |
What about the residual SD?
knitr::kable(glance(seal_lm)[,1:3])
| r.squared | adj.r.squared | sigma |
|---|---|---|
| 0.2256493 | 0.2255691 | 5.6805 |
All right - there are our target values. Now, we can change a lot of things: the effect size (slope), range of values, sigma, and more. But let's try sample size.
The first step of a power analysis is to set up a data frame with all of the different possibilities that you might want to assess for power. For linear regression, the parameters that might vary are the slope, the intercept, the residual standard deviation (sigma), and the sample size.
To do a power analysis via simulation for a linear regression, we begin by building our simulation data frame with the parameters and the information that varies - in this case, sample size.
set.seed(607)
simPopN <- data.frame(slope = 0.00237,
intercept=115.767,
sigma = 5.6805,
n=10:100)
OK, now we need to expand this out, and have some number of samples for each n. For that, we can use a function from tidyr
(the same library as crossing()), expand()
.
library(tidyr)
simPopN <- simPopN %>%
group_by(slope, intercept, sigma, n) %>%
expand(reps = 1:n) %>%
ungroup()
Now, if we want to simulate each of these, say, 500 times, we need to assign unique sim numbers, so for each n and sim number we have a unique data set.
simPopN <- simPopN %>%
crossing(sim = 1:500)
Great - almost ready to go! Now we just need to add in simulated values. Fortunately, as rnorm()
works with vectors, we can just use a mutate here. We'll also need to simulate random draws of ages, but that's just another random number.
simPopN <- simPopN %>%
mutate(age.days = runif(n(), 1000, 8500)) %>%
mutate(length.cm = rnorm(n(), intercept + slope*age.days, sigma))
Yahtzee! Ready to run!
First, we need to generate a lot of fit models. dplyr doesn't take too kindly to columns of fit model objects, so we can use two powerful functions here - first, nest()
and unnest()
allow us to collapse grouped data down into little pieces and re-expand it.
fits <- simPopN %>%
group_by(slope, intercept, sigma, n, sim) %>%
nest()
fits
## # A tibble: 45,500 × 6
## slope intercept sigma n sim data
## <dbl> <dbl> <dbl> <int> <int> <list>
## 1 0.00237 115.767 5.6805 10 1 <tibble [10 × 3]>
## 2 0.00237 115.767 5.6805 10 2 <tibble [10 × 3]>
## 3 0.00237 115.767 5.6805 10 3 <tibble [10 × 3]>
## 4 0.00237 115.767 5.6805 10 4 <tibble [10 × 3]>
## 5 0.00237 115.767 5.6805 10 5 <tibble [10 × 3]>
## 6 0.00237 115.767 5.6805 10 6 <tibble [10 × 3]>
## 7 0.00237 115.767 5.6805 10 7 <tibble [10 × 3]>
## 8 0.00237 115.767 5.6805 10 8 <tibble [10 × 3]>
## 9 0.00237 115.767 5.6805 10 9 <tibble [10 × 3]>
## 10 0.00237 115.767 5.6805 10 10 <tibble [10 × 3]>
## # ... with 45,490 more rows
Second, the map
function in the purrr library allows us to iterate over different levels or grouped data frames and perform some function. In this case, we'll fit a model and get its coefficients using broom
.
fits <- fits %>%
mutate(mod = purrr::map(data, ~lm(length.cm ~ age.days, data=.))) %>%
mutate(coefs = purrr::map(mod, ~tidy(.)))
fits
## # A tibble: 45,500 × 8
## slope intercept sigma n sim data mod
## <dbl> <dbl> <dbl> <int> <int> <list> <list>
## 1 0.00237 115.767 5.6805 10 1 <tibble [10 × 3]> <S3: lm>
## 2 0.00237 115.767 5.6805 10 2 <tibble [10 × 3]> <S3: lm>
## 3 0.00237 115.767 5.6805 10 3 <tibble [10 × 3]> <S3: lm>
## 4 0.00237 115.767 5.6805 10 4 <tibble [10 × 3]> <S3: lm>
## 5 0.00237 115.767 5.6805 10 5 <tibble [10 × 3]> <S3: lm>
## 6 0.00237 115.767 5.6805 10 6 <tibble [10 × 3]> <S3: lm>
## 7 0.00237 115.767 5.6805 10 7 <tibble [10 × 3]> <S3: lm>
## 8 0.00237 115.767 5.6805 10 8 <tibble [10 × 3]> <S3: lm>
## 9 0.00237 115.767 5.6805 10 9 <tibble [10 × 3]> <S3: lm>
## 10 0.00237 115.767 5.6805 10 10 <tibble [10 × 3]> <S3: lm>
## # ... with 45,490 more rows, and 1 more variables: coefs <list>
Last, we clean up - we unnest()
, which expands those nested coefficient tables back out into one data frame. We'll also filter for just the slope.
fits <- fits %>%
unnest(coefs) %>%
ungroup() %>%
filter(term == "age.days")
fits
## # A tibble: 45,500 × 10
## slope intercept sigma n sim term estimate std.error
## <dbl> <dbl> <dbl> <int> <int> <chr> <dbl> <dbl>
## 1 0.00237 115.767 5.6805 10 1 age.days 0.0036983284 0.0005383329
## 2 0.00237 115.767 5.6805 10 2 age.days 0.0031738767 0.0010134023
## 3 0.00237 115.767 5.6805 10 3 age.days 0.0035555894 0.0006222090
## 4 0.00237 115.767 5.6805 10 4 age.days 0.0015086981 0.0008623719
## 5 0.00237 115.767 5.6805 10 5 age.days 0.0030754009 0.0009175490
## 6 0.00237 115.767 5.6805 10 6 age.days 0.0018232261 0.0007620678
## 7 0.00237 115.767 5.6805 10 7 age.days 0.0030549324 0.0008115292
## 8 0.00237 115.767 5.6805 10 8 age.days 0.0021580987 0.0011734079
## 9 0.00237 115.767 5.6805 10 9 age.days 0.0005435461 0.0008337824
## 10 0.00237 115.767 5.6805 10 10 age.days 0.0027293950 0.0004923838
## # ... with 45,490 more rows, and 2 more variables: statistic <dbl>,
## # p.value <dbl>
Notice that we do indeed have p-values, so we can use these fits to get power for each sample size. We now do our normal process - grouping by sample size and summarizing - and then plot the result!
pow <- fits %>%
group_by(n) %>%
summarise(power = 1-sum(p.value>0.05)/n()) %>%
ungroup()
qplot(n, power, data=pow, geom=c("point", "line")) +
theme_bw(base_size=17) +
geom_hline(yintercept=0.8, lty=2)
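If you want a single number out of this curve - say, the smallest sample size that crosses the usual 0.8 threshold - a minimal sketch:
pow %>%
  filter(power >= 0.8) %>%
  summarise(smallest_n = min(n)) # returns Inf with a warning if nothing crosses 0.8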