Believe it or not, despite all of the complexity under the hood, fitting a linear model in R with least squares is quite simple, with a straightforward workflow.
Let’s go through each step with an example of seals. Are older seals larger?
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
seals <- read.csv("./data/06/17e8ShrinkingSeals Trites 1996.csv")
seal_base <- ggplot(seals, aes(x=age.days, y=length.cm)) +
geom_point() +
theme_grey(base_size=14)
seal_base
Neat data set, no?
Now, looking at this from the get-go, we can see it’s likely nonlinear. Maybe even non-normal! Let’s ignore that for now, as it will make the results great fodder for our diagnostics!
OK, so, you have the data. And in your model, you want to see how age is a predictor of length. How do you fit it?
Nothing could be simpler. R has a function called lm()
which stands for linear model. It uses the same y ~ x
formula syntax as base plot. And with it, we create a fit model object.
seal_lm <- lm(length.cm ~ age.days, data=seals)
That’s it.
Now, if we want to just peek at the fit, before going forward, we can use coef()
which is pretty standard across all R fit model functions.
coef(seal_lm)
## (Intercept) age.days
## 1.157668e+02 2.370559e-03
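As an aside - another standard accessor, if you want confidence intervals on those coefficients rather than just point estimates, is confint():
confint(seal_lm)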
But that’s getting ahead of ourselves…
R also provides a one-stop shop for evaluating a model fit. Fit model objects can typically be plotted. Now, this uses base plot, so we'll use the par
function to set up a 2x2 plotting area.
par(mfrow = c(2,2)) #2 rows, 2 columns
plot(seal_lm)
par(mfrow=c(1,1)) #reset back to 1 panel
Whoah - that’s a lot! And, there’s no figure with Cook’s D or a histogram of residuals.
OK, breathe.
plot.lm()
actually generates even more plots than shown here. You can specify which plot you want with the which
argument, but you will need to look at ?plot.lm
to know just what to look at.
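For the curious, here is the mapping of which values (pulled from ?plot.lm):
# plot(seal_lm, which = ...) options
# 1 = residuals vs. fitted      2 = normal Q-Q of residuals
# 3 = scale-location            4 = Cook's distance
# 5 = residuals vs. leverage    6 = Cook's distance vs. leverage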
I have five plots I really like to look at - four of which plot.lm()
will generate. Those four are the fitted versus residuals:
plot(seal_lm, which=1)
Note the curvature of the line? Troubling - or just something we can see because of our high n?
A QQ plot of the residuals
plot(seal_lm, which=2)
Not bad!
The Cook’s D values
plot(seal_lm, which=4)
All quite small!
And last, leverage
plot(seal_lm, which=5)
I also like to look at the histogram of residuals. There is a function called residuals
that will work on nearly any fit model object in R. So we can just…
hist(residuals(seal_lm))
Note, there’s also a library called modelr which can add the appropriate residuals to your data frame using a dplyr-like syntax.
library(modelr)
seals <- seals %>%
add_residuals(seal_lm)
head(seals)
## age.days length.cm resid
## 1 5337 131 2.5815746
## 2 2081 123 2.3001156
## 3 2085 122 1.2906334
## 4 4299 136 10.0422152
## 5 2861 122 -0.5489206
## 6 5052 131 3.2571840
Check out that new column. You can now plot your predictor versus residuals, which should show no trend - and you can add a spline with stat_smooth to check for one.
qplot(age.days, resid, data=seals) +
stat_smooth()
And you can also add fitted values and look at fitted versus residual.
seals <- seals %>%
add_predictions(seal_lm)
qplot(pred, resid, data=seals) +
stat_smooth(method="lm")
OK, OK, everything looks fine. Now, how do we test our model?
First, F-tests! R has a base method called anova
which - well, it's for analysis of variance, partitioning sums of squares, but it will really take on a wide variety of forms as we go forward. For now, it will produce F tables for us.
anova(seal_lm)
## Analysis of Variance Table
##
## Response: length.cm
## Df Sum Sq Mean Sq F value Pr(>F)
## age.days 1 90862 90862 2815.8 < 2.2e-16 ***
## Residuals 9663 311807 32
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Boom! P values! And they are low. Simple, no?
For more information - t-tests, R², and more - there's summary()
- again, a function that is a go-to for nearly any fit model.
summary(seal_lm)
##
## Call:
## lm(formula = length.cm ~ age.days, data = seals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8700 -3.8279 0.0304 3.7541 21.6874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.158e+02 1.764e-01 656.37 <2e-16 ***
## age.days 2.371e-03 4.467e-05 53.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.681 on 9663 degrees of freedom
## Multiple R-squared: 0.2256, Adjusted R-squared: 0.2256
## F-statistic: 2816 on 1 and 9663 DF, p-value: < 2.2e-16
This is a lot of information to drink in - function call, distribution of residuals, coefficient t-tests, and multiple pieces of information about total fit.
We may want to get this information in a more condensed form for use in other contexts - particularly to compare against other models. For that, there's a wonderful package called broom
that sweeps up your model into easily digestible pieces.
First, the coefficient table - let’s make it pretty.
library(broom)
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
tidy(seal_lm)
## term estimate std.error statistic p.value
## 1 (Intercept) 1.157668e+02 1.763751e-01 656.3668 0
## 2 age.days 2.370559e-03 4.467317e-05 53.0645 0
Nice.
We can also do this for the F-test.
tidy(anova(seal_lm))
## term df sumsq meansq statistic p.value
## 1 age.days 1 90861.81 90861.80648 2815.841 0
## 2 Residuals 9663 311806.52 32.26809 NA NA
If we want to get information about fit, there’s glance()
glance(seal_lm)
## r.squared adj.r.squared sigma statistic p.value df logLik AIC
## 1 0.2256493 0.2255691 5.680501 2815.841 0 2 -30501.53 61009.05
## BIC deviance df.residual
## 1 61030.58 311806.5 9663
Lovely! Now, how do we visualize the fit and its prediction error?
In ggplot2
we can use the smoother, stat_smooth
in conjunction with method = "lm"
to get the job done.
seal_fit_plot <- ggplot(data=seals) +
aes(x=age.days, y=length.cm) +
geom_point() +
stat_smooth(method="lm")
seal_fit_plot
Note - you cannot see the fit interval because our SE is so small with such a large N.
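If you want to convince yourself that the fit (confidence) interval really is there, a minimal sketch: ask predict() for the confidence interval on the fit, and note how tightly lwr and upr hug it.
fit_ci <- predict(seal_lm, interval="confidence")
head(fit_ci) # lwr and upr sit a hair's breadth from fit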
What about prediction intervals? That's a bit harder, and we need to dig into the specific prediction function for lms, predict()
. This function is very useful, as you can use it to generate new predicted values - and it will give you error as well. Now, modelr::add_predictions
will give you values, but no error yet. YET.
So how does predict() work? We begin with a new data frame that has columns matching the predictors in our model, then feed it to predict() to get fitted values and an interval. As the data frame is not the first argument to predict(), we can't use pipes here. Booo.
We then cbind the predictors and predicted values - along with the interval we've created. We'll also do some renaming, simply to make it easier to use this new data in our upcoming ggplot. Note, there are also arguments to get intervals other than the default 95% (see the sketch after the output below).
predFrame <- data.frame(age.days = min(seals$age.days):max(seals$age.days))
predVals <- predict(seal_lm, newdata=predFrame, interval="prediction")
predFrame <- cbind(predFrame, predVals) %>%
rename(length.cm = fit)
head(predFrame)
## age.days length.cm lwr upr
## 1 958 118.0377 106.8996 129.1759
## 2 959 118.0401 106.9019 129.1783
## 3 960 118.0425 106.9043 129.1807
## 4 961 118.0449 106.9067 129.1830
## 5 962 118.0472 106.9090 129.1854
## 6 963 118.0496 106.9114 129.1878
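The level argument is how you ask for those other interval widths. A minimal sketch for a 90% prediction interval instead of the default 95%:
predVals90 <- predict(seal_lm, newdata=predFrame, interval="prediction", level=0.90)
head(predVals90)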
We can then add this to our plot using the geom_ribbon
which takes a ymin
and ymax
argument to generate a ribbon - like the fit standard error ribbon.
seal_fit_plot +
stat_smooth(method="lm", color="blue") +
geom_ribbon(data=predFrame, mapping=aes(x=age.days, ymin=lwr, ymax=upr),
fill="grey", alpha=0.6) +
theme_bw()
Now we can better see the 95% prediction interval - and that we do have some points that fall outside of it.
First, the relationship between how lean you are and how quickly you lose heat. Implement this to get a sense of the general workflow for analysis.
library(ggplot2)
fat <- read.csv("./data/06/17q04BodyFatHeatLoss Sloan and Keatinge 1973 replica.csv")
#initial visualization to determine if lm is appropriate
fat_plot <- ggplot(data=fat, aes(x=leanness, y=lossrate)) +
geom_point()
fat_plot
fat_mod <- lm(lossrate ~ leanness, data=fat)
#assumptions
plot(fat_mod, which=1)
plot(fat_mod, which=2)
#f-tests of model
anova(fat_mod)
#t-tests of parameters
summary(fat_mod)
#plot with line
fat_plot +
stat_smooth(method=lm, formula=y~x)
For your first faded example, let’s look at the relationship between DEET and mosquito bites.
deet <- read.csv("./data/06/17q24DEETMosquiteBites.csv")
deet_plot <- ggplot(data=___, aes(x=dose, y=bites)) +
geom_point()
deet_plot
deet_mod <- lm(bites ~ dose, data=deet)
#assumptions
plot(___, which=1)
plot(___, which=2)
#f-tests of model
anova(___)
#t-tests of parameters
summary(___)
#plot with line
deet_plot +
stat_smooth(method=lm, formula=y~x)
One of the most common reasons for a linear model to not fit is a nonlinearity in the data generating process. Often in nature we encounter exponential processes with a log-normal error structure. This is common in count data. Now, really, such data is often a Poisson distributed variable, but, to a first approximation, log-transformation can often help fix what ails your model and give you reasonable estimates for your tests. We'll talk later in the course about why this isn't the best idea, and why you should start with a nonlinear/non-normal model to begin with.
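To see why this works, here's a minimal sketch with made-up numbers (the coefficients 1 and 0.5 are arbitrary): an exponential process with log-normal error becomes a linear process with normal error once you take the log.
# simulate an exponential process with log-normal error
set.seed(31)
x <- runif(100, 0, 5)
y <- exp(1 + 0.5*x) * exp(rnorm(100, 0, 0.3))
# on a log scale, lm() recovers the parameters (~1 and ~0.5)
coef(lm(log(y) ~ x))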
Let’s practice the workflow of how we handle this log transformation.
We might suspect that the relationship was nonlinear. Let’s see how a simple log transform works here. Note the modifications to model fitting and stat_smooth
.
deet_mod_log <- lm(log(bites) ~ dose, data=deet)
#assumptions
plot(___, which=1)
plot(___, which=2)
#f-tests of model
anova(___)
#t-tests of parameters
summary(___)
#plot with line
deet_plot +
scale_y_continuous(trans="log") +
stat_smooth(method=lm, formula=y~x)
Do longer lived species also have larger home ranges? Let’s test this!
zoo <- read.csv("./data/06/17q02ZooMortality Clubb and Mason 2003 replica.csv")
zoo_plot <- ggplot(data=___, aes(x=mortality, y=homerange)) +
___()
___
zoo_mod <- lm(___, data=___)
#assumptions
plot(___, which=1)
plot(___, which=2)
#f-tests of model
anova(___)
#t-tests of parameters
summary(___)
#plot with line
zoo_plot +
stat_smooth(method=___, formula=___)
That definitely wasn’t linear. Look at that outlier! Let’s log our y and see how things change.
zoo_mod_log <- lm(log(___) ~ ___, ___=___)
#assumptions
___(___)
___(___)
#f-tests of model
___(___)
#t-tests of parameters
___(___)
#plot with line
zoo_plot +
scale_y_continuous(trans="___")+
___(method=___, formula=___)
OK, this is great, but what about power? Let’s look at our seal fit again - maybe we want to know what would have happened with a lower sample size?
Note, I’m going to use a knitr function to neaten up the table for markdown. Try it in your own homeworks!
knitr::kable(tidy(seal_lm))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 115.7667505 | 0.1763751 | 656.3668 | 0 |
| age.days | 0.0023706 | 0.0000447 | 53.0645 | 0 |
What about the residual SD?
knitr::kable(glance(seal_lm)[,1:3])
| r.squared | adj.r.squared | sigma |
|---|---|---|
| 0.2256493 | 0.2255691 | 5.6805 |
All right - there are our target values. Now, we can change a lot of things: the effect size (slope), range of values, sigma, and more. But let's try sample size.
The first step of a power analysis is to set up a data frame with all of the different possibilities that you might want to assess for power. For linear regression, the parameters that might vary are the slope, the intercept, the residual standard deviation (sigma), and the sample size.
To do a power analysis via simulation for a linear regression, we begin by building our simulation data frame with the parameters and the information that varies - in this case, sample size.
set.seed(607)
simPopN <- data.frame(slope = 0.00237,
intercept=115.767,
sigma = 5.6805,
n=10:100)
OK, now we need to expand this out, and have some number of samples for each n. For that, we can use a function from tidyr
(the same library as crossing()), expand()
.
library(tidyr)
simPopN <- simPopN %>%
group_by(slope, intercept, sigma, n) %>%
expand(reps = 1:n) %>%
ungroup()
Now, if we want to simulate each of these, say, 500 times, we need to assign unique sim numbers, so for each n and sim number we have a unique data set.
simPopN <- simPopN %>%
crossing(sim = 1:500)
Great - almost ready to go! Now we just need to add in simulated values. Fortunately, as rnorm()
works with vectors, we can just use a mutate here. We'll also need to simulate random draws of ages, but that's just another random number.
simPopN <- simPopN %>%
mutate(age.days = runif(n(), 1000, 8500)) %>%
mutate(length.cm = rnorm(n(), intercept + slope*age.days, sigma))
Yahtzee! Ready to run!
First, we need to generate a lot of fit models. dplyr doesn't take too kindly to columns of fit model objects, so we can use two powerful functions here - first, nest()
and unnest()
allow us to collapse grouped data down into little pieces and re-expand it.
fits <- simPopN %>%
group_by(slope, intercept, sigma, n, sim) %>%
nest()
fits
## # A tibble: 45,500 × 6
## slope intercept sigma n sim data
## <dbl> <dbl> <dbl> <int> <int> <list>
## 1 0.00237 115.767 5.6805 10 1 <tibble [10 × 3]>
## 2 0.00237 115.767 5.6805 10 2 <tibble [10 × 3]>
## 3 0.00237 115.767 5.6805 10 3 <tibble [10 × 3]>
## 4 0.00237 115.767 5.6805 10 4 <tibble [10 × 3]>
## 5 0.00237 115.767 5.6805 10 5 <tibble [10 × 3]>
## 6 0.00237 115.767 5.6805 10 6 <tibble [10 × 3]>
## 7 0.00237 115.767 5.6805 10 7 <tibble [10 × 3]>
## 8 0.00237 115.767 5.6805 10 8 <tibble [10 × 3]>
## 9 0.00237 115.767 5.6805 10 9 <tibble [10 × 3]>
## 10 0.00237 115.767 5.6805 10 10 <tibble [10 × 3]>
## # ... with 45,490 more rows
Second, the map
function in the purrr library allows us to iterate over different levels or grouped data frames and perform some function. In this case, we'll fit a model and get its coefficients using broom
.
fits <- fits %>%
mutate(mod = purrr::map(data, ~lm(length.cm ~ age.days, data=.))) %>%
mutate(coefs = purrr::map(mod, ~tidy(.)))
fits
## # A tibble: 45,500 × 8
## slope intercept sigma n sim data mod
## <dbl> <dbl> <dbl> <int> <int> <list> <list>
## 1 0.00237 115.767 5.6805 10 1 <tibble [10 × 3]> <S3: lm>
## 2 0.00237 115.767 5.6805 10 2 <tibble [10 × 3]> <S3: lm>
## 3 0.00237 115.767 5.6805 10 3 <tibble [10 × 3]> <S3: lm>
## 4 0.00237 115.767 5.6805 10 4 <tibble [10 × 3]> <S3: lm>
## 5 0.00237 115.767 5.6805 10 5 <tibble [10 × 3]> <S3: lm>
## 6 0.00237 115.767 5.6805 10 6 <tibble [10 × 3]> <S3: lm>
## 7 0.00237 115.767 5.6805 10 7 <tibble [10 × 3]> <S3: lm>
## 8 0.00237 115.767 5.6805 10 8 <tibble [10 × 3]> <S3: lm>
## 9 0.00237 115.767 5.6805 10 9 <tibble [10 × 3]> <S3: lm>
## 10 0.00237 115.767 5.6805 10 10 <tibble [10 × 3]> <S3: lm>
## # ... with 45,490 more rows, and 1 more variables: coefs <list>
Last, we clean up - we unnest()
, which expands those nested coefficient tables back out into one data frame. We'll also filter for just the slope.
fits <- fits %>%
unnest(coefs) %>%
ungroup() %>%
filter(term == "age.days")
fits
## # A tibble: 45,500 × 10
## slope intercept sigma n sim term estimate std.error
## <dbl> <dbl> <dbl> <int> <int> <chr> <dbl> <dbl>
## 1 0.00237 115.767 5.6805 10 1 age.days 0.0036983284 0.0005383329
## 2 0.00237 115.767 5.6805 10 2 age.days 0.0031738767 0.0010134023
## 3 0.00237 115.767 5.6805 10 3 age.days 0.0035555894 0.0006222090
## 4 0.00237 115.767 5.6805 10 4 age.days 0.0015086981 0.0008623719
## 5 0.00237 115.767 5.6805 10 5 age.days 0.0030754009 0.0009175490
## 6 0.00237 115.767 5.6805 10 6 age.days 0.0018232261 0.0007620678
## 7 0.00237 115.767 5.6805 10 7 age.days 0.0030549324 0.0008115292
## 8 0.00237 115.767 5.6805 10 8 age.days 0.0021580987 0.0011734079
## 9 0.00237 115.767 5.6805 10 9 age.days 0.0005435461 0.0008337824
## 10 0.00237 115.767 5.6805 10 10 age.days 0.0027293950 0.0004923838
## # ... with 45,490 more rows, and 2 more variables: statistic <dbl>,
## # p.value <dbl>
Notice that we do indeed have p-values, so we can use these fits to get power for each sample size. We now do our normal process - grouping by sample size and summarizing - and then plot the result!
pow <- fits %>%
group_by(n) %>%
summarise(power = 1-sum(p.value>0.05)/n()) %>%
ungroup()
qplot(n, power, data=pow, geom=c("point", "line")) +
theme_bw(base_size=17) +
geom_hline(yintercept=0.8, lty=2)
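If you want a single number out of this curve - say, the smallest sample size that crosses the usual 0.8 threshold - a minimal sketch:
pow %>%
  filter(power >= 0.8) %>%
  summarise(smallest_n = min(n)) # returns Inf with a warning if nothing crosses 0.8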