General Linear Models

1. ANCOVA

Combining categorical and continuous variables is not that different from multiway ANOVA. To start with, let’s look at the neanderthal data.

neand <- read.csv("./data/11/18q09NeanderthalBrainSize.csv")
head(neand)

##   lnmass lnbrain species
## 1   4.00    7.19  recent
## 2   3.96    7.23  recent
## 3   3.95    7.26  recent
## 4   4.04    7.23  recent
## 5   4.11    7.22  recent
## 6   4.10    7.24  recent

We can clearly see both the categorical variable we’re interested in, and the covariate.

To get a preliminary sense of what’s going on here, do some exploratory visualization with ggplot2 why doncha!

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.2

qplot(lnmass, lnbrain, color=species, data=neand) +
  stat_smooth(method="lm")

Now, the CIs are going to be off as this wasn’t all tested in the same model, but you begin to get a sense of whether things are parallel or not, and whether this covariate is important.

What other plots might you produce?

As this is a general linear model, good olde lm() is still there for us.

neand_lm <- lm(lnbrain ~ species + lnmass, data=neand)

1.1 Testing Assumptions of ANCOVA

In addition to the ususal tests, we need to make sure that the slopes really are parallel. We do that by fitting a model with an interaction and testing it (which, well, if there was and interaction, might that be interesting).

First, the usual

par(mfrow=c(2,2))
plot(neand_lm, which=c(1,2,5))
par(mfrow=c(1,1))

#And now look at residuals by group/predictors
library(car)
residualPlots(neand_lm, tests=FALSE)

To test the parallel presumption

neand_int <- lm(lnbrain ~ species*lnmass, data=neand)

Anova(neand_int)

## Anova Table (Type II tests)
## 
## Response: lnbrain
##                  Sum Sq Df F value    Pr(>F)    
## species        0.027553  1  6.2203    0.0175 *  
## lnmass         0.130018  1 29.3527 4.523e-06 ***
## species:lnmass 0.004845  1  1.0938    0.3028    
## Residuals      0.155033 35                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1.2 Assessing results

So, first we have our F-tests.

Anova(neand_lm)

## Anova Table (Type II tests)
## 
## Response: lnbrain
##             Sum Sq Df F value    Pr(>F)    
## species   0.027553  1  6.2041   0.01749 *  
## lnmass    0.130018  1 29.2764 4.262e-06 ***
## Residuals 0.159878 36                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both the treatment and covariate matter.

Second, we might want to compare covariate adjusted groups and/or look at covariate adjusted means.

library(lsmeans)

## Loading required package: estimability

adj_means <- lsmeans(neand_lm, specs="species")

#adjusted means
adj_means

##  species       lsmean         SE df lower.CL upper.CL
##  neanderthal 7.271581 0.02418546 36 7.222530 7.320631
##  recent      7.341859 0.01250077 36 7.316506 7.367212
## 
## Confidence level used: 0.95

#comparisons
contrast(adj_means, method="tukey", adjust="none")

##  contrast               estimate         SE df t.ratio p.value
##  neanderthal - recent -0.0702784 0.02821518 36  -2.491  0.0175

If you have an interaction, this method is no longer valid. Instead, you’ll need to monkey with your posthocs (if you even want to use them - often we don’t) to look at tests at different levels of the covariate.

1.3 Visualization

Visualization is funny, as you want to make parallel lines and also get the CIs right. Rather than rely on ggplot2 to do this natively, we need to futz around a bit with generating predictions

neand_predict <- predict(neand_lm, interval="confidence")

head(neand_predict)

##        fit      lwr      upr
## 1 7.243614 7.203988 7.283240
## 2 7.223761 7.178077 7.269445
## 3 7.218798 7.171538 7.266059
## 4 7.263467 7.229347 7.297586
## 5 7.298209 7.271375 7.325042
## 6 7.293246 7.265628 7.320863

So, here we have fit values, lower confidence interval, and upper confidence intervals. As we have not fed predict() any new data, these values line up with our neand data frame, so we can cbind it all together, and then use these values to make a prediction plot.

neand <- cbind(neand, neand_predict)

ggplot(data=neand) +
  geom_point(mapping=aes(x=lnmass, y=lnbrain, color=species)) +
  geom_line(mapping = aes(x = lnmass, y=fit, color=species)) + 
  geom_ribbon(data=neand, aes(x = lnmass, 
                              ymin=lwr, 
                              ymax=upr,
                              group = species), 
              fill="lightgrey", 
              alpha=0.5) +
  theme_bw()

And there we have nice parallel lines with model predicted confidence intervals!

1.4 Examples

I’ve provided two data sets:
1) 18e4MoleRatLayabouts.csv looking at how caste and mass affect the energy mole rates expend

2) 18q11ExploitedLarvalFish.csv looking at how status of a marine area - protected or not - influences the CV around age of maturity of a number of different fish (so, age is a predictor)

Using the following workflow, analyze these data sets.

# Load the data

# Perform a preliminary visualization

# Fit an ANCOVA model

# Test Asssumptions and modeify model if needed

# Evaluate results

# Post-hocs if you can

# Visualize results

2. Multiple Linear Regression

Multiple linear regression is conceptially very similar to ANCOVA. Let’s use the keeley fire severity plant richness data to see it in action.

keeley <- read.csv("data/11/Keeley_rawdata_select4.csv")

head(keeley)

##   distance elev  abiotic age   hetero firesev     cover rich
## 1 53.40900 1225 60.67103  40 0.757065    3.50 1.0387974   51
## 2 37.03745   60 40.94291  25 0.491340    4.05 0.4775924   31
## 3 53.69565  200 50.98805  15 0.844485    2.60 0.9489357   71
## 4 53.69565  200 61.15633  15 0.690847    2.90 1.1949002   64
## 5 51.95985  970 46.66807  23 0.545628    4.30 1.2981890   68
## 6 51.95985  970 39.82357  24 0.652895    4.00 1.1734866   34

For our purposes, we’ll focus on fire severity and plant cover as predictors.

2.1 Visualizing

I’m not going to lie, visualizing multiple continuous variables is as much of an art as a science. One can use colors and sizes of points, or slice up the data into chunks and facet that. Here are a few examples.

qplot(cover, rich, color=firesev, data=keeley) +
  scale_color_gradient(low="yellow", high="red") +
  theme_bw()

qplot(cover, rich, color=firesev, data=keeley) +
  scale_color_gradient(low="yellow", high="red") +
  theme_bw() +
  facet_wrap(~cut_number(firesev, 4))

Note the new faceting otion. Cool, no?

What other plots can you make?

2.2 Fit and Evaluate Assumptions

Fitting is straightforward for an additive MLR model. It’s just a linear model!

keeley_mlr <- lm(rich ~ firesev + cover, data=keeley)

As for assumptions, we have the usual

par(mfrow=c(2,2))
plot(keeley_mlr, which=c(1,2,5))
par(mfrow=c(1,1))

But now we also need to think about how the residuals relate to each predictor. Fortunately, there’s still residualPlots.

residualPlots(keeley_mlr, test=FALSE)

Odd bow shape - but not too bad. Maybe there’s an interaction? Maybe we want to log transform something? Who knows!

We also want to look at multicollinearity. There are two steps for that. First, the vif

vif(keeley_mlr)

##  firesev    cover 
## 1.236226 1.236226

Not bad. We might also want to look at the correlation matrix. Dplyr can help us here as we want to select down to just relevant columns.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

keeley %>%
  select(firesev, cover) %>%
  cor()

##            firesev      cover
## firesev  1.0000000 -0.4371346
## cover   -0.4371346  1.0000000

2.3 Evaluation

We evaluate the same way as usual

Anova(keeley_mlr)

## Anova Table (Type II tests)
## 
## Response: rich
##            Sum Sq Df F value  Pr(>F)  
## firesev    1258.9  1  6.5004 0.01254 *
## cover       711.6  1  3.6744 0.05854 .
## Residuals 16849.1 87                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And then the coefficients and R²

summary(keeley_mlr)

## 
## Call:
## lm(formula = rich ~ firesev + cover, data = keeley)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.029 -11.583  -1.655  11.759  28.722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.9358     7.0437   7.657 2.45e-11 ***
## firesev      -2.5308     0.9926  -2.550   0.0125 *  
## cover         9.9105     5.1701   1.917   0.0585 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.92 on 87 degrees of freedom
## Multiple R-squared:  0.1703, Adjusted R-squared:  0.1513 
## F-statistic:  8.93 on 2 and 87 DF,  p-value: 0.0002968

Not amazing fit, but, the coefficients are clearly different from 0.

2.3 Visualization

This is where things get sticky. We have two main approaches. First, visualizing with component + residual plots

crPlots(keeley_mlr, smooth=FALSE)

But the values on the y axis are….odd. We get a sense of what’s going on and the scatter after accounting for our predictor of interest, but we might want to look at, say, evaluation of a variable at the mean of the other.

For that, we have visreg

library(visreg)
par(mfrow=c(1,2))
visreg(keeley_mlr)

par(mfrow=c(1,1))

Now the axes make far more sense, and we have a sense of the relationship.

We can actually whip this up on our own using crossing, the median of each value, and predict.

k_med_firesev <- data.frame(firesev = median(keeley$firesev),
                                 cover = seq(0,1.5, length.out = 100))
k_med_firesev <- cbind(k_med_firesev, predict(keeley_mlr, 
                                              newdata = k_med_firesev,
                                              interval="confidence"))
  
ggplot() +
  geom_point(data=keeley, mapping = aes(x=cover, y = rich)) +
  geom_line(data = k_med_firesev, mapping=aes(x=cover, y=fit), color="blue") +
  geom_ribbon(data = k_med_firesev, mapping = aes(x=cover, 
                                                  ymin = lwr,
                                                  ymax = upr),
              fill="lightgrey", alpha=0.5)

2.4 Examples

OK, here are two datasets to work with:
1) planktonSummary.csv showing plankton from Lake Biakal (thanks, Stephanie Hampton). Evluate how Chlorophyll (CHLFa) is affected by other predictors.

2) SwaddleWestNile2002NCEAS_shortnames.csv is about the prevalence of West Nile virus in Birds around Sacramento county in California. What predicts human WNV?
Using the following workflow, analyze these data sets.

# Load the data

# Perform a preliminary visualization. Play with this and choose two predictors

# Fit a MLR model

# Test Asssumptions and modeify model if needed

# Evaluate results

# Visualize results