Combining categorical and continuous variables is not that different from multiway ANOVA. To start with, let’s look at the neanderthal data.
neand <- read.csv("./data/11/18q09NeanderthalBrainSize.csv")
head(neand)
## lnmass lnbrain species
## 1 4.00 7.19 recent
## 2 3.96 7.23 recent
## 3 3.95 7.26 recent
## 4 4.04 7.23 recent
## 5 4.11 7.22 recent
## 6 4.10 7.24 recent
We can clearly see both the categorical variable we’re interested in, and the covariate.
To get a preliminary sense of what’s going on here, do some exploratory visualization with ggplot2
why doncha!
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
qplot(lnmass, lnbrain, color=species, data=neand) +
stat_smooth(method="lm")
Now, the CIs are going to be off as this wasn’t all tested in the same model, but you begin to get a sense of whether things are parallel or not, and whether this covariate is important.
What other plots might you produce?
As this is a general linear model, good olde lm()
is still there for us.
neand_lm <- lm(lnbrain ~ species + lnmass, data=neand)
In addition to the ususal tests, we need to make sure that the slopes really are parallel. We do that by fitting a model with an interaction and testing it (which, well, if there was and interaction, might that be interesting).
First, the usual
par(mfrow=c(2,2))
plot(neand_lm, which=c(1,2,5))
par(mfrow=c(1,1))
#And now look at residuals by group/predictors
library(car)
residualPlots(neand_lm, tests=FALSE)
To test the parallel presumption
neand_int <- lm(lnbrain ~ species*lnmass, data=neand)
Anova(neand_int)
## Anova Table (Type II tests)
##
## Response: lnbrain
## Sum Sq Df F value Pr(>F)
## species 0.027553 1 6.2203 0.0175 *
## lnmass 0.130018 1 29.3527 4.523e-06 ***
## species:lnmass 0.004845 1 1.0938 0.3028
## Residuals 0.155033 35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, first we have our F-tests.
Anova(neand_lm)
## Anova Table (Type II tests)
##
## Response: lnbrain
## Sum Sq Df F value Pr(>F)
## species 0.027553 1 6.2041 0.01749 *
## lnmass 0.130018 1 29.2764 4.262e-06 ***
## Residuals 0.159878 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the treatment and covariate matter.
Second, we might want to compare covariate adjusted groups and/or look at covariate adjusted means.
library(lsmeans)
## Loading required package: estimability
adj_means <- lsmeans(neand_lm, specs="species")
#adjusted means
adj_means
## species lsmean SE df lower.CL upper.CL
## neanderthal 7.271581 0.02418546 36 7.222530 7.320631
## recent 7.341859 0.01250077 36 7.316506 7.367212
##
## Confidence level used: 0.95
#comparisons
contrast(adj_means, method="tukey", adjust="none")
## contrast estimate SE df t.ratio p.value
## neanderthal - recent -0.0702784 0.02821518 36 -2.491 0.0175
If you have an interaction, this method is no longer valid. Instead, you’ll need to monkey with your posthocs (if you even want to use them - often we don’t) to look at tests at different levels of the covariate.
Visualization is funny, as you want to make parallel lines and also get the CIs right. Rather than rely on ggplot2
to do this natively, we need to futz around a bit with generating predictions
neand_predict <- predict(neand_lm, interval="confidence")
head(neand_predict)
## fit lwr upr
## 1 7.243614 7.203988 7.283240
## 2 7.223761 7.178077 7.269445
## 3 7.218798 7.171538 7.266059
## 4 7.263467 7.229347 7.297586
## 5 7.298209 7.271375 7.325042
## 6 7.293246 7.265628 7.320863
So, here we have fit values, lower confidence interval, and upper confidence intervals. As we have not fed predict()
any new data, these values line up with our neand
data frame, so we can cbind it all together, and then use these values to make a prediction plot.
neand <- cbind(neand, neand_predict)
ggplot(data=neand) +
geom_point(mapping=aes(x=lnmass, y=lnbrain, color=species)) +
geom_line(mapping = aes(x = lnmass, y=fit, color=species)) +
geom_ribbon(data=neand, aes(x = lnmass,
ymin=lwr,
ymax=upr,
group = species),
fill="lightgrey",
alpha=0.5) +
theme_bw()
And there we have nice parallel lines with model predicted confidence intervals!
I’ve provided two data sets:
1) 18e4MoleRatLayabouts.csv
looking at how caste and mass affect the energy mole rates expend
2) 18q11ExploitedLarvalFish.csv
looking at how status of a marine area - protected or not - influences the CV around age of maturity of a number of different fish (so, age is a predictor)
Using the following workflow, analyze these data sets.
# Load the data
# Perform a preliminary visualization
# Fit an ANCOVA model
# Test Asssumptions and modeify model if needed
# Evaluate results
# Post-hocs if you can
# Visualize results
Multiple linear regression is conceptially very similar to ANCOVA. Let’s use the keeley fire severity plant richness data to see it in action.
keeley <- read.csv("data/11/Keeley_rawdata_select4.csv")
head(keeley)
## distance elev abiotic age hetero firesev cover rich
## 1 53.40900 1225 60.67103 40 0.757065 3.50 1.0387974 51
## 2 37.03745 60 40.94291 25 0.491340 4.05 0.4775924 31
## 3 53.69565 200 50.98805 15 0.844485 2.60 0.9489357 71
## 4 53.69565 200 61.15633 15 0.690847 2.90 1.1949002 64
## 5 51.95985 970 46.66807 23 0.545628 4.30 1.2981890 68
## 6 51.95985 970 39.82357 24 0.652895 4.00 1.1734866 34
For our purposes, we’ll focus on fire severity and plant cover as predictors.
I’m not going to lie, visualizing multiple continuous variables is as much of an art as a science. One can use colors and sizes of points, or slice up the data into chunks and facet that. Here are a few examples.
qplot(cover, rich, color=firesev, data=keeley) +
scale_color_gradient(low="yellow", high="red") +
theme_bw()
qplot(cover, rich, color=firesev, data=keeley) +
scale_color_gradient(low="yellow", high="red") +
theme_bw() +
facet_wrap(~cut_number(firesev, 4))
Note the new faceting otion. Cool, no?
What other plots can you make?
Fitting is straightforward for an additive MLR model. It’s just a linear model!
keeley_mlr <- lm(rich ~ firesev + cover, data=keeley)
As for assumptions, we have the usual
par(mfrow=c(2,2))
plot(keeley_mlr, which=c(1,2,5))
par(mfrow=c(1,1))
But now we also need to think about how the residuals relate to each predictor. Fortunately, there’s still residualPlots
.
residualPlots(keeley_mlr, test=FALSE)
Odd bow shape - but not too bad. Maybe there’s an interaction? Maybe we want to log transform something? Who knows!
We also want to look at multicollinearity. There are two steps for that. First, the vif
vif(keeley_mlr)
## firesev cover
## 1.236226 1.236226
Not bad. We might also want to look at the correlation matrix. Dplyr can help us here as we want to select down to just relevant columns.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
keeley %>%
select(firesev, cover) %>%
cor()
## firesev cover
## firesev 1.0000000 -0.4371346
## cover -0.4371346 1.0000000
We evaluate the same way as usual
Anova(keeley_mlr)
## Anova Table (Type II tests)
##
## Response: rich
## Sum Sq Df F value Pr(>F)
## firesev 1258.9 1 6.5004 0.01254 *
## cover 711.6 1 3.6744 0.05854 .
## Residuals 16849.1 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And then the coefficients and R2
summary(keeley_mlr)
##
## Call:
## lm(formula = rich ~ firesev + cover, data = keeley)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.029 -11.583 -1.655 11.759 28.722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.9358 7.0437 7.657 2.45e-11 ***
## firesev -2.5308 0.9926 -2.550 0.0125 *
## cover 9.9105 5.1701 1.917 0.0585 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.92 on 87 degrees of freedom
## Multiple R-squared: 0.1703, Adjusted R-squared: 0.1513
## F-statistic: 8.93 on 2 and 87 DF, p-value: 0.0002968
Not amazing fit, but, the coefficients are clearly different from 0.
This is where things get sticky. We have two main approaches. First, visualizing with component + residual plots
crPlots(keeley_mlr, smooth=FALSE)
But the values on the y axis are….odd. We get a sense of what’s going on and the scatter after accounting for our predictor of interest, but we might want to look at, say, evaluation of a variable at the mean of the other.
For that, we have visreg
library(visreg)
par(mfrow=c(1,2))
visreg(keeley_mlr)
par(mfrow=c(1,1))
Now the axes make far more sense, and we have a sense of the relationship.
We can actually whip this up on our own using crossing
, the median of each value, and predict.
k_med_firesev <- data.frame(firesev = median(keeley$firesev),
cover = seq(0,1.5, length.out = 100))
k_med_firesev <- cbind(k_med_firesev, predict(keeley_mlr,
newdata = k_med_firesev,
interval="confidence"))
ggplot() +
geom_point(data=keeley, mapping = aes(x=cover, y = rich)) +
geom_line(data = k_med_firesev, mapping=aes(x=cover, y=fit), color="blue") +
geom_ribbon(data = k_med_firesev, mapping = aes(x=cover,
ymin = lwr,
ymax = upr),
fill="lightgrey", alpha=0.5)
OK, here are two datasets to work with:
1) planktonSummary.csv
showing plankton from Lake Biakal (thanks, Stephanie Hampton). Evluate how Chlorophyll (CHLFa
) is affected by other predictors.
2) SwaddleWestNile2002NCEAS_shortnames.csv
is about the prevalence of West Nile virus in Birds around Sacramento county in California. What predicts human WNV?
Using the following workflow, analyze these data sets.
# Load the data
# Perform a preliminary visualization. Play with this and choose two predictors
# Fit a MLR model
# Test Asssumptions and modeify model if needed
# Evaluate results
# Visualize results