For this lab, see the etherpad at https://etherpad.wikimedia.org/p/607-bayes
Last week, we fit parameters with likelihood. To give you a hands-on feel for Bayesian data analysis let’s do the same thing with Bayes. We’re going to fit a single parameter - lambda from a poisson distribution - with Bayes instead of likelihood (although as you’ll see likelihood is a part of it!).
Let’s say you start with this data:
pois_data <- c(6, 6, 13, 7, 11, 10, 9, 7, 7, 12)
Now, how do you get the Bayesian estimate and credible intervals for this?
Now, in Bayesian data analysis, according to Bayes theorem
\[p(\lambda | data) = \frac{p(data | \lambda)p(\lambda)}{p(data)}\]
To operationalize this, we can see three things we need to either provide or calculate
The likelihood of each choice of lambda.
The prior probability of each choice of lambda.
The marginal distribution - i.e., the sum of the product of p(D|H)p(H), as this is a discrete distribution.
So, first, what range of lambdas should we test? And then let’s get a likelihood profile. Well, since our range of data is from 6-12, we can reasonably assume lambda must be between 6 and 13. Could be outside of that, but it’s a reasonable suggestion. We can then simple get our likelihood profile using dplyr.
library(dplyr)
library(tidyr)
bayes_analysis <- data.frame(lambda = 6:13) %>%
rowwise() %>%
mutate(lik = sum(dpois(pois_data, lambda))) %>%
ungroup()
OK, that’s our likelihood. We know what to do with that. BUT - now we want to add in a prior. So, what should our prior be? For the sake of argument here, let’s go with a flat prior using a uniform distribution between 6 and 13. Note, this actually is somewhat informative, in that our lambda cannot be outside of these bounds. But that is a choice! We could have done something else - a big flat normal or somesuch. But, let’s go with the uniform. After filling it in as a column, we then multiple our likelihood by our prior.
bayes_analysis <- bayes_analysis %>%
mutate(prior = dunif(lambda, min = 6, max = 13)) %>%
mutate(numerator = lik*prior)
OK, now let’s get our posterior! To do that, we just divide the numerator of our posterior by the marginal likelihood - which is just the sum of that numerator!
bayes_analysis <- bayes_analysis %>%
mutate(posterior = numerator/sum(numerator))
library(ggplot2)
ggplot(data=bayes_analysis, mapping = aes(x = lambda, y = posterior)) +
geom_bar(stat="identity")
Or we can look at a table
lambda | lik | prior | numerator | posterior |
---|---|---|---|---|
6 | 0.8834112 | 0.1428571 | 0.1262016 | 0.1266247 |
7 | 1.0031112 | 0.1428571 | 0.1433016 | 0.1437820 |
8 | 1.0363080 | 0.1428571 | 0.1480440 | 0.1485403 |
9 | 1.0040256 | 0.1428571 | 0.1434322 | 0.1439131 |
10 | 0.9279933 | 0.1428571 | 0.1325705 | 0.1330149 |
11 | 0.8252266 | 0.1428571 | 0.1178895 | 0.1182848 |
12 | 0.7085170 | 0.1428571 | 0.1012167 | 0.1015561 |
13 | 0.5880167 | 0.1428571 | 0.0840024 | 0.0842840 |
From this table, we can see that the 90%CI is wide - ranges from 6-13. That’s because we have a weak prior and not much data. Now, what if we’d had a stronger prior? Maybe a poisson distribution with a lambda of 10?
bayes_analysis <- bayes_analysis %>%
mutate(prior2 = dpois(lambda, 10),
posterior2 = (lik*prior2)/sum(lik*prior2))
ggplot(data=bayes_analysis) +
geom_area(alpha=0.5, fill="red", mapping=aes(x=lambda, y=posterior)) +
geom_area(alpha=0.5, fill="blue", mapping=aes(x=lambda, y=posterior2)) +
ggtitle("Red = Flat Priot, Blue = Informative Prior")
knitr::kable(bayes_analysis %>% select(lambda, posterior, posterior2))
lambda | posterior | posterior2 |
---|---|---|
6 | 0.1266247 | 0.0786391 |
7 | 0.1437820 | 0.1275636 |
8 | 0.1485403 | 0.1647315 |
9 | 0.1439131 | 0.1773332 |
10 | 0.1330149 | 0.1639042 |
11 | 0.1182848 | 0.1325030 |
12 | 0.1015561 | 0.0948029 |
13 | 0.0842840 | 0.0605226 |
A noticable difference. The 90% CI still contains 6 and 13 - but just barely. The 80% CI is only from 7-12.
What’s super near about this is that you can simulate samples from your posterior density. Say, draw 100 sampled lambdas, then, for each lambda, draw a sample of 10 random numbers (as in our initial distribution). We can then see how these posterior predictive distributions compare to the original.
nsims <- 10
posterior_sims <- data.frame(sampled_lambda = sample(6:13, size = nsims,
replace=TRUE,
prob = bayes_analysis$posterior),
sim = 1:nsims) %>%
group_by(sim) %>%
nest() %>%
mutate(predicted_values = purrr::map(data, ~rpois(10, .$sampled_lambda))) %>%
unnest(predicted_values) %>%
ungroup()
ggplot() +
geom_density(mapping=aes(x=pois_data), fill="lightblue") +
geom_density(data=posterior_sims, mapping=aes(group = sim, x=predicted_values), color="grey") +
theme_bw()
Today we’re going to go through fitting and evaluating a linear regression fit using Bayesian techiniques. For that, we’re going to use the rstanarm
library which uses STAN to perform the MCMC simulations.
We’ll use the seal linear regression as an example.
library(rstanarm)
seals <- read.csv("data/06/17e8ShrinkingSeals Trites 1996.csv")
head(seals)
## age.days length.cm
## 1 5337 131
## 2 2081 123
## 3 2085 122
## 4 4299 136
## 5 2861 122
## 6 5052 131
Note that when you loaded rstanarm
it gave you some warnings bout wanting to use more cores. THis is great - MCMC is one of those places where using all of your computer’s cores (most these days have at least two) can really speed things along. And the parallelization is done for you!
options(mc.cores = parallel::detectCores())
The basic steps of fitting a linear regression using Bayesian techniques (presuming you’ve already settled on a linear data generating process and a normal error generating process) are as follows.
1. Fit the model
2. Assess convergence of chains
3. Evaluate posterior distributions
4. Check for model misspecification (fit v. residual, qq plot)
5. Evaluate simulated residual distributions
6. Evaluate simulated fit versus observed values
7. Compare posterior predictive simulations to observed values
8. Visualize fit and uncertainty
To begin, let’s define our model. The beauty of rstanarm
is that you merely just tack stan_
onto most common fitting functions in R. We’re not going to use stan_lm()
(more on that in a second), but instead stan_glm
. GLM stands for Generalized Linear Model. The reason we’re using stan_glm
is because it allows to set priors for slopes and intercepts.
By default, though, it sets relatively flat priors (slope prior ~ N(0, 2.5) and intercept prior ~ N(0,10)). No intercept is set for the SD, as the SD results from your choice of slope and intercept.
To specify that we’re using a gaussian error, simply set the family
argument to gaussian()
- evertyhing else is handled for you.
Note, if you really like being this specific, you can use glm
to fit all of your linear models. It uses likelihood for the fit.
set.seed(607)
seal_lm_bayes <- stan_glm(length.cm ~ age.days,
data = seals,
family=gaussian())
Note the output - you can see you’re doing something! And you get a sense of speed.
Now, there is a stan_lm
function that works really well - it’s prior is just different. Rather than put priors on parameters, you put a prior on your fit. With one parameter, you have to specify the mean, median, or mode. It’s a bit fiddly, and you cannot then change the priors for individual parameters, so, I don’t do it. But, as an example…
seal_lm_bayes <- stan_lm(length.cm ~ age.days,
data = seals,
prior = R2(0.5, what='median'))
Before we diagnose whether we have a good model or not, we want to make sure that our MCMC chains have converged properly so that we can feel confident in our ability to assess the model. Now, rstanarm
usually runs until you have reached convergence, as the models it works with are pretty straightforward. But, good to check.
We’re going to check a few diagnostics:
Diagnostic | Fix |
---|---|
Did your chains converge? | More iterations, check model |
Are your posteriors well-behaved? | Longer burning, more interations, check model & priors |
Are samples in your chains uncorrelated? | Change your thinning interval |
So, first, did your model converge? The easiest way to see this is to plot the trace of your four chains. Now, plot()
does a lot of things with rstanarm objects
- you just have to tell it the plotfun
you want to use. To see the full list, try ?rstanarm-plots
. The first we will use, stan_trace
shows the traoce of our chains.
plot(seal_lm_bayes, plotfun="stan_trace")
Note, there’s a par
argument, so, you can see other chains, like sigma
for the SD.
You can assess convergence by examining the Rhat values in summary(seal_lm_bayes)
. These values are something called the Gelman-Rubin statistic, and it should be at or very close to 1.
OK, this looks pretty good - but, are those posteriors normal?
plot(seal_lm_bayes, show_density = TRUE)
Oh, that’s weird - the scale of the parameters is so off that we cannot clearly see their distribution. So, instead…
plot(seal_lm_bayes, show_density = TRUE, par="age.days")
plot(seal_lm_bayes, show_density = TRUE, par="(Intercept)")
plot(seal_lm_bayes, show_density = TRUE, par="sigma")
Note that the function itself tells you what the intervals are. You can set them yourself, for example for the 67% interval and showing the full width of the distribution using
plot(seal_lm_bayes, show_density = TRUE, par="age.days",
ci_level=0.67, outer=1)
Well, everything looks nice and normal and well behaved!
Last, we want to look at autocorrelation within our chains. We want values in the thinned chain to be uncorelated with each other. So, while they’d have an autocorrelation that might be high at a distance of 1, this should drop to near zero very quickly. If not, we need a different thinning interval.
plot(seal_lm_bayes, plotfun="stan_ac")
What do you do if you have funky errors? 1) Check your model for errors/bad assumptions. Just visualize the data! 2) Check your priors to see if they are of and try something different. Maybe a uniform prior was a bad choice, and a flat normal is a better idea. (I have made this error) 3) Try different algorithms found in ?stan_glm
but make sure you read the documentation to know what you are doing. 4) Dig deeper into the docs and muck with some of the guts of how the algoright works. Common fixes include the following, but there are so many more - iter
to up from 2000 to more iterations - warmpup
to change the burnin period - thin
to step up the thinning - adapt_delta
to change the acceptance probability of your MCMC chains.
Also, as a final note, using shinystan
you can create a nice interactive environment to do posterior checks.
So, your MCMC is ok. What now? Well, we have our usual suite of model diagnostics - but they’re just a wee bit different given that we have simulated chains we’re working with. So, here’s what we’re going to look at:
Diagnostic | Probable Error |
---|---|
Fitted v. residual | Check linearity, error |
QQ Plot | Check error distribution |
Simulated residual histograms | Check error distribution |
Simulated Fit v. Observed | Check linearity |
Reproduction of Sample Properties | Respecify Model |
Outlier Analysis | Re-screen data |
To start with, let’s look at the relationships between the fitted and residual values at the estimated parameter values. We’ll need to extract those values and work from there.
fit_vals <- fitted(seal_lm_bayes)
res <- residuals(seal_lm_bayes)
plot(fit_vals, res)
Good. Now the QQ plot.
qqnorm(res)
qqline(res)
This is well and good, but, we know that we’re working with chains here. So, while these plots might be good for point estimates, we reall want to look at either replicate simiulated outputs or the same plots as above, but with mean values from MCMC draws of our posterior predictions. For example, here’s a qq plot with simulated residuals from 4000 simulations per each data point. Note that the only funky thing here, besides generating the prediction simulation matrix via posterior_predict
, is the transposition of matrices we have to do to get things to align for the subtraction
pred_vals <- posterior_predict(seal_lm_bayes)
resid_vals <- t( t(pred_vals) - seals$length.cm)
qqnorm(colMeans(resid_vals))
qqline(colMeans(resid_vals))
Not too different. And, hey, now we can look at the average of the posterior prediction against the residual.
plot(colMeans(pred_vals), colMeans(resid_vals))
rstanarm
has a few of these types of diagnostics builtin. They look at simulated draws, as those give you a better idea of how your model is behaving.
For residuals, the first is to look at the histogram of residuals from several simulated runs.
pp_check(seal_lm_bayes, check="residuals", bins=10)
We can also look at the relationship between average fitted and observed values
pp_check(seal_lm_bayes, check="scatter")
Note, you can add the nreps
argument to look at individual simulation runs. But, in general there should be a nice cloud around this line. And it should follow a roughly 1:1 relationship.
Another check is to see if we have reproduced the properties of our observed response variable. The “test” check gets at that. You can one or more properties of the response variable. The default is to look at the mean, but it’s often useful to look at both the mean and SD
pp_check(seal_lm_bayes, check="test")
pp_check(seal_lm_bayes, check="test", test=c("mean", "sd"))
Last, we can look at whether simulations of the posterior lineup with the actual distribution of the posterior. If there are wild swings, we know there is a problem.
pp_check(seal_lm_bayes, check="distributions", nreps=3)
As with all diagnostics, failures indicate a need to consider respecifying a model and/or poor assumptions in the error generating process.
Last, we can look for outliers using loo
- leave one out. We use a particular form of this function as there are a few.
plot(rstanarm::loo(seal_lm_bayes), label_points = TRUE, cex=1.1)
Points with a score >0.5 are a bit worrisoome, and scores >1.0 have a large deal of leverage and should be examined.
OK, phew we have gotten through the mdoel checking stage. Now, what does our model tell us?
#adding extra digits as some of these are quite small
summary(seal_lm_bayes, digits=5)
## stan_glm(formula = length.cm ~ age.days, family = gaussian(),
## data = seals)
##
## Family: gaussian (identity)
## Algorithm: sampling
## Posterior sample size: 4000
## Observations: 9665
##
## Estimates:
## mean sd 2.5% 25%
## (Intercept) 115.76393 0.17306 115.43124 115.64502
## age.days 0.00237 0.00004 0.00229 0.00234
## sigma 5.68183 0.03898 5.60821 5.65502
## mean_PPD 124.60814 0.07980 124.45065 124.55478
## log-posterior -30511.93017 1.13368 -30514.93478 -30512.44786
## 50% 75% 97.5%
## (Intercept) 115.76555 115.87943 116.10273
## age.days 0.00237 0.00240 0.00246
## sigma 5.68116 5.70783 5.75952
## mean_PPD 124.60928 124.66202 124.76122
## log-posterior -30511.63759 -30511.10008 -30510.65972
##
## Diagnostics:
## mcse Rhat n_eff
## (Intercept) 0.00291 1.00132 3535
## age.days 0.00000 1.00077 3684
## sigma 0.00073 0.99978 2867
## mean_PPD 0.00149 1.00001 2869
## log-posterior 0.02370 1.00062 2288
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Wow. No p-values, no nothing. We have, in the first block, the mean, SD (assuming a gaussian distribution of posteriors), and some of the quantiles of the parameters. There are two slightly mysterious outputs here. mean_PPD
and log-posterior
.
mean_PPD
is just the mean of the response variable’s posterior density. Compare it to mean(seals$length.cm)
which is 124.6095189. In essence, you can do a nice comparison of the response variable and it’s simulated values.
The log-posterior
is the log of the combined posterior distribution. This is analagous to a likelihood, or rather, a penalized likelihood. It’s used later for model comparison.
We can also visualize coefficients (separating them into multiple plots due to differences in scale)
plot(seal_lm_bayes, par="(Intercept)")
plot(seal_lm_bayes, par="age.days")
plot(seal_lm_bayes, par="sigma")
The second block is diagnostics. It evaluates convergence via the Gelman-Rubin diagnostic - you want a Rhat
of 1, and the effective number of samples you’re working with from your posterior chains given autocorrelation.
If we want to know more about the posteriors, we have to begin to explore the chains themselves. To get the chains, we convert the object into a data frame.
seal_chains <- as.data.frame(seal_lm_bayes)
head(seal_chains)
## (Intercept) age.days sigma
## 1 115.7148 0.002382104 5.689543
## 2 115.7148 0.002382104 5.689543
## 3 115.8711 0.002341696 5.710861
## 4 115.8173 0.002377908 5.693768
## 5 116.0416 0.002305621 5.684965
## 6 115.5561 0.002449215 5.659366
We can now do really interesting things, like, say, as what is the weight of the age coefficient that is less than 0? To get this, we need to know the number of entries in the chain that are <0, and then divide that total by the total length of the chains.
sum(seal_chains$age.days<0)/nrow(seal_chains)
## [1] 0
Oh, that’s 0. Let’s test something more interesting. How much of the PPD of the slope is between 0.00229 and 0.00235?
sum(seal_chains$age.days>0.00229 &
seal_chains$age.days<0.00235) / nrow(seal_chains)
## [1] 0.28225
28.9% - nice chunk. We can also look at some other properties of that chain:
mean(seal_chains$age.days)
## [1] 0.00237105
median(seal_chains$age.days)
## [1] 0.002370685
To get the Highest Posteriod Density Credible Intervals (often called the HPD intervals)
posterior_interval(seal_lm_bayes)
## 5% 95%
## (Intercept) 1.154820e+02 116.04497747
## age.days 2.299774e-03 0.00244315
## sigma 5.619859e+00 5.74688741
Yeah, zero was never in the picture.
This is all well and good, but, how does our model fit? Cn we actually see how well the model fits the data, and how well the data generating process fits the data relative to the overall uncertainty.
To visualize, we have coefficient estimates. We can use good olde ggplot2
along with the geom_abline()
function to overlay a fit onto our data.
library(ggplot2)
#the data
seal_plot <- ggplot(data = seals,
mapping=aes(x = age.days, y = length.cm)) +
geom_point(size=2)
#add the fit
seal_plot +
geom_abline(intercept = coef(seal_lm_bayes)[1], slope = coef(seal_lm_bayes)[2],
color="red")
This is great, but what if we want to see the CI of the fit? Rather than use an area plot, we can actually use the output of the chains to visualize uncertainty. seal_chains
contains simulated slopes and intercepts. Let’s use that.
seal_plot +
geom_abline(intercept = seal_chains[,1], slope = seal_chains[,2], color="grey", alpha=0.6) +
geom_abline(intercept = coef(seal_lm_bayes)[1], slope = coef(seal_lm_bayes)[2], color="red")
We can see the tightness of the fit, and that we have high confidence in the output of our model.
So how to we visualize uncertainty given our large SD of our fit? We can add additional simulated values from posterior_predict
at upper and lower values of our x-axis, and put lines through them.
seal_predict <- posterior_predict(seal_lm_bayes, newdata=data.frame(age.days=c(1000, 8500)))
This produces a 4000 x 2 matrix, each row is one simulation, each column is for one of the new values.
seal_predict <- as.data.frame(seal_predict)
seal_predict$x <- 1000
seal_predict$xend <- 8500
#The full viz
seal_plot +
geom_segment(data = seal_predict,
mapping=aes(x=x, xend=xend, y=V1, yend=V2),
color="lightblue", alpha=0.1)+
geom_abline(intercept = seal_chains[,1], slope = seal_chains[,2], color="darkgrey", alpha=0.6) +
geom_abline(intercept = coef(seal_lm_bayes)[1], slope = coef(seal_lm_bayes)[2], color="red")
We can now see how much of the range of the data is specified by both our data and error generating process. There’s still some data that falls outside of the range, although that’s not surprising given our large sample size.
What if you wanted to try different priors, and assess the influence of your choice? First, let’s see how our current prior relates to our posterior.
posterior_vs_prior(seal_lm_bayes, pars="age.days")
Eh, not much, most likely. Let’ see if we had a different prior on the slope. Maybe a strong prior of a slope of 10. A very strong prior. Our stan_glm
has two places where we can insert a prior - both in the prior
for slope and prior_intercept
for the intercept.
seal_lm_bayes_prior <- stan_glm(length.cm ~ age.days,
data = seals,
family=gaussian(),
prior = normal(10,0.1),
prior_intercept = normal(0, 10))
Note that this took longer to run due to the odd prior. It took a looooong time for our chains to settle into something reasonable.
We can re-visuzliaze the influence of the prior on our posterior - did it make a difference?
posterior_vs_prior(seal_lm_bayes_prior, pars="age.days")
Fist, the relationship between how lean you are and how quickly you lose fat. Implement this to get a sense ot the general workflow for analysis
fat <- read.csv("./data/06/17q04BodyFatHeatLoss Sloan and Keatinge 1973 replica.csv")
#initial visualization to determine if lm is appropriate
fat_plot <- ggplot(data=fat, aes(x=leanness, y=lossrate)) +
geom_point()
fat_plot
#fit the model!
fat_mod <- stan_glm(lossrate ~ leanness,
data = fat,
family=gaussian())
# Inspect chains
plot(fat_mod, plotfun = "stan_trace")
#Inspect Posteriors
plot(fat_mod, show_density=TRUE)
#Inspect Autocorrelation
plot(fat_mod, plotfun = "stan_ac")
#model assumptions
fat_fit <- predict(fat_mod)
fat_res <- residuals(fat_mod)
#fit
qplot(fat_fit, fat_res)
pp_check(fat_mod, check="scatter")
#normality
qqnorm(fat_res)
qqline(fat_res)
pp_check(fat_mod, check="residuals", bins=8)
##match to posterior
pp_check(fat_mod, check="test", test=c("mean", "sd"))
pp_check(fat_mod)
#coefficients
summary(fat_mod, digits=5)
#confidence intervals
posterior_interval(fat_mod)
#visualize
fat_chains <- as.data.frame(fat_mod)
fat_plot +
geom_abline(intercept=fat_chains[,1], slope = fat_chains[,2], alpha=0.1, color="lightgrey") +
geom_abline(intercept=coef(fat_mod)[1], slope = coef(fat_mod)[2], color="red") +
geom_point()
For your first faded example, let’s look at the relationship between DEET and mosquito bites.
deet <- read.csv("./data/06/17q24DEETMosquiteBites.csv")
deet_plot <- ggplot(data=___, aes(x=dose, y=bites)) +
geom_point()
deet_plot
#fit the model!
deet_mod <- stan_glm(___ ~ dose,
data = ____,
family=gaussian())
# Inspect chains
plot(deet_mod, plotfun = "stan_trace")
#Inspect Posteriors
plot(deet_mod, show_density=TRUE)
#Inspect Autocorrelation
plot(___, plotfun = "stan_ac")
#model assumptions
deet_fit <- predict(___)
deet_res <- residuals(___)
#fit
qplot(deet_fit, deet_res)
pp_check(___, check="scatter")
#normality
qqnorm(___)
qqline(___)
pp_check(___, check="residuals", bins=8)
##match to posterior
pp_check(___, check="test", test=c("mean", "sd"))
pp_check(___)
#coefficients
summary(___, digits=5)
#confidence intervals
posterior_interval(___)
#visualize
deet_chains <- as.data.frame(___)
deet_plot +
geom_abline(intercept=deet_chains[,1], slope = deet_chains[,2], alpha=0.1, color="lightgrey") +
geom_abline(intercept=coef(___)[1], slope = coef(___)[2], color="red") +
geom_point()
Do longer lived species also have larger home ranges? Let’s test this!
zoo <- read.csv("./data/06/17q02ZooMortality Clubb and Mason 2003 replica.csv")
zoo_plot <- ggplot(data=___, aes(x=mortality, y=homerange)) +
___()
___
#fit the model!
zoo_mod <- stan_glm(___ ~ ___,
data = ____,
family=___)
# Inspect chains
#Trace of chains
plot(___, plotfun = "___")
#Inspect Posterior Distributions
plot(___, ___=TRUE)
#Inspect Autocorrelation
plot(___, plotfun = "___")
#model assumptions
zoo_fit <- predict(___)
zoo_res <- residuals(___)
#fit versus residuals
#and fit versus observed
qplot(___, ___)
pp_check(___, check="___")
#normality via residuals
qqnorm(___)
qqline(___)
pp_check(___, check="___", bins=8)
##match to posterior
pp_check(___, check="___", test=c("mean", "sd"))
pp_check(___)
#coefficients
summary(___, digits=5)
#confidence intervals
___(___)
#visualize
zoo_chains <- as.data.frame(___)
zoot_plot +
___(___=___[,1], ___ = ___[,2], alpha=0.1, color="lightgrey") +
___(___=coef(___)[1], ___ = coef(___)[2], color="red") +
geom_point()