For today, we’ll consider data from Brutsaert et al. 2002 looking at how progestrone levels influence respiration at altitude. The data can be downloaded here with progestrone levels and ventilation as a metric of breathing.

1. Create models with different polys

Let’s first look at the data. Plot it, along with a polynomial fit (remember, formula = y ~ poly(x,2) for a quadratic). Then, compare the \(r^2\) value of a linear versus fith order fit. What do you see?

2. Fit each model with 5-fold CV

Does that result hold up, or is it due to overfitting? Let’s evaluate by comparing 5-fold CV scores using RMSE. Let’s do this efficiently, though!

A. Get things ready! Make a 5-fold cross validation tibble using rsample::vfold_cv() and then combine each possible fold with the polynomials 1:5 using tidyr::crossing()

B. Now you have splits and a column of coefficients. Use purr::map2() to make a list column of fit models, where you use the splits and data and the polynomials for you poly() call in the model.

C. Great! Now, calculate the rmse for each fold/polynomial combination as we did in lab.

D. Implications - ok, given that the 5-fold score is the average RMSE across all folds for a given polynomial, show in both a table and figure the relationship between polynomial and out-of-sample RMSE. What does this tell you?

3. Compare models and see how they differ from AIC

That was all well and good, but, how to these results compare to doing this analysis with AIC using the {AICcmodavg} package? Note, you can use dplyr and purrr to not have to fit each model manually.

EC 4. boot::gv.glm()

Let’s try again, for orders 1-5, but this time, let’s do a LOOCV analysis using boot::cv.glm(). Using dplyr and purrr will make things faster and more efficient here - perhaps even with something you created in #3, if you used glm() instead of lm().

Although, if you do that, quick note that you will need to use a map2_*() function with polys in it so that it’s variable can match the . variable used. This may seem like a weird sentence. But, once you get the error that made me realize this, you’ll get it.

5. Grid sample with Bayes

    Last week, we did grid sampling with Likelihood. This week, let’s do it with Bayes!

\[p(H|D) = \frac{p(D|H)p(H)}{p(D)}\]

A. Let’s start with the Palmer Penguins data. Let’s look at just the Gentoo. Why don’t you plot the distribution of the average flipper length of females. We’ll use this data for the exercise. Remember to remove NAs - it will make the rest of the exercise easier. 1 EC for each thing you do to snaz the plot up.

B. OK, this is pretty normal, with a mean of 212.71 and sd of 3.9. Make a grid to search a number of values around that mean and SD, just as you did for likelihood. Let’s say 100 values of each parameter.

C. Write a function that will give you the numerator for any combination of m and s! This is just the same as writing a function for likelihood, but including an additional multiplier of p(H), when putting in the likelihood. Let’s assume a prior for m of dnorm(210, 50) and for s of dunif(1,10) - so, pretty weak!

So, we want p(m, s|flipper length)*p(m)*p(s).

BUT - small problem. These numbers get so vanishingly small, we can no longer do calculations with them. So, for any probability density you use, add log=TRUE and take a sum instead of products or multiplication, as

\[log(p(D|H)p(H)) = log(p(D|H)) + log(p(H))\]

D. Great! Now use this function with your sample grid to get the numerator of the posterior, and then standardize with the p(D) - the sum of all numerators - to get a full posterior. Note, as we’re working in logs, we just subtract log(p(D)) What is the modal estimate of each parameter? How do they compare to the standard frequentist estimate?

Note: log(p(d)) = log(sum(exp(p(D|H)p(H))))

E.C. E. Show me ’dat surface! Make it sing!

E.C. x2 F Compare our weak prior to one with a strong prior. Note, as you progress in this, instead of doing the product of p(D|H)p(H), you might want to do log(p(D|H)) + log(p(H)) as it simplifies calculations. The nice thing is then you just subtract log(p(D)) to get log(p(H|D)) - which you can then safely exponentiate!

6. Final Project Thinking

    We’re at the half-way point in the course, and after the mid-term, it’s time to start thinking about your final project. So…. I want to know a bit about what you’re thinking of!

A. What is the dataset you are thinking of working with? Tell me a bit about what’s in it, and where it comes from.

B. What question do you want to ask of that data set?

EC C. Wanna make a quick visualization of some aspect of the data that might be provocative and interesting?