Midterm and Self-Evaluation
Welcome to your midterm! I hope you enjoy it. Note, in all of the questions below, there are easy, not-so-code-intensive ways of doing things, and there are longer, more involved, yet still workable ways to answer them. I would suggest that before you dive into analyses, you do the following.
First, breathe.
Second, think about the steps you need to execute to answer the question. Write them down.
Third, for those parts of problems that require code, put those steps, in sequence, in comments in your script file. Use those as signposts to step-by-step walk through the things you need to do.
Last, note there are some “impress yourself” questions. You do not need to do them if you don’t want to. But they are there for you to stretch your wings, if you so desire, and get lost in the questions. Or learn something new.
Have fun!
The exam will be due on Nov 17th, 5pm.
1. Sampling your system
Each of you have a study system you work in and a question of interest. Give an example of one variable that you would sample in order to get a sense of its variation in nature. Describe, in detail, how you would sample for the population of that variable in order to understand its distribution. Questions to consider include, but are not limited to: Just what is your sample versus your population? What would your sampling design be? Why would you design it that particular way? What are potential confounding influences of both sampling technique and sample design that you need to be careful to avoid? What statistical distribution might the variable take, and why?
2. Data Reshaping and Visualization
Johns Hopkins has been maintaining one of the best Covid-19 time series data sets out there. The US data, along with information about what is in it, can be found at https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data
2a. Access
Download and read in the data. Can you do it without downloading the file by hand - that is, can you read it directly from the archive?
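If you're hunting for a starting point, here is a minimal sketch. The exact file path within the repository is my assumption - poke around the `csse_covid_19_time_series` folder for the file you actually want:

```r
library(readr)

# raw GitHub URL for the US confirmed-cases time series
# (this particular path is an assumption - check the repo)
covid_url <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/",
  "csse_covid_19_data/csse_covid_19_time_series/",
  "time_series_covid19_confirmed_US.csv"
)

covid <- read_csv(covid_url)
```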
2b. It’s big and wide!
The data is, well, huge. It’s also wide, with dates as columns. Write a workflow that will filter to a state (your choice which one!) and then return a time series (long data where every row is a day) with columns for date, new daily cases, and cumulative cases in that state.
Note, let’s make the date column that emerges a true date object. Let’s say you’ve called it `date_col`. If you mutate it with `mutate(date_col = lubridate::mdy(date_col))`, it will be turned into a date object that has a recognized order. {lubridate} is da bomb, and I’m hoping we have some time to cover it in the future.
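If you want a scaffold, here is one possible shape for that workflow, sketched for Massachusetts. The column name `Province_State` and the mm/dd/yy date columns match what I'd expect from the JHU file, but treat them as assumptions and adjust to what you actually see:

```r
library(dplyr)
library(tidyr)
library(lubridate)

ma_cases <- covid |>
  filter(Province_State == "Massachusetts") |>
  # the date columns look like "1/22/20", so select them by pattern
  pivot_longer(cols = matches("^\\d+/\\d+/\\d+$"),
               names_to = "date_col",
               values_to = "cumulative_cases") |>
  mutate(date_col = mdy(date_col)) |>
  # counties roll up to one row per day for the whole state
  group_by(date_col) |>
  summarize(cumulative_cases = sum(cumulative_cases), .groups = "drop") |>
  # new daily cases = difference between successive cumulative counts
  mutate(new_cases = cumulative_cases - lag(cumulative_cases, default = 0))
```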
Even better - impress yourself (if you want - not required!) and merge it with some other data source to also return cases per 100,000 people.
2c. Let’s get visual!
Great! Make a compelling plot of the time series for Massachusetts! Points for style, class, ease of understanding major trends, etc. Note, 10/10 for yourself only for the most killer figures. Don’t phone it in! Also, note what the data from JHU is. Do you want the cumulative counts, or daily counts, or what? Want to highlight features? Events? Go wild!
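As a bare-bones starting point (nowhere near killer-figure territory, and assuming the `ma_cases` tibble sketched above), something like this gets daily counts with a smoothed trend on the page:

```r
library(ggplot2)

ggplot(ma_cases, aes(x = date_col, y = new_cases)) +
  geom_col(alpha = 0.4) +
  geom_smooth(span = 0.1, se = FALSE) +  # a smoother to pull out the waves
  labs(x = "Date", y = "New daily cases",
       title = "Covid-19 in Massachusetts") +
  theme_minimal()
```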
3. Fit and Evaluate a Linear model
Let’s fit and evaluate a linear model! To motivate this, we’ll look at Burness et al.’s 2013 study “Post-hatch heat warms adult beaks: irreversible physiological plasticity in Japanese quail” (http://rspb.royalsocietypublishing.org/content/280/1767/20131436.short), the data for which they have made available on Data Dryad at http://datadryad.org/resource/doi:10.5061/dryad.gs661. We’ll be doing a slightly modified version of their analysis.
3a. What should I do?
Starting with loading this (new to you) data set for the very first time, what are the steps that you would take to analyze this data? Write them out! Maybe even in comments!
3b. Let’s get started
Load the data. Look at it. Anything interesting? Anything you’d want to watch out for? Or is it OK? Anything you’d change to make working with the data easier? Also, manipulate it a bit - for our analysis, we’re going to want Bird # and the Age of the birds to be categorical variables. Make sure we have variables that will work!
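A minimal sketch of that manipulation, assuming the Dryad file is a CSV with columns named something like `bird` and `age` - swap in the file name and column names you actually see:

```r
library(readr)
library(dplyr)

# file name and column names below are assumptions
quail <- read_csv("morphology_data.csv") |>
  mutate(bird = factor(bird),   # bird # as a categorical ID
         age  = factor(age))    # age treated as categorical
```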
3c. Viz before fit
The model we will fit is one where we want to look at how temperature treatment affects the development of tarsus length over time (age). Visualize this. Make it look good! Is there anything here that would shape how you fit a model? Do you see why we are treating age as categorical?
3d. Fit and evaluate
Fit a model where tarsus length is predicted by age, treatment, their interaction, and a ‘block’ effect, as it were, of bird #. Evaluate the fit. Make any modifications as you see necessary due to your model checks. Note, it’s not going to be perfect (I checked the original - hey, you can too - and they’re on the edge) - but our models are robust, so we’re OK.
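One hedged sketch of that model, assuming columns named `tarsus`, `age`, `treatment`, and `bird` (the latter three as factors, per 3b):

```r
# age * treatment gives both main effects and their interaction;
# bird enters as a categorical 'block' term
quail_lm <- lm(tarsus ~ age * treatment + bird, data = quail)

# one-stop diagnostics for the usual assumption checks
library(performance)
check_model(quail_lm)
```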
3e. Answer the Question
A central question of this paper is - does temperature affect the development of tarsus length? With your fit model in hand, answer the question however you deem fit. Make sure that, alongside any numerical answers you produce, there is at least one stunning figure for a publishable paper!
4. Something Generalized
In their 2011 paper, Stanton-Geddes and Anderson assessed the role of a facultative mutualism between a legume and soil rhizobia in limiting the plant’s range. After publishing, they deposited their data at Dryad http://datadryad.org/resource/doi:10.5061/dryad.8566. As part of their lab experiment, they looked at a variety of plant properties after growing plants with rhizobia from different regions. We want to look at what happened in that experiment during the March 12th sampling event.
4a. Fit a glm
Load the data. Visualize it. Then fit and evaluate a generalized linear model, with your choice of error distribution, looking at the effect of rhizobial region and plant height as measured on March 12 as predictors of the number of leaves on March 12. Does your model meet assumptions? If not, refit with a different distribution. Why did you choose this (or these) error distribution(s)?
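A sketch of one reasonable path, assuming columns named `region`, `height_march12`, and `leaves_march12` (a count of leaves makes Poisson a natural first guess):

```r
leaf_glm <- glm(leaves_march12 ~ region + height_march12,
                data = rhizobia,   # name of the loaded data is an assumption
                family = poisson(link = "log"))

# counts are often overdispersed - check!
library(performance)
check_overdispersion(leaf_glm)

# if overdispersed, a negative binomial is one natural refit
library(MASS)
leaf_nb <- glm.nb(leaves_march12 ~ region + height_march12, data = rhizobia)
```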
4b. Evaluate your treatments
Which rhizobial regions enable more leaves relative to the control? And in what direction?
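`emmeans` makes treatment-versus-control contrasts easy. A sketch, assuming the `leaf_nb` fit above (or `leaf_glm`, if Poisson held up) and that the control is the reference level of `region`:

```r
library(emmeans)

leaf_em <- emmeans(leaf_nb, ~ region)

# compares every region against the reference level;
# use the ref argument if your control is not the first level
contrast(leaf_em, method = "trt.vs.ctrl")
```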
4c. Prediction intervals from a distribution
So, your distribution has quantiles (right? We see these in QQ plots). You can see the values at those quantiles with the `q*` function for a distribution. For example, the central 95% interval of a Poisson with a \(\lambda\) of 5 spans from 1 to 10. You can see this with `qpois(0.025, lambda = 5)` for the lower bound; change 0.025 to 0.975 for the upper. Check this out. Plot the upper and lower bounds of the 95% interval of a Poisson distribution over a range of lambdas from 1 to 30. Do the same for a negative binomial with means from 1 to 30. Note, to translate from a mean ( \(\mu\) ), use a size of 10 (you might have to look at the help file for qnbinom here).
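A sketch of one way to build and plot those bounds:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# 95% interval bounds for a Poisson and a negative binomial (size = 10)
# across means from 1 to 30
bounds <- tibble(mean = 1:30) |>
  mutate(pois_lower = qpois(0.025, lambda = mean),
         pois_upper = qpois(0.975, lambda = mean),
         nb_lower   = qnbinom(0.025, mu = mean, size = 10),
         nb_upper   = qnbinom(0.975, mu = mean, size = 10)) |>
  pivot_longer(-mean,
               names_to = c("dist", "bound"), names_sep = "_")

ggplot(bounds, aes(x = mean, y = value,
                   color = dist, linetype = bound)) +
  geom_line() +
  labs(x = "Mean", y = "Bound of the 95% interval")
```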
4d. Prediction intervals from your model
All right! Armed with this knowledge - one of the frustrating things about `broom::augment()` for glm models is that it doesn’t have an `interval` argument, and it has a couple of quirks. One - you need to see what scale your answer is returned on. We want to look at the response scale - the scale of the data. Two - while you can use `se_fit` to get standard errors, you don’t get a CI per se (although, hey, ~2*se = CI).
AND - we just looked at how, when you have an estimated value, you can get the prediction CI yourself by hand (in the previous part of the question). So, using your creativity, visualize the fit, the 95% fit interval, and the 95% prediction interval at the min, mean, and max of height for each treatment. Geoms can be anything you like! Have fun here!
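A sketch of the mechanics, assuming the Poisson `leaf_glm` from 4a and the column names used there (if you refit with a negative binomial, swap `qpois()` for `qnbinom()` with the fitted size):

```r
library(broom)
library(dplyr)

# new data at the min, mean, and max height within each region
new_dat <- rhizobia |>
  group_by(region) |>
  reframe(height_march12 = c(min(height_march12),
                             mean(height_march12),
                             max(height_march12)))

preds <- augment(leaf_glm, newdata = new_dat,
                 type.predict = "response", se_fit = TRUE) |>
  mutate(fit_lower  = .fitted - 2 * .se.fit,           # ~95% fit interval
         fit_upper  = .fitted + 2 * .se.fit,
         pred_lower = qpois(0.025, lambda = .fitted),  # 95% prediction interval
         pred_upper = qpois(0.975, lambda = .fitted))
```

From `preds`, the visualization is yours - pointranges, ribbons, whatever sings.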
4e. IMPRESS YOURSELF! Prediction intervals from your model
Again, totally optional, but check out the sinterval package, which you’d have to install from GitHub. It takes fit models and uses its two core functions, `add_fitted_sims()` and `add_predicted_sims()`, to get simulated values from fit models using one or the other interval. Do this, and visualize the same thing as above however you’d like (maybe look at `ggdist`?). Or try something new? Perhaps visualize across a range of heights, and not just three?
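If it helps, the skeleton might look roughly like this. The GitHub repo path and the argument order are my assumptions - check the package’s README and help files:

```r
# remotes::install_github("jebyrnes/sinterval")  # repo path is an assumption
library(sinterval)

# argument order is an assumption - see the help files
fit_sims  <- add_fitted_sims(new_dat, leaf_glm)     # simulations of the fit
pred_sims <- add_predicted_sims(new_dat, leaf_glm)  # simulations of new observations
```

From there, `ggdist` geoms like `stat_lineribbon()` are a lovely way to show simulation bands.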
5. Mix it up!
To explore the consequences of random effects, we are going to look at an example from Vonesh, J. R. and Bolker, B. M. (2005). Compensatory larval responses shift trade-offs associated with predator-induced hatching plasticity. Ecology, 86:1580–1591. In the paper, one thing they explore is the effect of tank size, predator presence, and prey density on larval tadpoles. They go on to look at induced plastic defenses in those tadpoles. It’s a cool paper, and worth a look! But it also shows something really interesting about logistic regression models - namely, when we should consider them as mixed models.
5a. Load it up and plot
Load the data with the following code:

```r
library(readr)
library(dplyr)

reedfrogs <- read_delim(
  "https://github.com/rmcelreath/rethinking/raw/master/data/reedfrogs.csv",
  delim = ";") |>
  mutate(tank = 1:n() |> as.character(),
         died = density - surv)
```
To get used to the data, plot survivorship as a function of tank size and predator treatment. Then, to challenge yourself a bit, expand the data to 1s and 0s (lived or died) as follows:

```r
reedfrogs_expanded <- reedfrogs |>
  group_by(tank, size, pred, density) |>
  reframe(status = c(rep(1, surv), rep(0, died))) |>
  ungroup()
```
Now with the expanded data, plot it showing who lived, who died (who tells their story…), and how that corresponds to treatment AND tank. Do you see anything to do with within-tank variability?
5b. Are you over it?
Often, we treat data like this as a binomial logistic regression. We look at each tank as a set of ‘coin flips’ - some living, some dead. Fit a glm here with survivorship determined by pred*size, and then evaluate it for overdispersion. Be careful to weight by the number of individuals stocked in a tank (density)! What do you see when you look at overdispersion?
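A sketch of that glm, with survivorship as a proportion weighted by the number stocked:

```r
frog_glm <- glm(surv / density ~ pred * size,
                data = reedfrogs,
                weights = density,   # individuals stocked per tank
                family = binomial)

library(performance)
check_overdispersion(frog_glm)
```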
5c. Fixed or Random Intercept
One way to fix this problem is a quasi-binomial. But this is unsatisfying, as it’s a post hoc adjustment and not figured into the likelihood. Another is to think about what is happening in each tank. Fit a mixed model version of 5b, and assess its assumptions. How does it compare in overdispersion? Why the difference? What is a random tank effect doing here? And why couldn’t we use a fixed effect of tank with this pred*size model (try it if you don’t believe me - the model will fit, but something will be… off)?
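A sketch of the mixed model, paralleling 5b but with a random intercept for tank:

```r
library(lme4)

frog_glmm <- glmer(surv / density ~ pred * size + (1 | tank),
                   data = reedfrogs,
                   weights = density,
                   family = binomial)

# with one row per tank, the tank intercept acts like an observation-level
# random effect, soaking up the extra-binomial variation
library(performance)
check_overdispersion(frog_glmm)
```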
5d. Changes in Parameters
Now that you have a model with and without a random intercept, how do the estimates of fixed effects change? Why?
5e. Model evaluation
Last, let’s evaluate this model - both with contrasts and a crisp visualization of results. `emmeans`, `modelbased`, and other friends might help you here. Or not! You do you, and extract things from the model with other helpers! There’s no one right way to do this, but make it sing.
5f. IMPRESS YOURSELF! What did we do?
Write out the model equations for the glmm you have fit in 5c. Use LaTeX to write it all out. Explain what your symbols mean in the text.
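As a hedged starting point, one conventional way to write a binomial GLMM with a random tank intercept is below - your exact predictors and indexing may differ:

```latex
\begin{align}
y_i &\sim \mathrm{Binomial}(n_i, p_i) \\
\mathrm{logit}(p_i) &= \beta_0 + \beta_1\,\mathrm{pred}_i + \beta_2\,\mathrm{size}_i
  + \beta_3\,\mathrm{pred}_i \times \mathrm{size}_i + \alpha_{\mathrm{tank}[i]} \\
\alpha_{\mathrm{tank}} &\sim \mathcal{N}\!\left(0, \sigma^2_{\mathrm{tank}}\right)
\end{align}
```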
5g. IMPRESS YOURSELF! Variable slopes
Explore multiple different RE structures beyond a variable intercept. What errors do you get? What do they mean? Can you implement some of the suggested solutions, and what do they tell you about why more complex structures don’t work well here and/or might not be appropriate? Do these different RE structures make a difference for your evaluation of the FE? What RE structure would you use? What if you added density as a covariate/predictor, as in the paper?