1. Write a function that takes a vector and returns one bootstrapped sample from said vector. Demonstrate that it works.
2. Write a function that, given a vector of values, a requested number of bootstraps (let’s call the parameter R), and a sample statistic function (e.g., mean, IQR, etc.), returns R values of that statistic. Have it default to R = 1000 and to mean as the statistic function. Show this works for 10 bootstrapped replicate draws of a mean from some vector. Do the values look reasonable? Compare to the actual mean of the vector.
Make sure you are using the function(s) you wrote in #1.
3. Write a function that, given a vector of values, a requested number of bootstraps, and a sample statistic function, returns the original value of the statistic as applied to the vector, the mean of the statistic generated by the bootstrapped reps, the lower and upper bounds of the 95% CI of the bootstrapped statistic (e.g., the 0.025 and 0.975 quantiles), and the bias (i.e., the original value of the statistic - the mean of the bootstrapped statistic).
Make sure you are using the function(s) you wrote in #1 and/or #2.
4. FiveThirtyEight keeps a great archive of poll data at https://projects.fivethirtyeight.com/polls/. The presidential general election polling data is freely available at https://projects.fivethirtyeight.com/polls-page/president_polls.csv with question, poll id, and cycle defining a unique poll.
4a. Download and look at the data. Is it long or wide?
4b. Get just the polling data for this last week (from 9/29 to today). Filter on start_date. Also filter down to just Biden and Trump (see candidate_name or answer). Extra credit for using {lubridate} for this, but you can just do a messy %in% string match.
4c. OK, this is your sample. What’s the bootstrapped average percentage for each candidate for nationwide polls (state == "")? Note, this answer will not match 538 given their weighting by poll trustworthiness.
4d. What is the average difference between the two candidates by state and national polls? Note, you’ll need to make this a wide data frame to answer! And, well, try the pivot without this advice first, but then…
make a unique ID by pasting together the question_id, poll_id, and state. Then select the ID, state, answer, and pct. Also filter out NA diffs.
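For the date filter in 4b, here is a minimal sketch of the {lubridate} route on made-up rows. It assumes start_date arrives as month/day/two-digit-year text (check this against the file you downloaded) and that the week in question falls in 2020. The toy data frame and its values are hypothetical, not real poll data.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows, NOT real poll data; only the column names mirror the CSV
toy_polls <- data.frame(
  start_date = c("9/25/20", "9/29/20", "10/2/20"),
  pct        = c(50, 51, 49),
  stringsAsFactors = FALSE
)

recent <- toy_polls %>%
  mutate(start_date = mdy(start_date)) %>%  # parse text into Date objects
  filter(start_date >= mdy("9/29/20"))      # keep polls starting on or after 9/29

nrow(recent)
```

Once the column is a real Date, the comparison is a plain >= rather than a list of strings to match, which is the main appeal over the %in% approach.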
5. replicate() has been our friend, but we’ve always had to be a little hacky with it. We’ve either had to fold in means, or use tricksy functions like colMeans and the like.
BUT - what’s interesting about replicate() is that, if you ask it to return raw draws from a random number generator - or anything with more than one value - it gives you a matrix or array.
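A small sketch of that behavior, using arbitrary toy sizes (5 replicates of 10 standard normal draws) that are not part of the assignment:

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# One number per replicate: replicate() simplifies to a plain vector
one_value <- replicate(5, mean(rnorm(10)))
is.matrix(one_value)    # FALSE

# Several numbers per replicate: replicate() simplifies to a matrix
many_values <- replicate(5, rnorm(10))
is.matrix(many_values)  # TRUE
```

Which dimension holds the replicates and which holds the draws is left for you to work out in 5a.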
5a. So, I want you to, using the mean and SD of Biden’s national polling average (you’ll need to calculate it!) from above, simulate 1000 draws from that population with a sample size of 50. What are the dimensions of the object? What is in the rows and columns?
5b. Yuck. Can you turn this into something usable? Say, first make it a tibble or data frame, and then pivot it to long, such that you end up with a column that has an identifier for sim and a column with a single value from that sim?
(Oh, and for all columns, cols = everything())
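As a rough illustration of cols = everything(), here is a sketch on a tiny made-up data frame. The column names a and b and the output names "column" and "value" are placeholders, not what your simulation object will contain.

```r
library(tidyr)

# Hypothetical two-column data frame standing in for a wide object
toy <- data.frame(a = c(1, 2), b = c(3, 4))

# everything() selects all columns, so every cell ends up in its own row
toy_long <- pivot_longer(
  toy,
  cols      = everything(),
  names_to  = "column",  # which original column the value came from
  values_to = "value"    # the value itself
)

toy_long
```

The result has one row per original cell, with the old column name carried along as an identifier.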
5c. For each sim, what’s the bootstrapped mean and CI? Plot it! And tell us how often it’s greater than the initial mean. E.C. for the plot showing the stats in order from low to high.
5d. So…. what is that plot showing? What are the concepts involved?
EC: 3 bonus points for each awesome, quality visualization of the general polling data. There is a LOT there, so look carefully before you leap.