0. Some setup

# libraries
library(dplyr)
library(ggplot2)

# set that theme
theme_set(theme_bw(base_size = 12))

1. A purrr-fect simulation

1.1 Replicators Abound

To simulate things, we want to repeat an action again and again. There are a few ways to do this in R. The first is the function replicate().

replicate(n = 5, sum(1:10))
## [1] 55 55 55 55 55

Well, that’s a hair boring, but you can see it summed something 5 times. We can of course create fancier things to do, but, in general, replicate() is a solid workhorse.
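For example, we might take the mean of 10 random draws, five times over (a quick sketch - your values will differ, since the draws are random):

# repeat a random draw-and-summarize five times
replicate(n = 5, mean(rnorm(10)))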

R has a lot of other ways to do repetitive work. For example, there’s the classic for loop, where you iterate over an index. Note the use of {}, just like in a function.

# always start with a pre-allocated object, otherwise you eat memory growing it
my_vec <- rep(NA, 10)

#iterate over indices
for(i in 1:10){
  my_vec[i] <- mean(rnorm(10))
}

my_vec
##  [1]  0.324026730  0.008839435 -0.390083370  0.056193056 -0.037800320
##  [6] -0.482682694 -0.505578944  0.081784930  0.135450244  0.569138830

Or there’s a series of functions that branch off of one called apply(), where you apply a function to the elements of a vector. Common ones include sapply() and lapply(), which work on vectors or lists. These work easily with a predefined function.

mean_rnorm_n <- function(n) mean(rnorm(n))

sapply(1:10, mean_rnorm_n)
##  [1] -1.3495863790  0.6417172832  0.8936823558  0.0001879669 -0.2447177725
##  [6]  0.2178507130  0.0295330627  0.1432629430  0.4088698943  0.0564228468

But - the applies can be tricky. You often don’t know whether you’re going to get a list, a vector, or some other data structure back. For example (a quick sketch), sapply() simplifies to a vector when every result is the same length, but silently hands back a list otherwise:
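# returns a vector - every result has length 1
sapply(1:3, function(x) x^2)
## [1] 1 4 9

# returns a list - the results have different lengths
sapply(1:3, function(x) 1:x)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1 2
##
## [[3]]
## [1] 1 2 3

So, useful for simulation, but……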

1.2 But I like cats…

There is also a beautiful library in R called purrr. This library takes the dplyr logic of mapping a function to many inputs and puts it on a different level. We’ll talk more about purrr at another time, but, for now, there are a few functions worth knowing.

image from Allison Horst https://github.com/allisonhorst/

The core purrr functions are the map_*() functions - basically, a whole family of functions where what goes in the * specifies the output type. For example, if we want to replicate the replicate() action from above, we’d be returning a double and would use map_dbl(). There are two flavors: one where we specify a function, and another where we do it more dynamically with a ~.

library(purrr)

f <- function(x) sum(1:10)

map_dbl(1:5, f)
## [1] 55 55 55 55 55
map_dbl(1:5, ~sum(1:10))
## [1] 55 55 55 55 55

Note our use of ~ to say “Hey, we’re about to call some stuff.” Very useful if you want to skip writing a function (notice that the function HAS to have an argument, even if, like f above, it never uses it).

Other useful versions of map include map_chr() for character returns, map_df() for a data frame, map_lgl() for TRUE/FALSE, and just plain olde map(), which returns a list of whatever you want.
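Just to give a quick flavor of those output types (a small sketch, using deterministic functions so the output is predictable):

map_chr(1:3, as.character)
## [1] "1" "2" "3"
map_lgl(c(1, NA, 3), is.na)
## [1] FALSE  TRUE FALSE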

Let’s see map_df() in action, as we’re going to use it today. Let’s say we want to take some input and return a data frame with the input and the sum of all numbers up to that input. To do so, we can write a function - or, if we skip the function and use ~, we need to know that map functions call the individual argument being sent to the function .x.

get_new_df <- function(x){
  return(data.frame(x = x, 
                    y = sum(1:x)))
}

map_df(1:10, get_new_df)
##     x  y
## 1   1  1
## 2   2  3
## 3   3  6
## 4   4 10
## 5   5 15
## 6   6 21
## 7   7 28
## 8   8 36
## 9   9 45
## 10 10 55
map_df(1:10, ~data.frame(x = .x, y = sum(1:.x)))
##     x  y
## 1   1  1
## 2   2  3
## 3   3  6
## 4   4 10
## 5   5 15
## 6   6 21
## 7   7 28
## 8   8 36
## 9   9 45
## 10 10 55

Nifty! What if we wanted to return not just the final sum, but the running cumulative sums of all numbers up to that point?

map_df(1:10, ~data.frame(x = .x, y = cumsum(1:.x)))
##     x  y
## 1   1  1
## 2   2  1
## 3   2  3
## 4   3  1
## 5   3  3
## 6   3  6
## 7   4  1
## 8   4  3
## 9   4  6
## 10  4 10
## 11  5  1
## 12  5  3
## 13  5  6
## 14  5 10
## 15  5 15
## 16  6  1
## 17  6  3
## 18  6  6
## 19  6 10
## 20  6 15
## 21  6 21
## 22  7  1
## 23  7  3
## 24  7  6
## 25  7 10
## 26  7 15
## 27  7 21
## 28  7 28
## 29  8  1
## 30  8  3
## 31  8  6
## 32  8 10
## 33  8 15
## 34  8 21
## 35  8 28
## 36  8 36
## 37  9  1
## 38  9  3
## 39  9  6
## 40  9 10
## 41  9 15
## 42  9 21
## 43  9 28
## 44  9 36
## 45  9 45
## 46 10  1
## 47 10  3
## 48 10  6
## 49 10 10
## 50 10 15
## 51 10 21
## 52 10 28
## 53 10 36
## 54 10 45
## 55 10 55

Exercises

  1. Use replicate() to repeatedly average the numbers 1:10 seven times.

  2. Do the same thing with map_dbl() - try it both with a function and with a ~

  3. What happens if you use other map functions?

  4. Start with a vector:

my_vec <- c(11, 10, 10, 9, 10, 11, 10, 9, 10, 
            12, 9, 11, 8, 11, 11, 10, 11, 10, 
            11, 9)

Use map_df() to make a data frame that, for the numbers 3 through 15, returns two columns. One is the average of the elements of the vector from 1 through the chosen number; the second is the standard deviation.

e.g., mean(my_vec[1:10]) is 10.2 - but the 10 will be .x for you!

2. Simulation and Sample Size

2.1 Random Number Generation

There are a number of ways to get random numbers in R from a variety of distributions. For our simulations, let’s start with a sample of 40 individuals drawn from a population with a mean value of 10 and an SD of 3.

set.seed(323)
samp <- rnorm(n = 40, mean = 10, sd = 3)

Note the set.seed() function. Random number generators on computers aren’t REALLY random - they use some nugget of information that changes constantly to be pseudo-random. By setting that nugget - the seed - we make sure that all random numbers generated from this point on are the same every time the code is run.

Unsettling, no?
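To see the seed in action, run the following (a quick sketch - 42 is an arbitrary seed):

set.seed(42)
rnorm(3)

set.seed(42)
rnorm(3) # the exact same three numbers as above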

There are a wide variety of different random number generators in R. rnorm() is exceedingly common, but we also have others. For example, let’s flip a coin 30 times and see how many heads we get if there’s a 50:50 chance of getting heads. And let’s do it 10 times!

rbinom(10, size = 30, prob = 0.5)
##  [1] 10 18 16 11 16 17 14 14 15 17

We can also draw random numbers from a uniform distribution (all numbers are equally likely to be drawn) with a minimum and maximum bound.

runif(10, min = -1, max = 1)
##  [1]  0.06789064 -0.41526715 -0.16290617  0.87500096  0.82789850  0.86382277
##  [7] -0.88316031  0.81005091  0.99624116  0.41923706

So from this, we can draw a random sample from a population with a distribution of our choice!
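For example (one more sketch of a generator), if we wanted whole-number counts, we could draw from a Poisson distribution:

# 10 random counts with a mean (lambda) of 4
rpois(10, lambda = 4)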

2.2 Using rowwise() and map_dbl() for simulation

OK, let’s begin our journey towards figuring out what sample size we want. We now know how to generate random measurements from an assumed population. The other tool we’ll need to simulate sampling a population is somewhere to put our output. Let’s go through this step by step. First, let’s say we are going to try out sample sizes from 3 through 50 for a population with a mean of 45 and a standard deviation of 20. To start, we’ll need to set up a data frame with the different sample sizes we want and some parameters we can use.

# our assumptions
mean_pop <- 45
sd_pop <- 20

# a data frame with the sample sizes 3 through 50
samp_sim <- data.frame(samp_size = 3:50)

So, how do we create those replicate populations? Let’s say we wanted to make just one simulated draw for each sample size. For that, we need to 1) look at each sample size, and then 2) draw a random sample. For the former, we need a function that goes row by row: dplyr::rowwise(). We feed it samp_size as an argument so that the variable is preserved in the resulting data frame. (We could also have done dplyr::group_by(samp_size), but I often find rowwise() more intuitive in this case.) For the latter, we already have rnorm() - and we wrap it in summarize(), as summarize() lets us create outputs that are not the same size as the input. Note, this is new as of dplyr 1.0, and I could not be more excited.

Let’s do this and visualize the results.

samp_sim_1 <- samp_sim |>
  rowwise(samp_size) |>
  summarize(samp = rnorm(samp_size, mean_pop, sd_pop))

plot(samp ~ samp_size, data = samp_sim_1)

OK, that’s nice. I have a lot of samples at each sample size. You can see that as our sample size increases, there are, indeed, more points. However, we want to simulate a lot of sampling events and calculate sample statistics. This is where map_dbl() or replicate() becomes very useful. Let’s say, instead of just drawing from a normal population, we want to modify the above to calculate a mean. And do it, oh, let’s say twice.

One thing that will make this more readable and easier to debug is to use a function.

# one sim
get_one_pop_mean <- function(samp_size,
                             pop_mean,
                             pop_sd) {
  rnorm(samp_size,
        pop_mean,
        pop_sd) |>
    mean()
}

#now a function to repeat the action
get_sim_means <- function(samp_size,
                          pop_mean,
                          pop_sd,
                          reps) {
  map_dbl(1:reps,
          ~ get_one_pop_mean(samp_size,
                             pop_mean,
                             pop_sd))
}

samp_sim_means_2 <- samp_sim %>%
  rowwise(samp_size) %>%
  summarize(samp_mean = get_sim_means(samp_size, 
                                        mean_pop, 
                                        sd_pop,
                                        reps = 2))

samp_sim_means_2
## # A tibble: 96 × 2
## # Groups:   samp_size [48]
##    samp_size samp_mean
##        <int>     <dbl>
##  1         3      12.9
##  2         3      71.2
##  3         4      45.6
##  4         4      54.8
##  5         5      55.1
##  6         5      54.4
##  7         6      55.0
##  8         6      52.4
##  9         7      45.0
## 10         7      32.7
## # … with 86 more rows

Oh! Now we’re getting somewhere. Before we fully sally forth, let’s outline what we’re doing. You’ll see it’s a common structure for simulations.

# write a function that does one simulation for one set of parameters
# write a function that does repeated simulations for parameters
# Take a data frame with a bunch of sample sizes
# For each row
# Feed the parameters from the row to your function

Reasonable, no? You could have written this outline out yourself before any code. Now, let’s operationalize it for 1000 simulations.

# write a function that does...
# oh - wait - we already did that!

# Take a data frame with a bunch of sample sizes
sim_results <- samp_sim %>%

  # For each row
  rowwise(samp_size) %>%
  
  # Replicate calculating the mean from a random draw some number of times
  summarize(samp_mean = get_sim_means(samp_size, 
                                        mean_pop, 
                                        sd_pop, 
                                        reps = 1e3))

What’s great about this (and using a map function) is that we can also add additional information - say, the SD - using map_df(). Note how we don’t even use = in summarize().

# write a function that does one simulation for one set of parameters
get_1_sim_df <- function(sim_num = 1,
                samp_size,
                mean_pop,
                sd_pop){
  
  samp <- rnorm(samp_size,
                mean_pop,
                sd_pop)
  
  data.frame(sim = sim_num,
             samp_mean = mean(samp),
             samp_sd = sd(samp))
}

# write a function that does repeated simulations for parameters
run_pop_sims <- function(samp_size,
                mean_pop,
                sd_pop,
                n_sims = 5){
  
  #run the simulations
  map_df(1:n_sims,
          ~get_1_sim_df(sim_num = .x,
                samp_size,
                mean_pop,
                sd_pop))
}

# Take a data frame with a bunch of sample sizes
sim_results <- samp_sim %>%
  
  # For each row
  rowwise(samp_size) %>%
  
  # Replicate calculating the mean
  # and SD from a random draw some number of times
  summarize(run_pop_sims(samp_size,
                         mean_pop,
                         sd_pop,
                         n_sims = 1e3))
            
sim_results
## # A tibble: 48,000 × 4
## # Groups:   samp_size [48]
##    samp_size   sim samp_mean samp_sd
##        <int> <int>     <dbl>   <dbl>
##  1         3     1      30.7   20.2 
##  2         3     2      50.9    5.32
##  3         3     3      59.1   20.7 
##  4         3     4      44.5   18.4 
##  5         3     5      57.8   17.0 
##  6         3     6      33.7   18.1 
##  7         3     7      64.3   28.8 
##  8         3     8      51.8   40.9 
##  9         3     9      35.4   20.1 
## 10         3    10      42.6   32.5 
## # … with 47,990 more rows

2.3 Faded Examples.

Let’s try this out, and have you fill in what’s missing in these faded examples.

#Some preparatory material
set.seed(42)
mean_pop <- 10
sd_pop <- 3
nsim <- 100
samp_sim <- data.frame(samp_size = 3:50)

#Mean simulations
samp_sim %>%
  rowwise(samp_size) %>%
  summarize(samp_mean = 
              replicate(nsim, 
                        rnorm(samp_size, mean_pop, sd_pop) %>% mean()))


#Now the faded examples! Fill in the ___

#Median simulations
samp_sim %>%
  rowwise(samp_size) %>%
  summarize(samp_median = 
              ____(nsim, 
                        rnorm(samp_size, mean_pop, sd_pop) %>% median()))

#SD simulations
samp_sim %>%
  rowwise(samp_size) %>%
  ____(samp_sd = 
              ____(nsim, 
                        ____(samp_size, mean_pop, ____) %>% sd()))

  
  
  
#IQR simulations
#function for interquartile range is IQR
samp_sim %>%
  ____(____) %>%
  ____(samp_iqr = 
              ____(nsim, 
                        ____(____, mean_pop, ____) %>% IQR()))

2.4 Determining Optimal Sample Size with plot and summarize

Great, so we’ve done the simulation! How do we determine sample size? The first way is a plot.

ggplot(sim_results,
       mapping = aes(x = samp_size,
                     y = samp_mean)) +
  geom_point(alpha = 0.1) +
  theme_bw()

We can eyeball this result and see a leveling out at sample sizes of 20 or more. OK, that’s great, but… totally imprecise.

Better might be to see where the SD of the mean levels off. Let’s pseudocode that out in comments to see what we might need to do.

# Take our sim_results data
# Group it by sample size
# Take the SD of the sample means
# ungroup the resulting data
# Sort it by SD
# Look at the top 5 entries.

A few new things here. First, we’re grouping by sample size, not sim number. Second, we’re summarizing - reducing our data down, boiling it to its essence. For this, dplyr has a function called - wait for it - summarize(). Third, we’re going to have to do some arranging, with a dplyr function called - and this is a shocker - arrange() (sort was already taken by the base package). OK, let’s walk through the resulting pipeline in two pieces. First, the summarization.

# Take our sim_results data
sampSim_summary <- sim_results %>%
  # Group it by sample size
  group_by(samp_size) %>%
  # Take the SD of the sample means
  summarize(pop_mean_sd = sd(samp_mean))

Neat - this gives us a much reduced piece of data to work with. But - what about the arranging?

sampSim_summary <- sampSim_summary %>%
  # Sort it by SD
  arrange(pop_mean_sd)

sampSim_summary
## # A tibble: 48 × 2
##    samp_size pop_mean_sd
##        <int>       <dbl>
##  1        49        2.88
##  2        46        2.89
##  3        50        2.94
##  4        48        2.95
##  5        45        2.96
##  6        47        2.96
##  7        44        2.98
##  8        43        3.01
##  9        42        3.04
## 10        40        3.16
## # … with 38 more rows

Now, notice that I don’t have to use head() or chop off the top few rows (there’s a dplyr function for that - slice()). That’s because dplyr creates a data frame-like object called a tibble. We’ll talk about tibbles in more detail later. For now, they work just like data frames, but print out a little differently. If we REALLY wanted a data frame, we could just use as.data.frame().
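If you did want just those top 5 entries from the pseudocode above, slice() would do it:

# Look at the top 5 entries
sampSim_summary %>%
  slice(1:5)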

As for the result, we can see that, eh, something in the 40s has the lowest SD, but things bounce around. We could also plot this.

ggplot(sampSim_summary,
       mapping = aes(x = samp_size, y = pop_mean_sd)) +
  geom_point() +
  geom_line() +
  theme_classic()

So, what is acceptable? At what point does adding another sample begin to have diminishing returns?
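One way to put a number on those diminishing returns (a quick sketch using dplyr’s lag()) is to look at how much the SD of the mean drops with each added sample:

sampSim_summary %>%
  # put things back in sample size order
  arrange(samp_size) %>%
  # how much did the SD drop relative to one fewer sample?
  mutate(improvement = lag(pop_mean_sd) - pop_mean_sd)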

2.5 Exercises.

  1. Look at the resulting simulations using the sample SD. Would you choose a different optimal sample size?

  2. Repeat this whole process, but for the sample median. What is the optimal sample size? Feel free to reuse code or modify code in place (for the simulations).

3. Bootstrapped Standard Errors

3.1 Basic Sample Properties

So, as we launch into things like standard errors, I wanted to pause and hit a few key functions we need to describe a sample. Let’s use the samp vector from above as a data source. First, mean, sd, and median.

mean(samp)
## [1] 10.28201
sd(samp)
## [1] 2.881881
median(samp)
## [1] 10.0977

This is pretty great. There are a ton of functions like these - such as the interquartile range.

IQR(samp)
## [1] 4.15341

Note, though, that there is no built-in standard error of the mean or confidence interval function (there are functions for skew, kurtosis, and more in other packages). Last, for arbitrary percentiles, there’s the quantile() function.

#quartiles by default
quantile(samp)
##        0%       25%       50%       75%      100% 
##  5.648247  8.072854 10.097696 12.226264 18.916196
#or, get what you want
quantile(samp, probs = c(0.05, 0.1, 0.5, 0.9, 0.95))
##        5%       10%       50%       90%       95% 
##  6.264685  7.177671 10.097696 13.676815 14.484173

You can get individual values by specifying the percentile you want with probs.
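And since, as noted above, there’s no built-in standard error of the mean function - well, the SE of the mean is just the standard deviation divided by the square root of the sample size, so writing one is a one-liner (se_mean() here is a name of our own invention, not a base R function):

# standard error of the mean: sd / sqrt(n)
se_mean <- function(x) sd(x) / sqrt(length(x))

se_mean(samp)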

3.2 The Bootstrap

OK, but what about bootstrapped standard errors? Or, you know, bootstrapping in general. Let’s start by generating a ‘sample’ with a sample size of 40.

set.seed(2020)
samp <- rnorm(n = 40, mean = 10, sd = 3)

OK, how do we get one bootstrap sample from this? Why, with sample() of course. We give sample() a few arguments about how big of a sample we want, as well as whether we want to sample with or without replacement.

one_boot <- sample(samp, size = length(samp), replace = TRUE)
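Because we sampled with replacement, one_boot is the same size as samp, but some values appear more than once while others got left out entirely. A quick sanity check (just a sketch):

# same size as the original sample
length(one_boot)

# how many original values did not make it into the bootstrap sample?
sum(!(samp %in% one_boot))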

3.3 Bootstrapped SEs

Let’s look at the bootstrapped standard error of a median from our samp vector. We’re going to do it using the same techniques of simulation - and in particular with replicate(). So, let’s replicate taking the median of a sample with replacement of the samp vector. Then we can get the standard deviation of that.

median_sims <- replicate(n = 100, 
                         median(sample(samp,length(samp), replace=TRUE)))

sd(median_sims)
## [1] 0.3784669

Simple, no? Now… can you think about how you might use this in the context of trying different sample sizes? Maybe a function of your own can help you along the way.

one_boot_median <- function(samp){
  sample(samp, 
         size = length(samp),
         replace = TRUE) |>
    median()
}

median_from_boot <- function(n, pop_mean = 0, pop_sd = 1,
                             n_boot = 1000){
  
  # draw a sample
  samp <- rnorm(n, mean = pop_mean, sd = pop_sd)
  
  # get a distribution of bootstrapped medians
  sims <- replicate(n_boot, one_boot_median(samp))
  
  # return the sims
  sims
}

# hey, it's the se from bootstrapping for ONE sample
# drawn from a population at n = 4
median_from_boot(4) |> sd()
## [1] 0.2554226

3.4 Exercise! How does sample size change bootstrapped SE?

OK, you’ve done simulations from populations. Now, let’s mesh those bootstraps with our concepts of simulation at different sample sizes and… plot out how the bootstrapped SE of the IQR changes at different sample sizes.

  1. First, make sure you feel comfortable calculating the bootstrapped SE of the IQR from samp. Repeat what we did above with IQR instead of median.

  2. Now, write out in comments what you will do to end up with a data frame that has a column of sample sizes and a column of IQRs calculated from sampling our samp vector.

  3. Code it!

  4. Now, write out in comments how you would go from that data frame to one that has the SE for the IQR at each sample size.