Sampling distributions, dplyr, and simulation

So you want to simulate?

dplyr is fantastic for simulations. By setting up a set of parameter values, or even just simulation numbers, we can repeat a random draw over and over again. For a trivial example, let’s use dplyr to get a column of random numbers. This will be convoluted, but you’ll see where I’m going with this in a second…

library(dplyr)
library(ggplot2)

sim_column <- tibble(sims = 1:10) |>
  group_by(sims) |>
  mutate(rand = runif(1, min = 0, max = 10)) |>
  ungroup()

sim_column
# A tibble: 10 × 2
    sims  rand
   <int> <dbl>
 1     1 3.07 
 2     2 8.42 
 3     3 1.86 
 4     4 8.51 
 5     5 7.02 
 6     6 9.47 
 7     7 8.01 
 8     8 1.24 
 9     9 0.572
10    10 8.27 

OK, this is totally trivial, in that we could have just created a second column with runif(10, 0, 10) and gotten the same result. BUT - note how here we create simulations with the sims variable, and then we group and ungroup on it? This allows us to keep track of simulations throughout - something that will become very powerful as we move forward. You could use rowwise() instead if you didn’t want to keep track of simulations, but you’ll often find it convenient to do so.
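For reference, here’s a sketch of that rowwise() alternative. Same result, one random draw per row, just without an explicit grouping variable carried through:

```r
library(dplyr)

# rowwise() treats each row as its own group, so runif(1)
# is evaluated once per row - just like group_by(sims) above
rw_column <- tibble(sims = 1:10) |>
  rowwise() |>
  mutate(rand = runif(1, min = 0, max = 10)) |>
  ungroup()

rw_column
```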

Simulating Sampling Distributions

So how can we use this to simulate sampling? Let’s say we wanted to simulate drawing random samples from a population that was normally distributed. Let’s say our population mean is 10 with a sd of 5. We want 1000 means from samples of n = 10, and then to plot the sampling distribution of those means.

This isn’t so bad! We can again create a tibble with a sims column, and then just mutate away!

# some parameters
n <- 10
m <- 10
s <- 5

mean_sims <- tibble(sims = 1:1000) |>
  group_by(sims) |>
  mutate(sample_mean = rnorm(n, mean = m, sd = s) |> 
           mean()) |>
  ungroup()

mean_sims
# A tibble: 1,000 × 2
    sims sample_mean
   <int>       <dbl>
 1     1        9.73
 2     2       10.1 
 3     3       13.8 
 4     4        9.40
 5     5       11.8 
 6     6        8.01
 7     7       14.2 
 8     8        9.97
 9     9        8.20
10    10       12.2 
# ℹ 990 more rows

Great! We have our tibble of simulated sample means, and we can plot.

ggplot(data = mean_sims,
       mapping = aes(x = sample_mean)) +
  geom_histogram(bins = 200, fill = "darkorange")

We can also get our SE of the mean.

sd(mean_sims$sample_mean)
[1] 1.557361

EXERCISE Try getting the sampling distribution of means with n = 10 from a uniform distribution. Is the sampling distribution normal? If you up the number of simulations, does it make it easier to see?

Getting the Sampling Distribution of Multiple Parameters

That’s cool and all, but what if we want to get the mean AND the standard deviation? We could do as above, calculating both a mean and an sd, but each would be computed on a re-randomized set of data. When we have a column identifying simulations, it’s usually because we want everything generated within a simulation to use the same data - the same stochastic pull of data for each calculation. To do that, we need a two-step process.

  1. For each simulation, generate a set of random data.
  2. Calculate derived sample statistics on that data.

So, how do we make a data set per simulation? Two ideas come to mind…

tibble(sims = 1:5) |>
  group_by(sims) |>
  mutate(sample = rnorm(10))


tibble(sims = 1:5) |>
  group_by(sims) |>
  summarize(sample = rnorm(10))

The top example throws an error (try it), as mutate() must return the same number of rows it receives. The second works, but we get a deprecation warning: when we’re making data with a new number of rows, instead of summarize() we should use reframe(). This is a great function in dplyr, as it allows us to expand our data frame when the return value from a function has multiple rows. Let’s see it in action to simulate data using the same parameters as before.

sample_sims <- tibble(sims = 1:1000) |>
  group_by(sims) |>
  reframe(sample = rnorm(n, mean = m, sd = s)) |>
  ungroup()

sample_sims
# A tibble: 10,000 × 2
    sims sample
   <int>  <dbl>
 1     1  9.09 
 2     1 15.7  
 3     1 10.1  
 4     1  6.38 
 5     1 12.8  
 6     1 12.3  
 7     1 11.7  
 8     1 11.8  
 9     1 -0.298
10     1  7.41 
# ℹ 9,990 more rows

Great! Now we have data, and we can then calculate properties from each sample!

sample_properties <- sample_sims |>
  group_by(sims) |>
  summarize(sample_mean = mean(sample),
            sample_sd = sd(sample),
            median = median(sample))

sample_properties
# A tibble: 1,000 × 4
    sims sample_mean sample_sd median
   <int>       <dbl>     <dbl>  <dbl>
 1     1        9.70      4.44  10.9 
 2     2       10.0       6.32  10.9 
 3     3        8.44      4.89   6.86
 4     4       10.1       4.04   9.79
 5     5        8.57      6.71   8.31
 6     6       12.5       6.16  13.3 
 7     7        8.79      5.19   9.19
 8     8        8.85      4.33  10.1 
 9     9        9.95      6.48  10.5 
10    10        9.38      4.67  10.0 
# ℹ 990 more rows

EXERCISE Plot the distributions of the properties. What do they look like? Now repeat the sample simulations and properties for a uniform distribution. Do the resulting distributions look different?

Sample Size and SE

To take this all one step further, we can also look at the effect of sample size on the precision of our estimate of the mean. Doing so with dplyr is a snap. We can still group by simulations, but also add in a sample size parameter. To make a full simulation frame, we can use tidyr::crossing(), which creates all possible combinations of vectors and turns them into a data frame.

library(tidyr)

#for example
crossing(x = 1:3, y = 7:9)
# A tibble: 9 × 2
      x     y
  <int> <int>
1     1     7
2     1     8
3     1     9
4     2     7
5     2     8
6     2     9
7     3     7
8     3     8
9     3     9
# our simulations
sims_frame <- crossing(sims = 1:1000, n = c(3,5,7,9)) |>
  group_by(sims, n) |>
  mutate(sample_mean = rnorm(n, mean = m, sd = s) |> mean())|>
  ungroup()

We can then look at the effect of sample size on the standard error of the mean by getting the SD of the simulated means for each sample size and plotting.

sims_frame |>
  group_by(n) |>
  summarize(se = sd(sample_mean)) |>
  
  #oh, piping right into a ggplot!
  ggplot(aes(x = n, y = se)) +
  geom_line() +
  geom_point()

If we do this for many many sample sizes, we can generate a curve and see if there is some sample size where the SE levels off, or find a place where we are comfortable with the n versus se tradeoff.
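As a sketch of that idea (the finer sequence of sample sizes here is my own choice, not from the text above), assuming m and s as defined earlier:

```r
library(dplyr)
library(tidyr)

# population parameters, as before
m <- 10
s <- 5

# one simulated mean per sim x sample-size combination,
# over a finer range of sample sizes
se_by_n <- crossing(sims = 1:1000, n = seq(3, 30, by = 3)) |>
  group_by(sims, n) |>
  mutate(sample_mean = rnorm(n, mean = m, sd = s) |> mean()) |>
  ungroup() |>
  group_by(n) |>
  summarize(se = sd(sample_mean))

# the SE should fall off roughly as s / sqrt(n)
se_by_n
```

Plotting se against n (as above) shows the curve flattening as n grows, which is where the n versus SE tradeoff lives.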

Bootstrap resampling

This is all well and good if we’re pulling from a theoretical distribution. But, what’s this bootstrap resampling we hear about? Quite simply, it’s sampling from a sample with replacement. Rinse and repeat this many many times, and you can calculate bootstrapped statistics. We do this primarily with the sample() function. For example:

vec <- 1:10

sample(vec, replace = TRUE)
 [1] 6 7 7 1 9 6 6 2 3 7

So, if we want to get the bootstrapped SE of a sample, we can use sample() instead of rnorm() in our workflow.

# The OG Sample
my_samp <- rnorm(10, mean = m, sd = s) 

boot_frame <- tibble(sims = 1:1000) |>
  group_by(sims) |>
  mutate(boot_mean = sample(my_samp, replace = TRUE) |>
           mean()) |>
  ungroup()

So, how does the bootstrapped SE compare to the classical estimate, sd / sqrt(n)?

sd(boot_frame$boot_mean)
[1] 1.739233
sd(my_samp) / sqrt(10)
[1] 1.787256

Not bad!