Sampling distributions, dplyr, and simulation

So you want to simulate?

dplyr is fantastic for simulations. By setting up a set of parameter values, or even just simulation numbers, we can repeat a random draw over and over again. For a trivial example, let’s use dplyr to get a column of random numbers. This will be convoluted, but you’ll see where I’m going with this in a second…

library(dplyr)
library(ggplot2)

sim_column <- tibble(sims = 1:10) |>
  group_by(sims) |>
  mutate(rand = runif(1, min = 0, max = 10)) |>
  ungroup()

sim_column
# A tibble: 10 × 2
    sims  rand
   <int> <dbl>
 1     1 3.07 
 2     2 8.42 
 3     3 1.86 
 4     4 8.51 
 5     5 7.02 
 6     6 9.47 
 7     7 8.01 
 8     8 1.24 
 9     9 0.572
10    10 8.27 

OK, this is totally trivial, in that we could have just created a second column with runif(10, 0, 10) and gotten the same result. BUT - note how here we create simulations with the sims variable, and then we group and ungroup on it? This allows us to keep track of simulations throughout - something that will become very powerful as we move forward. You could use rowwise() instead if you didn’t want to keep track of simulations, but you’ll often find it convenient to do so.
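For reference, here’s a sketch of that rowwise() alternative. Same result, one random draw per row, just without an explicit grouping variable carried through:

```r
library(dplyr)

# rowwise() treats each row as its own group, so runif(1)
# is evaluated once per row - just like group_by(sims) above
rw_column <- tibble(sims = 1:10) |>
  rowwise() |>
  mutate(rand = runif(1, min = 0, max = 10)) |>
  ungroup()

rw_column
```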

Simulating Sampling Distributions

So how can we use this to simulate sampling? Let’s say we wanted to simulate drawing random samples from a population that was normally distributed. Let’s say our population mean is 10 with a sd of 5. We want 1000 means from samples of n = 10, and then to plot the sampling distribution of those means.

This isn’t so bad! We can again create a tibble with a sims column, and then just mutate away!

# some parameters
n <- 10
m <- 10
s <- 5

mean_sims <- tibble(sims = 1:1000) |>
  group_by(sims) |>
  mutate(sample_mean = rnorm(n, mean = m, sd = s) |> 
           mean()) |>
  ungroup()

mean_sims
# A tibble: 1,000 × 2
    sims sample_mean
   <int>       <dbl>
 1     1        9.73
 2     2       10.1 
 3     3       13.8 
 4     4        9.40
 5     5       11.8 
 6     6        8.01
 7     7       14.2 
 8     8        9.97
 9     9        8.20
10    10       12.2 
# ℹ 990 more rows

Great! We have our tibble of simulated sample means, and we can plot.

ggplot(data = mean_sims,
       mapping = aes(x = sample_mean)) +
  geom_histogram(bins = 200, fill = "darkorange")

We can also get our SE of the mean.

sd(mean_sims$sample_mean)
[1] 1.557361

EXERCISE Try getting the sampling distribution of means with n = 10 from a uniform distribution. Is the sampling distribution normal? If you up the number of simulations, does it make it easier to see?

Getting the Sampling Distribution of Multiple Parameters

That’s cool and all, but what if we want to get the mean AND the standard deviation? We could do as above, calculating both a mean and an sd, but each would be computed on a re-randomized set of data. When we have a column identifying simulations, it’s usually because we want everything generated within a simulation to use the same data - the same stochastic pull of data for each calculation. To do that, we need a two-step process.

  1. For each simulation, generate a set of random data.
  2. Calculate derived sample statistics on that data.

So, how do we make a data set per simulation? Two ideas come to mind…

tibble(sims = 1:5) |>
  group_by(sims) |>
  mutate(sample = rnorm(10))


tibble(sims = 1:5) |>
  group_by(sims) |>
  summarize(sample = rnorm(10))

The top example throws an error (try it), as mutate() must return the same number of rows it receives. The second works, but we get a deprecation warning: when we’re making data with a new number of rows, instead of summarize() we should use reframe(). This is a great function in dplyr, as it allows us to expand our data frame when the return value from a function has multiple rows. Let’s see it in action to simulate data using the same parameters as before.

sample_sims <- tibble(sims = 1:1000) |>
  group_by(sims) |>
  reframe(sample = rnorm(n, mean = m, sd = s)) |>
  ungroup()

sample_sims
# A tibble: 10,000 × 2
    sims sample
   <int>  <dbl>
 1     1  9.09 
 2     1 15.7  
 3     1 10.1  
 4     1  6.38 
 5     1 12.8  
 6     1 12.3  
 7     1 11.7  
 8     1 11.8  
 9     1 -0.298
10     1  7.41 
# ℹ 9,990 more rows

Great! Now we have data, and we can then calculate properties from each sample!

sample_properties <- sample_sims |>
  group_by(sims) |>
  summarize(sample_mean = mean(sample),
            sample_sd = sd(sample),
            median = median(sample))

sample_properties
# A tibble: 1,000 × 4
    sims sample_mean sample_sd median
   <int>       <dbl>     <dbl>  <dbl>
 1     1        9.70      4.44  10.9 
 2     2       10.0       6.32  10.9 
 3     3        8.44      4.89   6.86
 4     4       10.1       4.04   9.79
 5     5        8.57      6.71   8.31
 6     6       12.5       6.16  13.3 
 7     7        8.79      5.19   9.19
 8     8        8.85      4.33  10.1 
 9     9        9.95      6.48  10.5 
10    10        9.38      4.67  10.0 
# ℹ 990 more rows

EXERCISE Plot the distributions of the properties. What do they look like? Now repeat the sample simulations and properties for a uniform distribution. Do the resulting distributions look different?

Sample Size and SE

To take this all one step further, we can also look at the effect of sample size on the precision of our estimate of the mean. Doing so with dplyr is a snap. We can still group by simulations, but also add in a sample size parameter. To make a full simulation frame, we can use tidyr::crossing(), which creates all possible combinations of vectors and turns them into a data frame.

library(tidyr)

#for example
crossing(x = 1:3, y = 7:9)
# A tibble: 9 × 2
      x     y
  <int> <int>
1     1     7
2     1     8
3     1     9
4     2     7
5     2     8
6     2     9
7     3     7
8     3     8
9     3     9
# our simulations
sims_frame <- crossing(sims = 1:1000, n = c(3,5,7,9)) |>
  group_by(sims, n) |>
  mutate(sample_mean = rnorm(n, mean = m, sd = s) |> mean())|>
  ungroup()

We can then look at the effect of sample size on the standard error of the mean by getting the SD of the simulated means for each sample size and plotting.

sims_frame |>
  group_by(n) |>
  summarize(se = sd(sample_mean)) |>
  
  #oh, piping right into a ggplot!
  ggplot(aes(x = n, y = se)) +
  geom_line() +
  geom_point()

If we do this for many many sample sizes, we can generate a curve and see if there is some sample size where the SE levels off, or find a place where we are comfortable with the n versus se tradeoff.
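As a sketch of that idea (the finer sequence of sample sizes here is my own choice, not from the text above), assuming m and s as defined earlier:

```r
library(dplyr)
library(tidyr)

# population parameters, as before
m <- 10
s <- 5

# one simulated mean per sim x sample-size combination,
# over a finer range of sample sizes
se_by_n <- crossing(sims = 1:1000, n = seq(3, 30, by = 3)) |>
  group_by(sims, n) |>
  mutate(sample_mean = rnorm(n, mean = m, sd = s) |> mean()) |>
  ungroup() |>
  group_by(n) |>
  summarize(se = sd(sample_mean))

# the SE should fall off roughly as s / sqrt(n)
se_by_n
```

Plotting se against n (as above) shows the curve flattening as n grows, which is where the n versus SE tradeoff lives.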

Bootstrap resampling

This is all well and good if we’re pulling from a theoretical distribution. But, what’s this bootstrap resampling we hear about? Quite simply, it’s sampling from a sample with replacement. Rinse and repeat this many many times, and you can calculate bootstrapped statistics. We do this primarily with the sample() function. For example:

vec <- 1:10

sample(vec, replace = TRUE)
 [1] 6 7 7 1 9 6 6 2 3 7

So, if we want to get the bootstrapped SE of a sample, we can use sample() instead of rnorm() in our workflow.

# The OG Sample
my_samp <- rnorm(10, mean = m, sd = s) 

boot_frame <- tibble(sims = 1:1000) |>
  group_by(sims) |>
  mutate(boot_mean = sample(my_samp, replace = TRUE) |>
           mean()) |>
  ungroup()

So, how does the bootstrapped SE compare to the classical estimate, sd / sqrt(n)?

sd(boot_frame$boot_mean)
[1] 1.739233
sd(my_samp) / sqrt(10)
[1] 1.787256

Not bad!