# libraries
library(dplyr)
library(ggplot2)
#set that theme
theme_set(theme_bw(base_size = 12))
To simulate things, we want to repeat an action again and again.
There are a few ways to do this in R. The first is the function
replicate()
replicate(n = 5, sum(1:10))
## [1] 55 55 55 55 55
Well, that’s a hair boring, but, you can see it summed something 5
times. We can of course create fancier things to do, but, in general,
replicate()
is a solid workhorse.
R has a lot of other ways to do repetitive work. For example, your
classic for loop, where you iterate over an index. Note the use of
{}
like in a function.
# always start with an object, otherwise you eat memory
my_vec <- rep(NA, 10)
#iterate over indices
for(i in 1:10){
my_vec[i] <- mean(rnorm(10))
}
my_vec
## [1] 0.324026730 0.008839435 -0.390083370 0.056193056 -0.037800320
## [6] -0.482682694 -0.505578944 0.081784930 0.135450244 0.569138830
Or a series of functions that branch off of one called
apply()
where you apply things to elements of the vector.
Common ones include sapply()
and lapply()
which work on vectors or lists. These work easily with a predefined
function
mean_rnorm_n <- function(n) mean(rnorm(n))
sapply(1:10, mean_rnorm_n)
## [1] -1.3495863790 0.6417172832 0.8936823558 0.0001879669 -0.2447177725
## [6] 0.2178507130 0.0295330627 0.1432629430 0.4088698943 0.0564228468
But - the applies can be tricky. You don’t know if you are going to get a list, vector, or other data structure back often. So, useful for simulation, but……
There is also a beautiful library in R called purrr. This library takes the dplyr logic of mapping a function to many inputs and puts it on a different level. We’ll talk more about purrr at another time, but, for now, there are a few functions worth knowing.
image from Allison Horst https://github.com/allisonhorst/
The core purrr functions are the map_*()
functions.
Basically, a whole family of functions where you can change what is in
the * to specify the output type. For example, if we want to replicate
the replicate()
action of above, we’d be returning a
double and would use map_dbl()
. There are two
flavors. One where we specify a function, and the other where we do it
more dynamically with a ~
library(purrr)
f <- function(x) sum(1:10)
map_dbl(1:5, f)
## [1] 55 55 55 55 55
map_dbl(1:5, ~sum(1:10))
## [1] 55 55 55 55 55
Note our use of ~
to say “Hey, we’re about to call some
stuff.” Very useful if you want to skip writing a function (notice the
function HAS to have an argument).
Other useful versions of map include map_chr()
for
character returns, map_df()
for a data frame,
map_lgl
for true/false, and just plain olde
map()
which returns a list of whatever you want.
Let’s see map_df()
in action, as we’re going to use it
today. Let’s say we want to take some input, and return a data frame
with the input and the sum of all numbers up to that input. To do so, we
can write a function, or, you need to know that after the
~
, map functions call the individual argument being sent to
the function .x
.
get_new_df <- function(x){
return(data.frame(x = x,
y = sum(1:x)))
}
map_df(1:10, get_new_df)
## x y
## 1 1 1
## 2 2 3
## 3 3 6
## 4 4 10
## 5 5 15
## 6 6 21
## 7 7 28
## 8 8 36
## 9 9 45
## 10 10 55
map_df(1:10, ~data.frame(x = .x, y = sum(1:.x)))
## x y
## 1 1 1
## 2 2 3
## 3 3 6
## 4 4 10
## 5 5 15
## 6 6 21
## 7 7 28
## 8 8 36
## 9 9 45
## 10 10 55
Nifty! What if we wanted to return not only the final sum, but all numbers up until that point?
map_df(1:10, ~data.frame(x = .x, y = cumsum(1:.x)))
## x y
## 1 1 1
## 2 2 1
## 3 2 3
## 4 3 1
## 5 3 3
## 6 3 6
## 7 4 1
## 8 4 3
## 9 4 6
## 10 4 10
## 11 5 1
## 12 5 3
## 13 5 6
## 14 5 10
## 15 5 15
## 16 6 1
## 17 6 3
## 18 6 6
## 19 6 10
## 20 6 15
## 21 6 21
## 22 7 1
## 23 7 3
## 24 7 6
## 25 7 10
## 26 7 15
## 27 7 21
## 28 7 28
## 29 8 1
## 30 8 3
## 31 8 6
## 32 8 10
## 33 8 15
## 34 8 21
## 35 8 28
## 36 8 36
## 37 9 1
## 38 9 3
## 39 9 6
## 40 9 10
## 41 9 15
## 42 9 21
## 43 9 28
## 44 9 36
## 45 9 45
## 46 10 1
## 47 10 3
## 48 10 6
## 49 10 10
## 50 10 15
## 51 10 21
## 52 10 28
## 53 10 36
## 54 10 45
## 55 10 55
Exercises
Use replicate()
to repeatedly average the numbers
1:10 seven times.
Do the same thing with map_dbl()
- try it both with
a function and with a ~
What happens if you use other map functions?
Start with a vector:
my_vec <- c(11, 10, 10, 9, 10, 11, 10, 9, 10,
12, 9, 11, 8, 11, 11, 10, 11, 10,
11, 9)
Use map_df()
to make a data frame that, for the numbers
3 through 15, returns two columns. One is the the average of the element
of the vector 1 through the chosen number, the second is the standard
deviation.
e.g. 10.2 for a mean - but the 10 will be .x
for
you!
There are a number of ways to get random numbers in R from a variety of distributions. For our simulations, let’s start with a sample of 40 individuals that’s from a population with a mean value of 10, a SD of 3.
set.seed(323)
samp <- rnorm(n = 40, mean = 10, sd = 3)
Note the set.seed()
function. Random number generators
on computers aren’t REALLY random - they use some nugget of information
that changes constantly to be pseudo-random. By setting that nugget -
the seed - we make sure that all random numbers generated from this
point on are the same.
Unsettling, no?
There are a wide variety of different random number generators in R.
rnorm()
is exceedingly comming, but, we also have the
following. Let’s flip a coin 30 times and see how many heads we get if
there’s a 50:50 change of getting heads. And let’s do it 10 times!
rbinom(10, size = 30, prob = 0.5)
## [1] 10 18 16 11 16 17 14 14 15 17
We can also draw random numbers from a uniform distribution (all numbers are equally likely to be drawn) with a minimum and maximum bound.
runif(10, min = -1, max = 1)
## [1] 0.06789064 -0.41526715 -0.16290617 0.87500096 0.82789850 0.86382277
## [7] -0.88316031 0.81005091 0.99624116 0.41923706
So from this, we can draw a random sample from a population with a distribution of our choice!
rowwise()
and map_dbl()
for
simulationOK, let’s begin our journey towards figuring out what sample size we want. We now know how to generate random measurements from an assumed population. The other tool we’ll need to simulate sampling a population is some output. Let’s go through this step by step. First, let’s say we are going to try out sample sizes from 3 through 50 for a population with a mean of 45 and a standard deviation of 20. To start, we’ll need to setup a data frame with the different sample sizes we want and some parameters we can use.
# our assumptions
mean_pop <- 45
sd_pop <- 20
# use rep to repeat the sample sizes 3 through 50
samp_sim <- data.frame(samp_size = 3:50)
So, how do we create those replicate populations? Let’s say we wanted
to just make one simulated draw for each sample size. For that, we need
to 1) look at each sample size, and then 2) draw a random sample. For
the latter, we already have a function. For the former, we need a
function to go row by row. dplyr::rowwise()
. We feed it
samp_size
as an argument so that the variable is preserved
in the resulting data frame. We could also have done
dplyr::group_by(samp_size)
as well. But I often find
rowwise()
more intuitive in this case. For the later, we
actually use summarize()
, as it lets us create resulting
outputs that are not the same size as the input. Note, this is new in
dplyr 1.0, and I could not be more excited.
Let’s do this and visualize the results.
samp_sim_1 <- samp_sim |>
rowwise(samp_size) |>
summarize(samp = rnorm(samp_size, mean_pop, sd_pop))
plot(samp ~ samp_size, data = samp_sim_1)
OK, that’s nice. I have a lot of samples at each sample size. You can
see as our sample size increases, there are, indeed, more points.
However, we want to simulate a lot of sampling evens and calculate
sample statistics. This is where map_dbl()
or
replicate()
becomes very useful. Let’s say, instead of just
drawing from a normal population, we want to modify the above to
calculate a mean. And do it, oh, let’s say twice.
One thing that will make this more redable and easier to debug is to use a function.
# one sim
get_one_pop_mean <- function(samp_size,
pop_mean,
pop_sd) {
rnorm(samp_size,
pop_mean,
pop_sd) |>
mean()
}
#now a function to repeat the action
get_sim_means <- function(samp_size,
pop_mean,
pop_sd,
reps) {
map_dbl(1:reps,
~ get_one_pop_mean(samp_size,
pop_mean,
pop_sd))
}
samp_sim_means_2 <- samp_sim %>%
rowwise(samp_size) %>%
summarize(samp_mean = get_sim_means(samp_size,
mean_pop,
sd_pop,
reps = 2))
samp_sim_means_2
## # A tibble: 96 × 2
## # Groups: samp_size [48]
## samp_size samp_mean
## <int> <dbl>
## 1 3 12.9
## 2 3 71.2
## 3 4 45.6
## 4 4 54.8
## 5 5 55.1
## 6 5 54.4
## 7 6 55.0
## 8 6 52.4
## 9 7 45.0
## 10 7 32.7
## # … with 86 more rows
Oh! Now we’re getting somewhere. Before we fully sally forth, let’s outline what we’re doing. You’ll see it’s a common structure for simulations.
# write a function that does one simulations for one set of parameters
# write a function that does repeated simulations for parameters
# Take a data frame with a bunch of sample sizes
# For each row
# Feed the parameters from the row to your function
Reasonable, no? You could have written this out. Now, let’s operationalize this for 1000 simulations.
# write a function that does...
# oh - wait - we already did that!
# Take a data frame with a bunch of sample sizes
sim_results <- samp_sim %>%
# For each row
rowwise(samp_size) %>%
# Replicate calculating the mean from a random draw some number of times
summarize(samp_mean = get_sim_means(samp_size,
mean_pop,
sd_pop,
reps = 1e3))
What’s great about this (and using a map function) is that we can
also add additional information - say, the sd - using
map_df()
Note how we don’t even use = in
summarize()
# write a function that does one simulations for one set of parameters
get_1_sim_df <- function(sim_num = 1,
samp_size,
mean_pop,
sd_pop){
samp <- rnorm(samp_size,
mean_pop,
sd_pop)
data.frame(sim = sim_num,
samp_mean = mean(samp),
samp_sd = sd(samp))
}
# write a function that does repeated simulations for parameters
run_pop_sims <- function(samp_size,
mean_pop,
sd_pop,
n_sims = 5){
#run the simulations
map_df(1:n_sims,
~get_1_sim_df(sim_num = .x,
samp_size,
mean_pop,
sd_pop))
}
# Take a data frame with a bunch of sample sizes
sim_results <- samp_sim %>%
# For each row
rowwise(samp_size) %>%
# Replicate calculating the mean
# and Sdfrom a random draw some number of times
summarize(run_pop_sims(samp_size,
mean_pop,
sd_pop,
n_sims = 1e3))
sim_results
## # A tibble: 48,000 × 4
## # Groups: samp_size [48]
## samp_size sim samp_mean samp_sd
## <int> <int> <dbl> <dbl>
## 1 3 1 30.7 20.2
## 2 3 2 50.9 5.32
## 3 3 3 59.1 20.7
## 4 3 4 44.5 18.4
## 5 3 5 57.8 17.0
## 6 3 6 33.7 18.1
## 7 3 7 64.3 28.8
## 8 3 8 51.8 40.9
## 9 3 9 35.4 20.1
## 10 3 10 42.6 32.5
## # … with 47,990 more rows
Let’s try this out, and have you fill in what’s missing in these faded examples.
#Some preperatory material
set.seed(42)
mean_pop <- 10
sd_pop <- 3
nsim <- 100
sampSim <- data.frame(samp_size = 3:50)
#Mean simulations
sampSim %>%
rowwise(samp_size) %>%
summarize(samp_mean =
replicate(nsim,
rnorm(samp_size, mean_pop, sd_pop) %>% mean()))
#Now the faded examples! Fill in the ___
#Median simulations
samp_sim %>%
rowwise(samp_size) %>%
summarize(samp_median =
____(nsim,
rnorm(samp_size, mean_pop, sd_pop) %>% median()))
#SD simulations
samp_sim %>%
rowwise(samp_size) %>%
____(samp_sd =
____(nsim,
____(samp_size, mean_pop, ____) %>% sd()))
#IQR simulations
#function for interquartile range is IQR
samp_sim %>%
____(____) %>%
____(samp_iqr =
____(nsim,
____(____, mean_pop, ____) %>% IQR()))
plot
and
summarize
Great, so we’ve done the simulation! How do we determine sample size? The first way is a plot.
ggplot(sim_results,
mapping = aes(x = samp_size,
y = samp_mean)) +
geom_point(alpha = 0.1) +
theme_bw()
We can eyeball this result and see a leveling out > 20 or so. OK, that’s great, but…totally imprecise.
Better might be to see where the SD of the mean levels off. Let’s pseudocode that out in comments to see what we might need to do.
# Take our sampSim data
# Group it by sample size
# Take the SD of the sample size
# ungroup the resulting data
# Sort it by SD
# Look at the top 5 entries.
A few new things here. First, we’re grouping by sample size, not sim
number. Second, we’re summarizing. We are reducing our data down -
boiling it to its essence. For this, dplyr
has a function
called - wait for it - summarize()
. Second, we’re going to
have to do some arranging. With a dplyr
function called -
and this is a shocker - arrange
. sort
was
already taken by the base package. OK, let’s walk through the resulting
pipeline in two pieces. First, the summarization
# Take our sampSim data
sampSim_summary <- sim_results %>%
# Group it by sample size
group_by(samp_size) %>%
# Take the SD of the sample size
summarize(pop_mean_sd = sd(samp_mean))
Neat - this gives us a much reduced piece of data to work with. But - what about the arranging?
sampSim_summary <- sampSim_summary %>%
# Sort it by SD
arrange(pop_mean_sd)
sampSim_summary
## # A tibble: 48 × 2
## samp_size pop_mean_sd
## <int> <dbl>
## 1 49 2.88
## 2 46 2.89
## 3 50 2.94
## 4 48 2.95
## 5 45 2.96
## 6 47 2.96
## 7 44 2.98
## 8 43 3.01
## 9 42 3.04
## 10 40 3.16
## # … with 38 more rows
Now, notice that I don’t have to use head
or chop of the
top few rows (there’s a dplyr function for that - slice()
).
That’s because dplyr creates a data frame-like object called a
tibble
. We’ll talk about this in more detail later. For
now, tibbles work just like data frames, but they print out a little
differently. If we REALLY wanted a data frame, we could just use
as.data.frame()
.
As for the result, we can see that, eh, something in the 40s has the lowest SD, but things bounce around. We could also plot this.
ggplot(sampSim_summary,
mapping = aes(x = samp_size, y = pop_mean_sd)) +
geom_point() +
geom_line() +
theme_classic()
So, what is acceptable? At what point does adding another sample begin to have diminishing returns.
Look at the resulting simulations using the sample. Would you chose a different optimal sample size?
Repeat this whole process, but for the sample median. What is the optimal sample size? Feel free to reuse code or modify code in place (for the simulations).
So, as we launch into things like standard errors, I wanted to pause
and hit a few key functions we need to describe a sample. Let’s use the
samp
vector from above as adata source. First, mean, sd,
and median.
mean(samp)
## [1] 10.28201
sd(samp)
## [1] 2.881881
median(samp)
## [1] 10.0977
This is pretty great. There are a ton of functions like these. Such as
IQR(samp)
## [1] 4.15341
Although there is no standard error of the mean or confidence interval function. There are functions for skew, kurtosis, and more in other packages. Last, for arbitrary percentiles, there’s the ‘quantile()’ function.
#quartiles by defeault
quantile(samp)
## 0% 25% 50% 75% 100%
## 5.648247 8.072854 10.097696 12.226264 18.916196
#or, get what you want
quantile(samp, probs = c(0.05, 0.1, 0.5, 0.9, 0.95))
## 5% 10% 50% 90% 95%
## 6.264685 7.177671 10.097696 13.676815 14.484173
You can get individual values by specifying the percentile you want with probs.
OK, but what about bootstrapped standard errors? Or, you know, bootstrapping in general. Let’s start by generating a ‘sample’ with a sample size of 40.
set.seed(2020)
samp <- rnorm(n = 40, mean = 10, sd = 3)
OK, how do we get one bootstrap sample from this? Why, with
sample()
of course. We give sample()
a few
arguments about how big of a sample we want as well as if we want to
sample with or without replacement.
one_boot <- sample(samp, size = length(samp), replace = TRUE)
Let’s look at the bootstrapped standard error of a median from our
samp vector. We’re going to do it using the same techniques of
simulation - and in particular with replicate()
. So, let’s
replicate taking the median of a sample with replacement of the samp
vector. Then we can get the standard deviation of that.
median_sims <- replicate(n = 100,
median(sample(samp,length(samp), replace=TRUE)))
sd(median_sims)
## [1] 0.3784669
Simple, no? Now…. can you think about how you might use this in the context of trying different sample sizes? Maybe a function of your own can help you along the way.
one_boot_median <- function(samp){
sample(samp,
size = length(samp),
replace = TRUE) |>
median()
}
meadian_from_boot <- function(n, pop_mean = 0, pop_sd = 1,
n_boot = 1000){
# draw a sample
samp <- rnorm(n, mean = pop_mean, sd = pop_sd)
# get a bootstrap resample
sims <- replicate(n_boot,one_boot_median(samp))
# return the sims
sims
}
# hey, it's the se from bootstrapping for ONE sample
# drawn from a population at n = 4
meadian_from_boot(4) |> sd()
## [1] 0.2554226
OK, you’ve done simulations from populations. Now, let’s mesh those bootstraps with our concepts of simulation at different sample sizes and…. plot out how to look at change in bootstrapped SE for the IQR at different sample sizes.
First, make sure you feel comfortable calculating the bootstrapped SE of the IQR from samp. Repeat what we did above with IQR instead of median.
Now, write out in comments what you will do to end up with a data
frame that has a column of sample sizes and a column of IQRs calculated
from sampling our samp
vector.
Code it!
Now, write out in comments how you would go from that data frame to one that has the SE for the IQR at each sample size.