1. Setting the stage

1.1 Pipes

Often in R, what we kludge together different pieces of code to generate a sequence of events. For example, let’s say I wanted to take a vector , find out it’s length, and get the square root. Then get the log of that. There are a few ways I could do this.

vec <- 1:10
len <- length(vec)
sq <- sqrt(len)
log(sq)
## [1] 1.151293

This makes sense, and is very logical, but I keep saving new variables as I go along. I could write some nested piece of code that does the same thing with no variables.

log(sqrt(length(1:10)))
## [1] 1.151293

Which is efficient, but, man, I hope I can keep track of parentheses. It’s also ugly as hell. And super difficult for future-me to understand.

This idea of doing one thing, then the next, then the next is often how we treat our data. We take a piece of data, sift it, transform it, summarize it, etc. And this step-by-step style of coding has a name - functional programming. It lies at the core of much of what we do in R.

So, is there a better way? Yes! What if we could take the output of one function, and pass it to the next? This is known as piping. And it is implemented in R by the magrittr library. (Pause. Get the joke. Laugh at it. Move on.) magrittr introduces a funny new operator %>% which says, take the output of the thing on the right hand side, and pass it to the next function as its first argument.

As a simple example, let’s sum one through ten using pipes.

#If you haven't, install.packages("magrittr") first
library(magrittr)
1:10 %>% sum()
## [1] 55

Notice that we supplied no argument to sum. Now, what if wanted to take the square root of the result?

1:10 %>% 
  sum() %>%
  sqrt()
## [1] 7.416198

Notice how I separated different steps on different lines? This is a more readable code. Future me can track what’s going on, line by line, instead of having to read a looooong single line of code.

Let’s try a few exercises with pipes.

  1. Use pipes to sum the log of 100:200.

  2. Use pipes to take the square root of the mean of 100 random uniform numbers.

  3. Let’s dive into the guts of R. Using the mtcars data frame, get it’s summary and str that summary. What do you get back?

1.2 Plot

The last basic thing we’ll need today is plotting, as we’ll want to see the results of our simulations. Next week we’ll learn ggplot2 and get fancy. For the moment, I just want you to know a few things in baseplot.

First, here’s how to make a histogram.

#just some random numbers
vals <- runif(50, 5, 10)

#the histogram
hist(vals)

Easy, no?

Now, what if I wanted to plot x by y? A scatterplot. R uses an equation-like syntax for this, where I plot y ~ x. We’ll use this again and again as we fit linear models. For now, here’s an example scatterplot.

#first, a data frame
my_df <- data.frame(x=1:10, y=10:1)

#note how data is used
#as well as column names in the equation
plot(y ~ x, data=my_df)

Now, you can supply all sorts of arguments on plot - look at the help file. We’re not really going to worry about those today.

Let’s try some example exercises. And, uh, then this is the last you will ever see of baseplot again in this class.

  1. Using mtcars, plot mpg by cyl. What happens if you set pch to 19 or 5?

  2. boxplot() works like plot. Use it for mpg ~ vs. Now try the col argument. Give it a vector of two colors, like c("red", "blue").

2. dplyr

dplyr will change your life. It is the data munging tool you didn’t know you needed until now. To learn our way around it, we’ll use it with the mtcars data frame. dplyr is a collection of functions that manipulate data frames (or data tables, or tibbles, or other objects that work like data frames - more on those later). It’s functions largely take the data frame object as the first argument, and then return data frames as the output, so pipes are a natural fit for dplyr. Indeed, if you load dplyr, magrittr is loaded by default.

2.1 mutate()

image from Allison Horst https://github.com/allisonhorst/

Often, we want to make a column that uses information from other columns. For this, we have mutate. For example, what if we wanted to make a log transformed column of mpg? Note, I’m going to wrap the whole thing in head, so we only get the first 6 lines of the result

library(dplyr)

mtcars2 <-  mutate(mtcars, log_mpg = log(mpg)) 

head(mtcars2)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb  log_mpg
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.044522
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.044522
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3.126761
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 3.063391
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 2.928524
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 2.895912

OK, the first argument was the data frame. Then, we just specified a new column from whole cloth, but used one of the columns from mtcars to make it. How easy! It’s even easier with pipes.

mtcars2 <- mtcars %>%
    mutate(log_mpg = log(mpg))

head(mtcars2)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb  log_mpg
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.044522
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.044522
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3.126761
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 3.063391
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 2.928524
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 2.895912

2.2 group_by()

One of the ways we use dplyr the most is group_by(). This lets us group by different unique values of a column - think treatments or categories - and do something to those sub-data-frames. For example, let’s say we wanted to group by number of gears, and then use mutate to make a new column that’s the average of the mpg values.

mtcars_group <- mtcars %>%
  group_by(gear) %>%
  mutate(avg_mpg = mean(mpg)) %>%
  ungroup()

head(mtcars_group)
## # A tibble: 6 x 12
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb avg_mpg
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4    24.5
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4    24.5
## 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1    24.5
## 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1    16.1
## 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2    16.1
## 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1    16.1

Notice the ungroup() at the end. That’s provided as some functions bork if you give them a grouped data frame. We want to remove that structure.

Now see that final column - each row with the same number of gears has the same avg_mpg value.

2.3 summarize()

Often, you don’t want to have a data frame with summarized information repeated. You just want to reduce your data frame down to summarized information. In Excel, this is pivot tables, basically. For that, we use a combination of group_by() and summarize() Let’s do the same thing as the previous example, only let’s add the standard deviation, and return just the summarized information.

mtcars %>%
  group_by(gear) %>%
  summarize(avg_mpg = mean(mpg),
            sd_mpg = sd(mpg)) %>% 
  ungroup()
## # A tibble: 3 x 3
##    gear avg_mpg sd_mpg
##   <dbl>   <dbl>  <dbl>
## 1     3    16.1   3.37
## 2     4    24.5   5.28
## 3     5    21.4   6.66

Whoah! Cool!

You try it, but group by gear and vs, and look at how that alters weight. Yes, you can group by multiple things in the same group_by statement - just separate them with a comma!

2.4 select()

Often when working with a data frame, you end up with some column you no longer want. Maybe it was intermediate to some calculation. Or maybe you just want only a few columns to save memory. Either way, select is a way to include or exclude columns.

head(select(mtcars2, -mpg))
##   cyl disp  hp drat    wt  qsec vs am gear carb  log_mpg
## 1   6  160 110 3.90 2.620 16.46  0  1    4    4 3.044522
## 2   6  160 110 3.90 2.875 17.02  0  1    4    4 3.044522
## 3   4  108  93 3.85 2.320 18.61  1  1    4    1 3.126761
## 4   6  258 110 3.08 3.215 19.44  1  0    3    1 3.063391
## 5   8  360 175 3.15 3.440 17.02  0  0    3    2 2.928524
## 6   6  225 105 2.76 3.460 20.22  1  0    3    1 2.895912

Uh oh! Where did mpg go? Gone!

Or

mtcars3 <- mtcars %>%
  select(wt, am, gear)

head(mtcars3)
##                      wt am gear
## Mazda RX4         2.620  1    4
## Mazda RX4 Wag     2.875  1    4
## Datsun 710        2.320  1    4
## Hornet 4 Drive    3.215  0    3
## Hornet Sportabout 3.440  0    3
## Valiant           3.460  0    3

One can even use some more imprecise matches if need be.

head(select(mtcars2, contains("mpg")))
##    mpg  log_mpg
## 1 21.0 3.044522
## 2 21.0 3.044522
## 3 22.8 3.126761
## 4 21.4 3.063391
## 5 18.7 2.928524
## 6 18.1 2.895912

2.5 filter()

image from Allison Horst https://github.com/allisonhorst/

We can also exclude by row values using filter. This function takes all manner of comparisons, and returns only those rows for which the comparison is true. For example, to get rid of 3 cylinder cars:

mtcars_filter <- mtcars %>%
  filter(cyl != 3)

head(mtcars_filter)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

I’ll often use this to remove rows with NA values for some column, such as

mtcars %>%
  filter(!is.na(cyl))

Exercises
1. Add some columns to mtcars to plot the log of mpg by the square root of hp.

  1. Get the average hp per gear and plot them against each other.

  2. Make a data fame for only 6 cylinder engines with only the disp and carb columns. Create a boxplot. of how carb influences disp.

3. A purrr-fect simulation

3.1 Replicators Abound

To simulate things, we want to repeat an action again and again. There are a few ways to do this in R. The first is the function replicate()

replicate(n = 5, sum(1:10))
## [1] 55 55 55 55 55

Well, that’s a hair boring, but, you can see it summed something 5 times. We can of course create fancier things to do, but, in general, replicate() is a solid workhorse.

3.2 But I like cats…

There is also a beautiful library in R called purrr. This library takes the dplyr logic of mapping a function to many inputs and puts it on a different level. We’ll talk more about purrr at another time, but, for now, there are a few functions worth knowing.

image from Allison Horst https://github.com/allisonhorst/

The core purrr functions are the map_*() functions. Basically, a whole family of functions where you can change what is in the * to specify the output type. For example, if we want to replicate the replicate() action of above, we’d be returning a double and would use map_dbl().

library(purrr)
map_dbl(1:5, ~sum(1:10))
## [1] 55 55 55 55 55

Note our use of ~ to say “Hey, we’re about to call some stuff.” There are other ways to use map, but they involve writing your own functions, and, we’ll leave that for another time.

Other useful versions of map include map_chr() for character returns, map_df() for a data frame, map_lgl for true/false, and just plain olde map() which returns a list of whatever you want.

Let’s see map_df() in action, as we’re going to use it today. Let’s say we want to take some input, and return a data frame with the input and the sum of all numbers up to that input. To do so, you need to know that after the ~, map functions call the individual argument being sent to the function .x.

map_df(1:10, ~data.frame(x = .x, y = sum(1:.x)))
##     x  y
## 1   1  1
## 2   2  3
## 3   3  6
## 4   4 10
## 5   5 15
## 6   6 21
## 7   7 28
## 8   8 36
## 9   9 45
## 10 10 55

Nifty! What if we wanted to return not only the final sum, but all numbers up until that point?

map_df(1:10, ~data.frame(x = .x, y = cumsum(1:.x)))
##     x  y
## 1   1  1
## 2   2  1
## 3   2  3
## 4   3  1
## 5   3  3
## 6   3  6
## 7   4  1
## 8   4  3
## 9   4  6
## 10  4 10
## 11  5  1
## 12  5  3
## 13  5  6
## 14  5 10
## 15  5 15
## 16  6  1
## 17  6  3
## 18  6  6
## 19  6 10
## 20  6 15
## 21  6 21
## 22  7  1
## 23  7  3
## 24  7  6
## 25  7 10
## 26  7 15
## 27  7 21
## 28  7 28
## 29  8  1
## 30  8  3
## 31  8  6
## 32  8 10
## 33  8 15
## 34  8 21
## 35  8 28
## 36  8 36
## 37  9  1
## 38  9  3
## 39  9  6
## 40  9 10
## 41  9 15
## 42  9 21
## 43  9 28
## 44  9 36
## 45  9 45
## 46 10  1
## 47 10  3
## 48 10  6
## 49 10 10
## 50 10 15
## 51 10 21
## 52 10 28
## 53 10 36
## 54 10 45
## 55 10 55

Exercises

  1. Use replicate() to repeatedly average the numbers 1:10 seven times.

  2. Do the same thing with map_dbl() - also what happens if you use other map functions?

  3. Start with a vector:

my_vec <- c(11, 10, 10, 9, 10, 11, 10, 9, 10, 
            12, 9, 11, 8, 11, 11, 10, 11, 10, 
            11, 9)

Use map_df() to make a data frame that, for the numbers 3 through 15, returns two columns. One is the the average of the element of the vector 1 through the chosen number, the second is the standard deviation.

e.g. 10.2 for a mean - but the 10 will be .x for you!

4. Simulation and Sample Size

4.1 Random Number Generation

There are a number of ways to get random numbers in R from a variety of distributions. For our simulations, let’s start with a sample of 40 individuals that’s from a population with a mean value of 10, a SD of 3.

set.seed(323)
samp <- rnorm(n = 40, mean = 10, sd = 3)

Note the set.seed() function. Random number generators on computers aren’t REALLY random - they use some nugget of information that changes constantly to be pseudo-random. By setting that nugget - the seed - we make sure that all random numbers generated from this point on are the same.

Unsettling, no?

There are a wide variety of different random number generators in R. rnorm() is exceedingly comming, but, we also have the following. Let’s flip a coin 30 times and see how many heads we get if there’s a 50:50 change of getting heads. And let’s do it 10 times!

rbinom(10, size = 30, prob = 0.5)
##  [1] 10 18 16 11 16 17 14 14 15 17

We can also draw random numbers from a uniform distribution (all numbers are equally likely to be drawn) with a minimum and maximum bound.

runif(10, min = -1, max = 1)
##  [1]  0.06789064 -0.41526715 -0.16290617  0.87500096  0.82789850  0.86382277
##  [7] -0.88316031  0.81005091  0.99624116  0.41923706

So from this, we can draw a random sample from a population with a distribution of our choice!

4.2 Using rowwise() and map_dbl() for simulation

OK, let’s begin our journey towards figuring out what sample size we want. We now know how to generate random measurements from an assumed population. The other tool we’ll need to simulate sampling a population is some output. Let’s go through this step by step. First, let’s say we are going to try out sample sizes from 3 through 50 for a population with a mean of 45 and a standard deviation of 20. To start, we’ll need to setup a data frame with the different sample sizes we want and some parameters we can use.

# our assumptions
mean_pop <- 45
sd_pop <- 20

# use rep to repeat the sample sizes 3 through 50
samp_sim <- data.frame(samp_size = 3:50)

So, how do we create those replicate populations? Let’s say we wanted to just make one simulated draw for each sample size. For that, we need to 1) look at each sample size, and then 2) draw a random sample. For the latter, we already have a function. For the former, we need a function to go row by row. dplyr::rowwise(). We feed it samp_size as an argument so that the variable is preserved in the resulting data frame. We could also have done dplyr::group_by(samp_size) as well. But I often find rowwise() more intuitive in this case. For the later, we actually use summarize(), as it lets us create resulting outputs that are not the same size as the input. Note, this is new in dplyr 1.0, and I could not be more excited.

Let’s do this and visualize the results.

samp_sim_1 <- samp_sim %>%
  rowwise(samp_size) %>%
  summarize(samp = rnorm(samp_size, mean_pop, sd_pop))

plot(samp ~ samp_size, data = samp_sim_1)

OK, that’s nice. I have a lot of samples at each sample size. You can see as our sample size increases, there are, indeed, more points. However, we want to simulate a lot of sampling evens and calculate sample statistics. This is where map_dbl() or replicate() becomes very useful. Let’s say, instead of just drawing from a normal population, we want to modify the above to calculate a mean. And do it, oh, let’s say twice.

samp_sim_means_2 <- samp_sim %>%
  rowwise(samp_size) %>%
  summarize(samp_mean = 
              replicate(2, rnorm(samp_size, mean_pop, sd_pop) %>% mean()))

samp_sim_means_2
## # A tibble: 96 x 2
## # Groups:   samp_size [48]
##    samp_size samp_mean
##        <int>     <dbl>
##  1         3      12.9
##  2         3      71.2
##  3         4      45.6
##  4         4      54.8
##  5         5      55.1
##  6         5      54.4
##  7         6      55.0
##  8         6      52.4
##  9         7      45.0
## 10         7      32.7
## # … with 86 more rows

Oh! Now we’re getting somewhere. Before we fully sally forth, let’s outline what we’re doing. You’ll see it’s a common structure for simulations.

# Take a data frame with a bunch of sample sizes
# For each row
# Replicate calculating the mean from a random draw some number of times

Reasonable, no? You could have written this out. Now, let’s operationalize this for 1000 simulations.

# Take a data frame with a bunch of sample sizes
sim_results <- samp_sim %>%

  # For each row
  rowwise(samp_size) %>%
  
  # Replicate calculating the mean from a random draw some number of times
  summarize(samp_mean = map_dbl(1:1000, 
                                ~ rnorm(samp_size, mean_pop, sd_pop) %>%
                                  mean()))

What’s great about this (and using a map function) is that we can also add additional information - say, the sd - using map_df() Note how we don’t even use = in summarize()

# Take a data frame with a bunch of sample sizes
sim_results <- samp_sim %>%

  # For each row
  rowwise(samp_size) %>%
  
  # Replicate calculating the mean 
  # and Sdfrom a random draw some number of times
  summarize(map_df(1:1000, 
                   ~data.frame(sim = .x,
                               samp_mean = rnorm(samp_size,
                                                 mean_pop, 
                                                 sd_pop) %>%
                                 mean(),
                               samp_sd = rnorm(samp_size,
                                               mean_pop, 
                                               sd_pop) %>%
                                 sd())))

sim_results
## # A tibble: 48,000 x 4
## # Groups:   samp_size [48]
##    samp_size   sim samp_mean samp_sd
##        <int> <int>     <dbl>   <dbl>
##  1         3     1      30.7    5.32
##  2         3     2      59.1   18.4 
##  3         3     3      57.8   18.1 
##  4         3     4      64.3   40.9 
##  5         3     5      35.4   32.5 
##  6         3     6      38.6   14.9 
##  7         3     7      40.2   11.7 
##  8         3     8      39.4   10.9 
##  9         3     9      40.0   25.8 
## 10         3    10      49.9   13.6 
## # … with 47,990 more rows

4.3 Faded Examples.

Let’s try this out, and have you fill in what’s missing in these faded examples.

#Some preperatory material
set.seed(42)
mean_pop <- 10
sd_pop <- 3
nsim <- 100
sampSim <- data.frame(samp_size = 3:50)

#Mean simulations
sampSim %>%
  rowwise(samp_size) %>%
  summarize(samp_mean = 
              replicate(nsim, 
                        rnorm(samp_size, mean_pop, sd_pop) %>% mean()))


#Now the faded examples! Fill in the ___

#Median simulations
samp_sim %>%
  rowwise(samp_size) %>%
  summarize(samp_median = 
              ____(nsim, 
                        rnorm(samp_size, mean_pop, sd_pop) %>% median()))

#SD simulations
samp_sim %>%
  rowwise(samp_size) %>%
  ____(samp_sd = 
              ____(nsim, 
                        ____(samp_size, mean_pop, ____) %>% sd()))

  
  
  
#IQR simulations
#function for interquartile range is IQR
samp_sim %>%
  ____(____) %>%
  ____(samp_iqr = 
              ____(nsim, 
                        ____(____, mean_pop, ____) %>% IQR()))

4.4 Determining Optimal Sample Size with plot and summarize

Great, so we’ve done the simulation! How do we determine sample size? The first way is a plot.

plot(samp_mean ~ samp_size, data=sim_results)

We can eyeball this result and see a leveling out > 20 or so. OK, that’s great, but…totally imprecise.

Better might be to see where the SD of the mean levels off. Let’s pseudocode that out in comments to see what we might need to do.

# Take our sampSim data
# Group it by sample size
# Take the SD of the sample size
# ungroup the resulting data
# Sort it by SD
# Look at the top 5 entries.

A few new things here. First, we’re grouping by sample size, not sim number. Second, we’re summarizing. We are reducing our data down - boiling it to its essence. For this, dplyr has a function called - wait for it - summarize(). Second, we’re going to have to do some arranging. With a dplyr function called - and this is a shocker - arrange. sort was already taken by the base package. OK, let’s walk through the resulting pipeline in two pieces. First, the summarization

# Take our sampSim data
sampSim_summary <- sim_results %>%
  # Group it by sample size
  group_by(samp_size) %>%
  # Take the SD of the sample size
  summarize(pop_mean_sd = sd(samp_mean))

Neat - this gives us a much reduced piece of data to work with. But - what about the arranging?

sampSim_summary <- sampSim_summary %>%
  # Sort it by SD
  arrange(pop_mean_sd)

sampSim_summary
## # A tibble: 48 x 2
##    samp_size pop_mean_sd
##        <int>       <dbl>
##  1        49        2.73
##  2        47        2.86
##  3        50        2.88
##  4        45        2.92
##  5        48        2.99
##  6        46        3.02
##  7        43        3.06
##  8        44        3.13
##  9        42        3.15
## 10        41        3.17
## # … with 38 more rows

Now, notice that I don’t have to use head or chop of the top few rows (there’s a dplyr function for that - slice()). That’s because dplyr creates a data frame-like object called a tibble. We’ll talk about this in more detail later. For now, tibbles work just like data frames, but they print out a little differently. If we REALLY wanted a data frame, we could just use as.data.frame().

As for the result, we can see that, eh, something in the 40s has the lowest SD, but things bounce around. We could also plot this.

plot(pop_mean_sd ~ samp_size, data = sampSim_summary)

So, what is acceptable? At what point does adding another sample begin to have diminishing returns.

4.5 Exercises.

  1. Look at the resulting simulations using the sample. Would you chose a different optimal sample size?

  2. Repeat this whole process, but for the sample median. What is the optimal sample size? Feel free to reuse code or modify code in place (for the simulations).

5. Bootstrapped Standard Errors

5.1 Basic Sample Properties

So, as we launch into things like standard errors, I wanted to pause and hit a few key functions we need to describe a sample. Let’s use the samp vector from above as adata source. First, mean, sd, and median.

mean(samp)
## [1] 10.28201
sd(samp)
## [1] 2.881881
median(samp)
## [1] 10.0977

This is pretty great. There are a ton of functions like these. Such as

IQR(samp)
## [1] 4.15341

Although there is no standard error of the mean or confidence interval function. There are functions for skew, kurtosis, and more in other packages. Last, for arbitrary percentiles, there’s the ‘quantile()’ function.

#quartiles by defeault
quantile(samp)
##        0%       25%       50%       75%      100% 
##  5.648247  8.072854 10.097696 12.226264 18.916196
#or, get what you want
quantile(samp, probs = c(0.05, 0.1, 0.5, 0.9, 0.95))
##        5%       10%       50%       90%       95% 
##  6.264685  7.177671 10.097696 13.676815 14.484173

You can get individual values by specifying the percentile you want with probs.

5.2 The Bootstrap

OK, but what about bootstrapped standard errors? Or, you know, bootstrapping in general. Let’s start by generating a ‘sample’ with a sample size of 40.

set.seed(2020)
samp <- rnorm(n = 40, mean = 10, sd = 3)

OK, how do we get one bootstrap sample from this? Why, with sample() of course. We give sample() a few arguments about how big of a sample we want as well as if we want to sample with or without replacement.

one_boot <- sample(samp, size = length(samp), replace = TRUE)

5.3 Bootstrapped SEs

Let’s look at the bootstrapped standard error of a median from our samp vector. We’re going to do it using the same techniques of simulation - and in particular with replicate(). So, let’s replicate taking the median of a sample with replacement of the samp vector. Then we can get the standard deviation of that.

median_sims <- replicate(n = 100, 
                         median(sample(samp,length(samp), replace=TRUE)))

sd(median_sims)
## [1] 0.3784669

Simple, no? Now…. can you think about how you might use this in the context of trying different sample sizes?

5.4 Exercise! How does sample size change bootstrapped SE?

OK, you’ve done simulations from populations. Now, let’s mesh those bootstraps with our concepts of simulation at different sample sizes and…. plot out how to look at change in bootstrapped SE for the IQR at different sample sizes.

  1. First, make sure you feel comfortable calculating the bootstrapped SE of the IQR from samp. Repeat what we did above with IQR instead of median.

  2. Now, write out in comments what you will do to end up with a data frame that has a column of sample sizes and a column of IQRs calculated from sampling our samp vector.

  3. Code it!

  4. Now, write out in comments how you would go from that data frame to one that has the SE for the IQR at each sample size.