Often in R, we kludge together different pieces of code to generate a sequence of events. For example, let’s say I wanted to take a vector, find out its length, and get the square root. Then get the log of that. There are a few ways I could do this.
vec <- 1:10
len <- length(vec)
sq <- sqrt(len)
log(sq)
## [1] 1.151293
This makes sense, and is very logical, but I keep saving new variables as I go along. I could write some nested piece of code that does the same thing with no variables.
log(sqrt(length(1:10)))
## [1] 1.151293
Which is efficient, but, man, I hope I can keep track of parentheses. It’s also ugly as hell. And super difficult for future-me to understand.
This idea of doing one thing, then the next, then the next is often how we treat our data. We take a piece of data, sift it, transform it, summarize it, etc. And this step-by-step style of coding has a name - functional programming. It lies at the core of much of what we do in R.
So, is there a better way? Yes! What if we could take the output of one function, and pass it to the next? This is known as piping. And it is implemented in R by the magrittr library. (Pause. Get the joke. Laugh at it. Move on.) magrittr introduces a funny new operator, %>%, which says: take the output of the thing on the left hand side, and pass it to the function on the right hand side as its first argument.
As a simple example, let’s sum one through ten using pipes.
#If you haven't, install.packages("magrittr") first
library(magrittr)
1:10 %>% sum()
## [1] 55
Notice that we supplied no argument to sum. Now, what if we wanted to take the square root of the result?
1:10 %>%
sum() %>%
sqrt()
## [1] 7.416198
Notice how I separated different steps on different lines? This is more readable code. Future me can track what’s going on, line by line, instead of having to read a looooong single line of code.
Let’s try a few exercises with pipes.
Use pipes to sum the log of 100:200.
Use pipes to take the square root of the mean of 100 random uniform numbers.
Let’s dive into the guts of R. Using the mtcars data frame, get its summary and str that summary. What do you get back?
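If you want to check your work, here is one possible set of answers - any pipeline that gets the same result is just as good.

```r
#One possible set of answers - yours may differ
library(magrittr)

#sum the log of 100:200
100:200 %>%
  log() %>%
  sum()

#square root of the mean of 100 random uniform numbers
runif(100) %>%
  mean() %>%
  sqrt()

#summary of mtcars, and the str of that summary
mtcars %>%
  summary() %>%
  str()
```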
The last basic thing we’ll need today is plotting, as we’ll want to see the results of our simulations. Next week we’ll learn ggplot2 and get fancy. For the moment, I just want you to know a few things in base plot.
First, here’s how to make a histogram.
#just some random numbers
vals <- runif(50, 5, 10)
#the histogram
hist(vals)
Easy, no?
Now, what if I wanted to plot x by y? A scatterplot. R uses an equation-like syntax for this, where I plot y ~ x. We’ll use this again and again as we fit linear models. For now, here’s an example scatterplot.
#first, a data frame
my_df <- data.frame(x=1:10, y=10:1)
#note how data is used
#as well as column names in the equation
plot(y ~ x, data=my_df)
Now, you can supply all sorts of arguments to plot - look at the help file. We’re not really going to worry about those today.
Let’s try some examples.
Using mtcars, plot mpg by cyl. What happens if you set pch to 19 or 5?
boxplot() works like plot. Use it for mpg ~ vs. Now try the col argument. Give it a vector of two colors, like c("red", "blue").
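If you get stuck, here is one possible set of answers - treat it as a sketch, not the only way to do it.

```r
#pch sets the point symbol: 19 is a filled circle, 5 is an open diamond
plot(mpg ~ cyl, data = mtcars, pch = 19)

#boxplot uses the same y ~ x formula syntax
#col gives one color per group of vs
boxplot(mpg ~ vs, data = mtcars, col = c("red", "blue"))
```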
dplyr will change your life. It is the data munging tool you didn’t know you needed until now. To learn our way around it, we’ll use it with the mtcars data frame. dplyr is a collection of functions that manipulate data frames (or data tables, or tibbles, or other objects that work like data frames - more on those later). Its functions largely take a data frame as the first argument, and then return a data frame as the output, so pipes are a natural fit for dplyr. Indeed, if you load dplyr, the %>% operator from magrittr comes along with it.
mutate()
Often, we want to make a column that uses information from other columns. For this, we have mutate(). For example, what if we wanted to make a log transformed column of mpg? Note, I’m going to wrap the whole thing in head(), so we only get the first 6 lines of the result.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mtcars2 <- mutate(mtcars, log_mpg = log(mpg))
head(mtcars2)
## mpg cyl disp hp drat wt qsec vs am gear carb log_mpg
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.044522
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.044522
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3.126761
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.063391
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 2.928524
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 2.895912
OK, the first argument was the data frame. Then, we just specified a new column from whole cloth, but used one of the columns from mtcars to make it. How easy! It’s even easier with pipes.
mtcars2 <- mtcars %>%
mutate(log_mpg = log(mpg))
head(mtcars2)
## mpg cyl disp hp drat wt qsec vs am gear carb log_mpg
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.044522
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.044522
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3.126761
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.063391
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 2.928524
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 2.895912
group_by()
One of the ways we use dplyr the most is group_by(). This lets us group by different unique values of a column - think treatments or categories - and do something to those sub-data-frames. For example, let’s say we wanted to group by number of gears, and then use mutate to make a new column that’s the average of the mpg values.
mtcars_group <- mtcars %>%
group_by(gear) %>%
mutate(avg_mpg = mean(mpg)) %>%
ungroup()
head(mtcars_group)
## # A tibble: 6 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## # ... with 1 more variables: avg_mpg <dbl>
Notice the ungroup() at the end. That’s there because some functions bork if you give them a grouped data frame, so we want to remove that grouping structure.
Now see that final column - each row with the same number of gears has the same avg_mpg value.
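One quick way to convince yourself of this is distinct(), which keeps only unique combinations of the columns you name - we should get back just one avg_mpg per gear value.

```r
library(dplyr)

mtcars %>%
  group_by(gear) %>%
  mutate(avg_mpg = mean(mpg)) %>%
  ungroup() %>%
  #one row per unique gear/avg_mpg combination
  distinct(gear, avg_mpg)
```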
summarize()
Often, you don’t want to have a data frame with summarized information repeated. You just want to reduce your data frame down to summarized information. In Excel, this is basically pivot tables. For that, we use a combination of group_by() and summarize().
Let’s do the same thing as the previous example, only let’s add the standard deviation, and return just the summarized information.
mtcars %>%
group_by(gear) %>%
summarize(avg_mpg = mean(mpg),
sd_mpg = sd(mpg)) %>%
ungroup()
## # A tibble: 3 x 3
## gear avg_mpg sd_mpg
## <dbl> <dbl> <dbl>
## 1 3 16.10667 3.371618
## 2 4 24.53333 5.276764
## 3 5 21.38000 6.658979
Whoah! Cool!
You try it, but group by gear and vs, and look at how that alters weight. Yes, you can group by multiple things in the same group_by() statement - just separate them with a comma!
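If you want to check your answer, here is one possible solution - note the two column names inside group_by().

```r
library(dplyr)

#group by both gear and vs, then summarize weight
#within each gear/vs combination
mtcars %>%
  group_by(gear, vs) %>%
  summarize(avg_wt = mean(wt)) %>%
  ungroup()
```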
select()
Often when working with a data frame, you end up with some column you no longer want. Maybe it was intermediate to some calculation. Or maybe you just want only a few columns to save memory. Either way, select is a way to include or exclude columns.
head(select(mtcars2, -mpg))
## cyl disp hp drat wt qsec vs am gear carb log_mpg
## 1 6 160 110 3.90 2.620 16.46 0 1 4 4 3.044522
## 2 6 160 110 3.90 2.875 17.02 0 1 4 4 3.044522
## 3 4 108 93 3.85 2.320 18.61 1 1 4 1 3.126761
## 4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.063391
## 5 8 360 175 3.15 3.440 17.02 0 0 3 2 2.928524
## 6 6 225 105 2.76 3.460 20.22 1 0 3 1 2.895912
Uh oh! Where did mpg go? Gone!
Or, we can keep just the columns we name.
mtcars3 <- mtcars %>%
select(wt, am, gear)
head(mtcars3)
## wt am gear
## Mazda RX4 2.620 1 4
## Mazda RX4 Wag 2.875 1 4
## Datsun 710 2.320 1 4
## Hornet 4 Drive 3.215 0 3
## Hornet Sportabout 3.440 0 3
## Valiant 3.460 0 3
One can even use some more imprecise matches if need be.
head(select(mtcars2, contains("mpg")))
## mpg log_mpg
## 1 21.0 3.044522
## 2 21.0 3.044522
## 3 22.8 3.126761
## 4 21.4 3.063391
## 5 18.7 2.928524
## 6 18.1 2.895912
filter()
We can also exclude by row values using filter. This function takes all manner of comparisons, and returns only those rows for which the comparison is true. For example, to get rid of 3 cylinder cars:
mtcars_filter <- mtcars %>%
filter(cyl != 3)
head(mtcars_filter)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
I’ll often use this to remove rows with NA values for some column, such as
mtcars %>%
filter(!is.na(cyl))
sample()
There are a number of ways to get random numbers in R from a variety of distributions. For our simulations, let’s start with a sample of 40 individuals from a population with a mean value of 10 and an SD of 3.
set.seed(323)
samp <- rnorm(n = 40, mean = 10, sd = 3)
Note the set.seed() function. Random number generators on computers aren’t REALLY random - they use some nugget of information that changes constantly to be pseudo-random. By fixing that nugget - the seed - we make sure that the same “random” numbers are generated every time this code is run.
Unsettling, no?
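To see what set.seed() actually buys you, generate some numbers twice from the same seed:

```r
set.seed(323)
a <- rnorm(5)

set.seed(323)
b <- rnorm(5)

#same seed, same "random" numbers
identical(a, b)
## [1] TRUE
```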
So, we can pull a random sample from a perfect population anytime we’d like. How do we get a bootstrap sample from a population?
Why, with the sample function. Let’s sample the sample we’ve created above, and draw five different elements.
sample(samp, 5)
## [1] 13.644648 14.366408 16.721714 12.258668 9.629224
This is incredibly useful in a wide variety of contexts. For example, let’s say we wanted five rows of mtcars.
mtcars[sample(1:nrow(mtcars), 5), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Of course, for that kind of thing, dplyr has you covered with
sample_n(mtcars, 5)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Or, you can even use a proportion - so, here’s a random sample of 10% of mtcars.
sample_frac(mtcars, 0.1)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
But that is neither here nor there.
One thing that all of these sample functions let you do is sample with or without replacement. If we’re assuming a sample is representative of a population and we want to bootstrap resample it, we’re going to need to sample with replacement. Fortunately, that’s just one more argument.
sample(1:3, 15, replace = TRUE)
## [1] 3 3 1 3 3 2 2 3 3 1 1 2 3 2 1
This will come in handy shortly!
group_by() for simulation
The other tool we’ll need to simulate sampling a population is a data frame to hold our simulations. Let’s go through this step by step. First, let’s say we are going to have 10 samples per sample size, and try sample sizes from 3 to 50. We begin by creating a data frame with this information, as we’re going to want to plot this information later.
#use rep to repeat the sample sizes 3 through 50 10 times
sampSim <- data.frame(samp_size = rep(3:50, times = 10))
We also want to keep track of simulation number, as we’re going to want to perform independent actions for each sim.
sampSim$sim_number = 1:nrow(sampSim)
Now, we’re ready to go! What we want to do is, for each simulation, generate a sample population and take a mean. So, something like
mean(rnorm(n = 10, mean = 10, sd = 3))
Or replace that rnorm bit with the sample() function and a vector of interest. We’ll do it both ways.
So, we want to work this into a dplyr workflow. Now, what I often do is write out, in comments, what I want to do, and then implement that in dplyr. For example:
# Take the data frame
# For each individual simulation
# Use samp_size to get the mean of a random normal population with mean 10 and sd 3
# Also draw a mean from the samp vector with n = samp_size, with replacement
Reasonable, no? You could have written this out. Now, let’s operationalize this. We’ll group by sim_number, as we’re going to do the same thing for each replicate simulation.
# Take the data frame
sampSim <- sampSim %>%
# For each individual simulation
group_by(sim_number) %>%
# Use samp_size to get the mean of a random normal population with mean 10 and sd 3
mutate(mean_pop = mean(rnorm(samp_size, mean = 10, sd = 3)),
# Also draw a mean from the samp vector with n = samp_size, with replacement
mean_sample = mean(sample(samp, size = samp_size, replace=T))) %>%
# Cleanup
ungroup()
Let’s try this out, and have you fill in what’s missing in these faded examples.
#Some preparatory material
set.seed(42)
samp <- rnorm(100, 10, 3)
sampSim <- data.frame(samp_size = rep(3:50, times = 10))
sampSim$sim_number = 1:nrow(sampSim)
#Mean simulations
sampSim %>%
group_by(sim_number) %>%
mutate(mean_pop = mean(rnorm(samp_size, mean = 10, sd = 3)),
mean_sample = mean(sample(samp, size = samp_size, replace=T))) %>%
ungroup()
#Now the faded examples! Fill in the ___
#Median simulations
sampSim %>%
group_by(sim_number) %>%
mutate(median_pop = ____(rnorm(samp_size, mean = 10, sd = 3)),
median_sample = ____(sample(samp, size = samp_size, replace=T))) %>%
ungroup()
#SD simulations
sampSim %>%
group_by(____) %>%
____(sd_pop = ____(rnorm(____, mean = 10, sd = 3)),
sd_sample = ____(sample(samp, size = ____, ____))) %>%
ungroup()
#IQR simulations
#function for interquartile range is IQR
sampSim %>%
____(____) %>%
____(IQR_pop = ____(____(____, mean = 10, sd = 3)),
IQR_sample = ____(____(samp, size = ____, ____))) %>%
ungroup()
plot and summarize
Great, so we’ve done the simulation! How do we determine sample size? The first way is a plot.
plot(mean_pop ~ samp_size, data=sampSim)
We can eyeball this result and see a leveling out at sample sizes above 20 or so. OK, that’s great, but… totally imprecise.
Better might be to see where the SD of the mean levels off. Let’s pseudocode that out in comments to see what we might need to do.
# Take our sampSim data
# Group it by sample size
# Take the SD of the sample size
# ungroup the resulting data
# Sort it by SD
# Look at the top 5 entries.
A few new things here. First, we’re grouping by sample size, not sim number. Second, we’re summarizing. We are reducing our data down - boiling it to its essence. For this, dplyr has a function called - wait for it - summarize(). Third, we’re going to have to do some arranging, with a dplyr function called - and this is a shocker - arrange(). (sort was already taken by the base package.) OK, let’s walk through the resulting pipeline in two pieces. First, the summarization.
# Take our sampSim data
sampSim_summary <- sampSim %>%
# Group it by sample size
group_by(samp_size) %>%
# Take the SD of the sample size
summarize(pop_mean_sd = sd(mean_pop))
Neat - this gives us a much reduced piece of data to work with. But - what about the arranging?
sampSim_summary <- sampSim_summary %>%
# ungroup the resulting data
ungroup() %>%
# Sort it by SD
arrange(pop_mean_sd)
sampSim_summary
## # A tibble: 48 x 2
## samp_size pop_mean_sd
## <int> <dbl>
## 1 24 0.2945951
## 2 17 0.3102889
## 3 38 0.3397260
## 4 50 0.3591257
## 5 36 0.3949678
## 6 33 0.3996060
## 7 32 0.4183672
## 8 42 0.4212161
## 9 12 0.4293109
## 10 45 0.4390557
## # ... with 38 more rows
Now, notice that I don’t have to use head() or chop off all but the top few rows (there’s a dplyr function for that - slice()). That’s because dplyr creates a data frame-like object called a tibble. We’ll talk about this in more detail later. For now, tibbles work just like data frames, but they print out a little differently. If we REALLY wanted a data frame, we could just use as.data.frame().
As for the result, we can see that, eh, 24 has the lowest SD, but things bounce around, with even some sample sizes in the teens having nice properties. Now, to be sure, we should probably repeat this with 100 or even 1000 simulations per sample size.
Look at the resulting simulations using the sample. Would you chose a different optimal sample size?
Repeat this process, but for the sample median. What is the optimal sample size?
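As a hint for the median version, here is one possible pipeline - it is the same simulation as before with mean() swapped for median(), then summarized and sorted by sample size.

```r
library(dplyr)

#rebuild the sample and simulation frame
set.seed(42)
samp <- rnorm(100, 10, 3)
sampSim <- data.frame(samp_size = rep(3:50, times = 10))
sampSim$sim_number <- 1:nrow(sampSim)

#simulate medians, then look at the SD of the median by sample size
sampSim %>%
  group_by(sim_number) %>%
  mutate(median_sample = median(sample(samp, size = samp_size,
                                       replace = TRUE))) %>%
  ungroup() %>%
  group_by(samp_size) %>%
  summarize(median_sd = sd(median_sample)) %>%
  arrange(median_sd)
```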
So, as we launch into things like standard errors, I wanted to pause and hit a few key functions we need to describe a sample. Let’s use the samp vector from above as a data source. First, mean, sd, and median.
mean(samp)
## [1] 10.28201
sd(samp)
## [1] 2.881881
median(samp)
## [1] 10.0977
This is pretty great. There are a ton of functions like these. Such as
IQR(samp)
## [1] 4.15341
Note that there is no built-in function for the standard error of the mean or for a confidence interval, although there are functions for skew, kurtosis, and more in other packages. Last, for arbitrary percentiles, there’s the quantile() function.
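Since there is no built-in standard error function, it’s easy to roll your own from the definition SE = sd / sqrt(n). The function name here is just my own choice, and the 1.96 multiplier is the usual normal approximation for a 95% interval.

```r
#standard error of the mean, from its definition
se_mean <- function(x) sd(x) / sqrt(length(x))

se_mean(samp)

#a rough 95% confidence interval for the mean
mean(samp) + c(-1.96, 1.96) * se_mean(samp)
```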
#quartiles by default
quantile(samp)
## 0% 25% 50% 75% 100%
## 5.648247 8.072854 10.097696 12.226264 18.916196
#or, get what you want
quantile(samp, probs = c(0.05, 0.1, 0.5, 0.9, 0.95))
## 5% 10% 50% 90% 95%
## 6.264685 7.177671 10.097696 13.676815 14.484173
You can get individual values by specifying the percentile you want with probs.
Let’s look at the standard error of a median from our samp vector. And let’s do it via simulation. One way we can do this is to use the replicate() function. This lets us run some statement over and over again. So, let’s replicate taking the median of a sample (with replacement) of the samp vector. Then we can get the standard deviation of that.
median_sims <- replicate(n = 100, median(sample(samp, size=10, replace=TRUE)))
sd(median_sims)
## [1] 0.9480173
This is nice, but, man, that’s a lot of nested statements. The other way we could do this is with dplyr. Again, we can start with a blank data frame of 100 simulations, waiting to take flight.
median_sims <- data.frame(sims=1:100)
Now, we can take these and, for each simulation, draw a median from a sample of our samp vector.
median_sims <- median_sims %>%
group_by(sims) %>%
mutate(median_sim = median(sample(samp, size=10, replace = TRUE))) %>%
ungroup()
median_sims
## # A tibble: 100 x 2
## sims median_sim
## <int> <dbl>
## 1 1 10.030112
## 2 2 10.368041
## 3 3 10.880794
## 4 4 10.719992
## 5 5 9.629224
## 6 6 10.019199
## 7 7 10.784097
## 8 8 9.988440
## 9 9 9.366972
## 10 10 11.239412
## # ... with 90 more rows
Perfect! From this, we can get the standard error of our median quite simply.
sd(median_sims$median_sim)
## [1] 1.0201
Done!
To take this for a spin, let’s try some faded examples.
#some preparatory material
se_sims <- data.frame(sample_size=rep(5:50, 100))
median_se_sims <- se_sims %>%
group_by(1:n()) %>%
mutate(median_sim = median(sample(samp, size = sample_size, replace=TRUE))) %>%
ungroup() %>%
group_by(sample_size) %>%
summarize(median_se = sd(median_sim))
plot(median_se ~ sample_size, data=median_se_sims, type="l")
#mean se sims
mean_se_sims <- se_sims %>%
___(1:n()) %>%
mutate(mean_sim = mean(___(samp, size = sample_size, replace=TRUE))) %>%
ungroup() %>%
group_by(___) %>%
summarize(mean_se = sd(___))
plot(mean_se ~ sample_size, data=___, type="l")
#iqr se sims
iqr_se_sims <- se_sims %>%
___(1:n()) %>%
___(iqr_sim = IQR(___(samp, size = sample_size, replace=TRUE))) %>%
ungroup() %>%
group_by(___) %>%
summarize(iqr_se = sd(___))
plot(___ ~ ___, data=___, type="l")
#sd se sims
sd_se_sims <- se_sims %>%
___(1:n()) %>%
___(sd_sim = sd(___(samp, size = ___, ___=___))) %>%
ungroup() %>%
___(___) %>%
___(sd_se = sd(___))
plot(___ ~ ___, data=___, type="l")