OK, we can fit and evaluate linear regression models, but what about their power? Let’s look at our seal fit again - maybe we want to know what would have happened with a lower sample size?
Note, I’m going to use a knitr function, kable(), to neaten up the table for markdown. Try it in your own homeworks!
knitr::kable(tidy(seal_lm))
| term        | estimate    | std.error | statistic | p.value |
|-------------|-------------|-----------|-----------|---------|
| (Intercept) | 115.7667505 | 0.1763751 | 656.3668  | 0       |
| age.days    | 0.0023706   | 0.0000447 | 53.0645   | 0       |
What about the residual SD?
knitr::kable(glance(seal_lm)[,1:3])
| r.squared | adj.r.squared | sigma  |
|-----------|---------------|--------|
| 0.2256493 | 0.2255691     | 5.6805 |
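By the way, rather than transcribing these values by hand, you can pull them straight out of the fitted model object (assuming seal_lm is still in your workspace):

library(broom)

# intercept and slope from the coefficient table
tidy(seal_lm)$estimate

# residual SD from glance()
glance(seal_lm)$sigma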
All right - there are our target values. Now, we can change a lot of things: the effect size (slope), the range of predictor values, sigma, and more. But let’s try sample size.
The first step of a power analysis via simulation is to set up a data frame with all of the different possibilities that you might want to assess for power. For linear regression, the slope, intercept, residual SD, and sample size can all vary. Here, we’ll hold the first three at their fitted values and vary sample size alone.
library(dplyr)

set.seed(607)

simPopN <- data.frame(slope = 0.00237,
                      intercept = 115.767,
                      sigma = 5.6805,
                      n = 10:100)
Note, if we wanted to vary more than one parameter, we’d first create a data frame where only one parameter varied, then add a second that varied using crossing() from tidyr, like so:
library(tidyr)

simPopN <- data.frame(slope = 0.00237,
                      intercept = 115.767,
                      sigma = 2:6) %>%
  crossing(n = 10:100)
OK, once we have everything in place, including sample size, we need to expand this out so that each n gets some number of replicate rows. For that, we can use the expand() function in tidyr (the same library as crossing()).
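To see what expand() does inside a grouped data frame, here’s a tiny made-up example (the data frame is invented purely for illustration):

library(dplyr)
library(tidyr)

# within each group, expand() builds one row per requested value -
# here each value of n gets rows reps = 1, ..., n
data.frame(n = 2:3) %>%
  group_by(n) %>%
  expand(reps = 1:n)

# yields (n, reps) pairs: (2,1), (2,2), (3,1), (3,2), (3,3)

Now let’s apply the same trick to our simulation frame.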
library(tidyr)

simPopN <- simPopN %>%
  group_by(slope, intercept, sigma, n) %>%
  expand(reps = 1:n) %>%
  ungroup()
Now, if we want to simulate each of these, say, 100 times, we need to assign unique sim numbers, so that for each n and sim number we have a unique data set. We can use crossing() to replicate each combination of variables above some number of times. Note - 100 simulations is really low, but it won’t eat your processor! Use low numbers of simulations for development, then crank them up for the final analysis.
simPopN <- simPopN %>%
  crossing(sim = 1:100)
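As a quick sanity check (my addition, not part of the original walkthrough): every value of n contributes n replicate rows, and crossing() then copies each of those 100 times, so the row count should be sum(10:100) * 100.

nrow(simPopN)       # 500500
sum(10:100) * 100   # 500500 - n reps per value of n, times 100 sims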
Great - almost ready to go! Now we just need to add in the simulated values. Fortunately, as rnorm() works with vectors, we can just use a mutate() here. We’ll also need to simulate random draws of ages, but that’s just another random number - this time from runif().
simPopN <- simPopN %>%
  mutate(age.days = runif(n(), 1000, 8500)) %>%
  mutate(length.cm = rnorm(n(), intercept + slope * age.days, sigma))
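If you want to eyeball one simulated data set before fitting anything - an optional check of my own, not part of the core workflow - filter down to a single n and sim and plot it with ggplot2:

library(ggplot2)

# peek at one simulated data set as a sanity check
simPopN %>%
  filter(n == 100, sim == 1) %>%
  ggplot(aes(x = age.days, y = length.cm)) +
  geom_point() +
  stat_smooth(method = "lm")

If the cloud of points and the fitted line look like a plausible seal growth relationship, we’re in business.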
Yahtzee! Ready to run!
First, we now need to generate a lot of fit models. dplyr doesn’t take too kindly to having model objects sitting in a regular data frame, so we can use two powerful functions here. First, nest() and unnest() allow us to collapse grouped data down into little self-contained data frames and then re-expand them.
fits <- simPopN %>%
  group_by(slope, intercept, sigma, n, sim) %>%
  nest()
fits
## # A tibble: 9,100 Ă— 6
## slope intercept sigma n sim data
## <dbl> <dbl> <dbl> <int> <int> <list>
## 1 0.00237 115.767 5.6805 10 1 <tibble [10 Ă— 3]>
## 2 0.00237 115.767 5.6805 10 2 <tibble [10 Ă— 3]>
## 3 0.00237 115.767 5.6805 10 3 <tibble [10 Ă— 3]>
## 4 0.00237 115.767 5.6805 10 4 <tibble [10 Ă— 3]>
## 5 0.00237 115.767 5.6805 10 5 <tibble [10 Ă— 3]>
## 6 0.00237 115.767 5.6805 10 6 <tibble [10 Ă— 3]>
## 7 0.00237 115.767 5.6805 10 7 <tibble [10 Ă— 3]>
## 8 0.00237 115.767 5.6805 10 8 <tibble [10 Ă— 3]>
## 9 0.00237 115.767 5.6805 10 9 <tibble [10 Ă— 3]>
## 10 0.00237 115.767 5.6805 10 10 <tibble [10 Ă— 3]>
## # ... with 9,090 more rows
Second, the map() function in the purrr library allows us to iterate over the elements of a list - such as a list-column in a grouped data frame - and apply some function to each one. In this case, we’ll fit a model to each nested data frame and get its coefficients using broom. This is a new, admittedly weird, set of functions. The first argument to map() is the list to iterate over. Within the function we supply, each element of that list gets referred to as . - and we use the ~ notation. What ~ says is that, from this point forward, . refers to the current element of the list given to map().
In essence, what map() does is iterate over each element of a list given to it. Once we nest, the data column is a list, with each element of the list its own unique data frame. So map() works through that list, applying a function to each element. The output of the model fitting is another list-column - now called mod. A list of models. We then iterate with map() over that list of models to generate yet another list of data frames - this time of coefficients.
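If the ~ and . notation is still opaque, here’s a minimal sketch using a made-up list, just to show the moving parts:

library(purrr)

# map() applies a function to each element of a list;
# ~ mean(.) is a one-off function where . stands for the current element
map(list(1:3, 4:6), ~ mean(.))

# exactly equivalent, without the formula shorthand
map(list(1:3, 4:6), mean)

Now let’s do the real thing with our nested data frames.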
library(purrr)
library(broom)

fits <- fits %>%
  mutate(mod = map(data, ~ lm(length.cm ~ age.days, data = .))) %>%
  mutate(coefs = map(mod, ~ tidy(.)))
fits
## # A tibble: 9,100 Ă— 8
## slope intercept sigma n sim data mod
## <dbl> <dbl> <dbl> <int> <int> <list> <list>
## 1 0.00237 115.767 5.6805 10 1 <tibble [10 Ă— 3]> <S3: lm>
## 2 0.00237 115.767 5.6805 10 2 <tibble [10 Ă— 3]> <S3: lm>
## 3 0.00237 115.767 5.6805 10 3 <tibble [10 Ă— 3]> <S3: lm>
## 4 0.00237 115.767 5.6805 10 4 <tibble [10 Ă— 3]> <S3: lm>
## 5 0.00237 115.767 5.6805 10 5 <tibble [10 Ă— 3]> <S3: lm>
## 6 0.00237 115.767 5.6805 10 6 <tibble [10 Ă— 3]> <S3: lm>
## 7 0.00237 115.767 5.6805 10 7 <tibble [10 Ă— 3]> <S3: lm>
## 8 0.00237 115.767 5.6805 10 8 <tibble [10 Ă— 3]> <S3: lm>
## 9 0.00237 115.767 5.6805 10 9 <tibble [10 Ă— 3]> <S3: lm>
## 10 0.00237 115.767 5.6805 10 10 <tibble [10 Ă— 3]> <S3: lm>
## # ... with 9,090 more rows, and 1 more variables: coefs <list>
Last, we clean up - we unnest(), which takes the list-columns from above and expands them back out into full data frames slotted into our original data frame. Nice trick, no?
We’ll also filter down to just the slope coefficient.
fits <- fits %>%
  unnest(coefs) %>%
  ungroup() %>%
  filter(term == "age.days")
fits
## # A tibble: 9,100 Ă— 10
## slope intercept sigma n sim term estimate std.error
## <dbl> <dbl> <dbl> <int> <int> <chr> <dbl> <dbl>
## 1 0.00237 115.767 5.6805 10 1 age.days 0.0004283944 0.0010318274
## 2 0.00237 115.767 5.6805 10 2 age.days 0.0021239104 0.0005736028
## 3 0.00237 115.767 5.6805 10 3 age.days 0.0033531879 0.0009399613
## 4 0.00237 115.767 5.6805 10 4 age.days 0.0032566962 0.0008342080
## 5 0.00237 115.767 5.6805 10 5 age.days 0.0028332982 0.0009062293
## 6 0.00237 115.767 5.6805 10 6 age.days 0.0028501936 0.0005176723
## 7 0.00237 115.767 5.6805 10 7 age.days 0.0021677718 0.0004457464
## 8 0.00237 115.767 5.6805 10 8 age.days 0.0034636535 0.0006116823
## 9 0.00237 115.767 5.6805 10 9 age.days 0.0012147324 0.0012945056
## 10 0.00237 115.767 5.6805 10 10 age.days 0.0039061730 0.0011307423
## # ... with 9,090 more rows, and 2 more variables: statistic <dbl>,
## # p.value <dbl>
Notice that we do indeed have p-values, so we can use these fits to get power for each sample size. We can now do our normal process - in this case grouping by sample size, then calculating the fraction of simulations where p ≤ 0.05 - to get power. And then we can plot the result!
pow <- fits %>%
  group_by(n) %>%
  summarise(power = 1 - sum(p.value > 0.05) / n()) %>%
  ungroup()

library(ggplot2)

qplot(n, power, data = pow, geom = c("point", "line")) +
  theme_bw(base_size = 17) +
  geom_hline(yintercept = 0.8, lty = 2)
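As a convenience - my addition, not part of the original walkthrough - you can also pull out the smallest sample size whose simulated power clears the 0.8 line:

# smallest n that achieves at least 80% power in the simulation
pow %>%
  filter(power >= 0.8) %>%
  summarise(min_n = min(n))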
Let’s take a look at the whole workflow, this time trying a bunch of standard deviation values.
## setup parameters
simSD <- data.frame(slope = 0.00237,
                    intercept = 115.767,
                    sigma = seq(2, 6, length.out = 5),
                    n = 100) %>%
  group_by(slope, intercept, sigma, n) %>%
  # add replicates for each sample size
  expand(reps = 1:n) %>%
  ungroup() %>%
  # add sims
  crossing(sim = 1:100) %>%
  # add fake data
  mutate(age.days = runif(n(), 1000, 8500)) %>%
  mutate(length.cm = rnorm(n(), intercept + slope * age.days, sigma))
## make your fits
fitsSD <- simSD %>%
  group_by(slope, intercept, sigma, n, sim) %>%
  nest() %>%
  # mapping
  mutate(mod = map(data, ~ lm(length.cm ~ age.days, data = .))) %>%
  mutate(coefs = map(mod, ~ tidy(.))) %>%
  # unnest and filter
  unnest(coefs) %>%
  ungroup() %>%
  filter(term == "age.days")
## calculate and plot power
powSD <- fitsSD %>%
  # sigma is what varies here (n is fixed at 100), so group by sigma
  group_by(sigma) %>%
  summarise(power = 1 - sum(p.value > 0.05) / n()) %>%
  ungroup()

qplot(sigma, power, data = powSD, geom = c("point", "line")) +
  theme_bw(base_size = 17) +
  geom_hline(yintercept = 0.8, lty = 2)