Every probability distribution in R - whether in the base package or in other packages - typically obeys a standard naming convention. For some distribution dist, ddist is the function for its density, pdist gives you the probability of observing a value less than or equal to it (which is where p-values come from), qdist allows you to get the value of an observation that corresponds to a given quantile or p-value, and rdist uses the distribution to generate random numbers.
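To make the convention concrete, here's a minimal sketch using the normal distribution:

# d: density of an observation of 0 from a standard normal
dnorm(0, mean = 0, sd = 1)

# p: probability of observing a value less than or equal to 0
pnorm(0, mean = 0, sd = 1)

# q: the value that 50% of observations fall at or below
qnorm(0.5, mean = 0, sd = 1)

# r: five random draws from a standard normal
rnorm(5, mean = 0, sd = 1)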
A normal distribution in R is defined by its mean and standard deviation. To get a point density for a normal distribution, dnorm() accepts an observed value, a mean argument, and an sd argument. The default mean and SD are those of a standard normal distribution (mean = 0, sd = 1).
Let’s say we wanted to get the density of an observation of 5 from a distribution with a mean of 8 and an sd of 3.
dnorm(5, mean = 8, sd = 3)
## [1] 0.08065691
Woo! Number!
So, I’ve been plotting all kinds of distributions over the last few weeks. To do that, all I’ve needed was the density over a wide range of values. So, using this distribution, let’s plot the normal curve from 0 to 16. For this, we’d give dnorm a vector instead of a single observation.
dens_frame <- data.frame(x = seq(0, 16, length.out = 200))
#note - you could do this with mutate! But I like $
dens_frame$density <- dnorm(dens_frame$x, mean = 8, sd = 3)
That produces a lot of numbers - but we can feed them easily into ggplot2.
library(ggplot2)
norm_dens <- ggplot(dens_frame,
mapping=aes(x=x, y=density)) +
geom_line()
norm_dens
Neat!
We’ve also already seen normal distributions in action to generate random variables.
rnorm(10, mean = 8, sd=3)
## [1] 6.308199 10.575167 9.638887 9.392137 9.415474 7.043890 3.751772
## [8] 4.949667 5.915964 10.836236
What about discrete distributions? For the Poisson distribution, we have the dpois() function for density, which just takes lambda as its argument to define the distribution. So, assuming a lambda of 10, what’s the probability of an observation of 3?
dpois(3, lambda=10)
## [1] 0.007566655
Note, if you enter a non-integer number, you get a warning and a density of 0.
dpois(1.5, lambda=10)
## Warning in dpois(1.5, lambda = 10): non-integer x = 1.500000
## [1] 0
Negative numbers just give a probability of 0, which is sensible. They are not possible.
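A quick check, using the same lambda of 10:

dpois(-2, lambda = 10) # returns a density of 0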
Plotting is a wee bit trickier, as these distributions are discrete.
pois_frame = data.frame(x = 0:20)
#note - you could do this with mutate! But I like $
pois_frame$density <- dpois(pois_frame$x, lambda = 10)
So how to plot? Line plots are… dishonest. We need to show the discrete nature of the distribution. So, we have geom_bar to come to our rescue. stat="identity" means to just plot the value that is supplied, rather than a count. And we can monkey with the bar width to emphasize the discrete nature of the distribution.
ggplot(pois_frame) +
aes(x=x, y=density) +
geom_bar(width=0.1, stat="identity")
OK, let’s try getting the density of a single value and plotting the density curve for a few different distributions.
# What's the density of 4
# from a standard lognormal?
dlnorm(4, meanlog=0, sdlog = 1)
#plot the density
vals <- seq(0,10, length.out=500)
dens <- dlnorm(vals, meanlog = 0, sdlog = 1)
ggplot(mapping = aes(x=vals, y=dens)) +
geom_line() +
geom_vline(xintercept = 4)
dgamma(___, scale = 2, shape = 5)
#plot the density
vals <- seq(0,20, length.out = ___)
dens <- dgamma(___, scale = 2, shape = 5)
ggplot(mapping = aes(x=vals, y=dens)) +
geom_line() +
geom_vline(xintercept = ___)
dexp(___, rate = ___)
#plot the density
vals <- ___(0,50, length.out=___)
dens <- ___(___, rate = ___)
ggplot(mapping = aes(x=___, y=___)) +
geom_line() +
geom_vline(___ = ___)
dbinom(___, size = ___, prob = ___)
#plot the density
vals <- 0:100
dens <- ___(___, ___ = ___, ___=___)
ggplot(____ = ___(x=___, y=___)) +
geom_bar(width=0.6, stat="identity") +
___(___ = ___)
From getting densities it’s a hop, skip, and a jump to getting p-values. With densities, we were getting point probabilities. With p-values, we invoke the cumulative distribution function (which starts with p). Think about it. Quantiles are non-parametric p-values, if you will: a quantile states the value that X% of a sample is less than or equal to, and the p function runs that relationship in reverse, giving you the probability of being less than or equal to a given value.
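As a minimal sketch of the idea, using the standard normal:

# probability of observing a value less than or equal to 0
# (half the distribution lies below the mean, so this is 0.5)
pnorm(0)

# and qnorm runs it in reverse - the value that 97.5% of
# observations fall at or below (about 1.96)
qnorm(0.975)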
Using that as a jumping off point, let us remember that quantiles are ordered from smallest value to largest value. They are inherently 1-tailed. Let’s take an example of surveying coral reefs after a hurricane. We survey 500 reefs one year after a hurricane, and want to know if the hurricane has had an effect. We know that reefs get damaged from time to time by any number of things - not just storms. Our expectation based on past surveys is that there’s a 30% chance of a reef being damaged. So, a 70% chance of it being undamaged. We survey the reefs, and find that 300 are undamaged.
What is the probability of observing 300 or fewer undamaged reefs?
pbinom(300, size = 500, prob = 0.7)
## [1] 1.231434e-06
Excellent. So, good chance that this hurricane had an effect! What if we had wanted to do this with the damaged information instead?
pbinom(200, size = 500, prob = 0.3)
## [1] 0.9999992
Oh that’s weird. A really high p-value. Why?
Well, by default, these p functions look at the lower tail of a distribution. If we’re talking about the number of damaged reefs, and want to know if the number is higher than usual, then we want to look at the upper tail. This is fairly straightforward with one additional argument. Or you can cleverly use some subtraction for many well behaved distributions.
pbinom(200, size = 500, prob = 0.3, lower.tail=FALSE)
## [1] 7.770953e-07
1 - pbinom(200, size = 500, prob = 0.3)
## [1] 7.770953e-07
What if, instead of assuming that a hurricane would cause more damage, we weren’t sure one way or the other? For example, maybe the hurricane stimulated so much growth that it could have mitigated the propensity of a reef to be damaged!
Simply put, we take the p-value from the appropriate tail and double it.
2*pbinom(300, size = 500, prob = 0.7)
## [1] 2.462868e-06
There are all sorts of tricks one can do to minimize coding here for different tests. For example, for a z-test, one can take the absolute value of the z-score so that it’s positive, then use lower.tail=FALSE, and multiply the whole shebang by 2. You’ll see this with t-tests next week.
For example, let’s assume a sample of 5 flies with a mean antenna length of 0.4mm. The population of fly antenna lengths has a mean of 0.5mm with an SD of 0.1mm.
#Calculate SE based on the population
sigma_ybar = 0.1/sqrt(5)
#Z-Score
z <- (0.4 - 0.5)/sigma_ybar
#P-Value
2*pnorm(abs(z), lower.tail=FALSE)
## [1] 0.02534732
What if you don’t know your distribution, or it’s something weird? Let’s go back to the tapir example from class.
So, we know that tapirs have a 90% chance, with even probability, of having a nose that is 5-10cm in length. And a 10% chance, with an even distribution of probability, of having noses from 10 to 15cm in length. Greater than that or less than that means that those tapirs have experienced the wrath of Papa Darwin.
First, how do we build a null distribution?
Simply put, start by making a vector of all of the possible values you might want.
nose_length <- c(seq(5,15, length.out=2000))
OK, we know that if nose_length ≤ 10, there’s a 90% chance of seeing one of those values. Otherwise, a 10% chance. So first we find the index of where nose length begins being over 10. which is a great function that returns the indices of a vector where a condition is TRUE.
which(nose_length>10)[1]
## [1] 1001
1001 - cool, so, right in the middle. That means the first 1000 values have a high chance of being chosen, and the second 1000 have a low chance of being chosen.
If you’re in the first 1000, each individual size has a 0.9/1000 chance of being observed. If you’re in the second 1000, each size has a 0.1/1000 chance of being observed. Let’s turn that into a vector.
prob_observed = c(rep(0.9/1000, 1000), rep(0.1/1000, 1000))
Now we can use sample to take replicate samples of a fictional population to create a distribution. Let’s use a population size of 100000 - although you could go larger. We’ll sample from the list of nose lengths, with replacement, but use the prob argument to specify the chance of each length being chosen, using our prob_observed vector.
samp <- sample(nose_length, 100000, replace=TRUE,
prob = prob_observed)
OK, you have a null population. How do you get a p-value from that population sample? Let’s say, for example, we have a tapir with a nose length of 14.5. Is it a tapir, or a faux tapir?
Given that p-values are the probability of observing a value or something greater than it, this translates to the fraction of individuals in a population that are equal to or greater than some value. So, what fraction of the observations in your simulated population are equal to or greater than 14.5?
We can assess this a number of ways. Using our which
function above, we can look at the length of the output vector.
Or, in any comparison that produces boolean values, TRUE
= 1 and FALSE
= 0. So we could sum up a comparison.
length(which(samp >= 14.5))/length(samp)
## [1] 0.01003
sum(samp >= 14.5)/length(samp)
## [1] 0.01003
We can of course make this a two-tailed comparison by doubling our simulated p-value.
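Following that logic, a minimal sketch using the simulated population from above:

# double the upper-tail fraction for a two-tailed simulated p-value
2 * sum(samp >= 14.5)/length(samp)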
To evaluate how power works via simulation, let’s walk through an example with a z-test. Let’s assume an average population-wide resting heart rate of 80 beats per minute with a standard deviation of 6 BPM.
A given drug speeds people’s heart rates up on average by 5 BPM. What sample size do we need to achieve a power of 0.8?
To begin with, we need to create a simulated data frame of observations at different sample sizes. We did this two weeks ago using dplyr, so making a data frame with, let’s say, 500 replicates of each sample size from 1 to 10 shouldn’t be too onerous. We’ll draw our random samples from a normal distribution with a mean of 85 BPM and an SD of 6 BPM.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#first make a data frame with sample sizes
sim_data <- data.frame(samps = rep(1:10, 500)) %>%
#a trick - group by 1:n() means you don't need sim numbers!
group_by(1:n()) %>%
#get a sample mean
mutate(sample = mean(rnorm(samps, 85, 6))) %>%
#clean up
ungroup()
OK, now you have the data….so, let’s do a z-test! We can define a variable for the population standard error of the mean, and then use that to calculate z, and then calculate p.
sim_data <- sim_data %>%
mutate(se_y = 6/sqrt(samps)) %>%
mutate(z = (sample-80)/se_y) %>%
mutate(p = 2*pnorm(abs(z), lower.tail=FALSE))
Cool! We have p values! We can even plot this by sample size.
ggplot(data = sim_data, mapping = aes(x=samps, y=p)) +
geom_jitter(alpha=0.4)
Now - power! We know that our type II error rate, beta, is the fraction of p values that are greater than our alpha (here 0.05). Great! Power is then 1 - beta.
Note, I’m going to use n() within a group here rather than trying to remember how big my number of replicate simulations for each sample size was. Protect yourself! Practice lazy coding!
sim_power <- sim_data %>%
group_by(samps) %>%
summarise(power = 1-sum(p>0.05)/n()) %>%
ungroup()
And we can plot that result with our critical threshold for power.
ggplot(data=sim_power, mapping = aes(x=samps, y=power)) +
geom_point() +
geom_line() +
geom_hline(yintercept=0.8, lty=2)
Wow - pretty good!
OK, let’s take the above example, and now, assume that the drug only increased the BPM by 2, the population has a SD of 7, and our alpha is 0.01. What is our power?
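If you want a scaffold to start from, here’s one possible sketch (sim_data2 and sim_power2 are just placeholder names) that re-runs the same pipeline with a 2 BPM effect, an SD of 7, and an alpha of 0.01 - you’d still need to plot or inspect the result to answer the question.

# simulate sample means with a 2 BPM effect (mean 82) and SD of 7
sim_data2 <- data.frame(samps = rep(1:10, 500)) %>%
  group_by(1:n()) %>%
  mutate(sample = mean(rnorm(samps, 82, 7))) %>%
  ungroup() %>%
  # z-test against the population mean of 80 with SD of 7
  mutate(se_y = 7/sqrt(samps),
         z = (sample - 80)/se_y,
         p = 2*pnorm(abs(z), lower.tail = FALSE))

# power at alpha = 0.01
sim_power2 <- sim_data2 %>%
  group_by(samps) %>%
  summarise(power = 1 - sum(p > 0.01)/n())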
Let’s say we wanted to look at multiple alphas. Well, we’d need another column for alpha! And we want all possible combinations of alpha and the rows of sim_data. For that, we can use crossing() from the tidyr library, which generates all possible combinations of the values in two objects.
library(tidyr)
sim_tradeoff <- sim_data %>%
#more alphas!
crossing(alpha = 10^(-1:-10)) %>%
#now calculate power for each alpha
group_by(samps, alpha) %>%
summarise(power = 1-sum(p>alpha)/500) %>%
ungroup()
We can then plot this with ggplot.
ggplot(sim_tradeoff) +
aes(x=samps, y=power, color=factor(alpha)) +
geom_point() + geom_line() +
xlab("Sample Size") +
#just a little note on greek letters and re-titling
#colorbars
scale_color_discrete(guide = guide_legend(title=expression(alpha)))
Let’s start from scratch. Assume a mean effect of 5 again. But now, make a simulated data frame that not only looks at multiple sample sizes, but also multiple SD values. You’re going to want crossing with your initial data frame of sample sizes and a vector of sd values from 3 to 10 (just iterate by 1).
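One possible way to build that data frame is sketched below - sim_sd, sim, and sd_pop are placeholder names, and the sim column just keeps each replicate as its own row.

# every combination of sample size, replicate id, and population SD
sim_sd <- crossing(samps = 1:10,
                   sim = 1:500,
                   sd_pop = 3:10) %>%
  # one sample mean per row, drawn with that row's n and SD
  group_by(1:n()) %>%
  mutate(sample = mean(rnorm(samps, 85, sd_pop))) %>%
  ungroup()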
OK, now that you’ve done that, calculate the results from z-tests. Plot p by sample size, using facet_wrap
for different SD values.
Now plot power, but use color for different SD values. Include our threshold power of 0.8.
Last, use crossing
again to explore changing alphas from 0.01 to 0.1. Plot power curves with different alphas as different colors, and use faceting to look at different SDs.
What do you learn about how alpha and SD affect power?
How do you think that changing the effect size would affect power? You can just answer this without coding out anything. Based on what we’ve learned so far - what do you think?