1. Put it to the Test

Today we’re going to begin our exploration of how R handles statistical tests. Within R, there are many functions that enable statistical testing, and they typically force you to be explicit about what you want R to do. In today’s lab, we’re going to first examine the simplicity of the \(\chi^2\) test and begin to build a workflow for data analysis. We’ll then dig into the t-test and more formally develop that workflow.

2. \(\chi^2\)

The \(\chi^2\) test is a fairly straightforward test. It and its variations are ubiquitous in evaluating expectations. We also see it in likelihood ratio tests and evaluations of model fit across a variety of other techniques, all verifying whether or not our models adequately describe the data at hand.
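Concretely, the test compares observed counts \(O_i\) to expected counts \(E_i\) across \(k\) categories:

\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]

with \(k - 1\) degrees of freedom when the expected proportions are specified in advance.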

2.1 Day of Birth

In class we used a dataset looking at day of birth. Let’s load that up and visualize it.

#load some libraries
library(readr)
library(ggplot2)

births <- read_csv("data/05/08e1DayOfBirth.csv")
## Parsed with column specification:
## cols(
##   Day = col_character(),
##   `Number of births` = col_integer()
## )
births
## # A tibble: 7 x 2
##   Day       `Number of births`
##   <chr>                  <int>
## 1 Sunday                    33
## 2 Monday                    41
## 3 Tuesday                   63
## 4 Wednesday                 63
## 5 Thursday                  47
## 6 Friday                    56
## 7 Saturday                  47

Next, what’s our expected value? If we assume each day should have an equal number of births, then…

expected <- sum(births$`Number of births`)/nrow(births)

expected
## [1] 50

Just seeing these numbers might be enough. We can also visualize things:

ggplot(data=births, mapping = aes(x=Day, y=`Number of births`)) +
  geom_point() +
  geom_hline(yintercept = expected)

But what about the test? We don’t have to worry about sample size here, and our expectation is nothing fancy - an even distribution. So, we can use the chisq.test() function.

chisq.test(births$`Number of births`)
## 
##  Chi-squared test for given probabilities
## 
## data:  births$`Number of births`
## X-squared = 15.24, df = 6, p-value = 0.01847

The output is fairly straightforward - we get our test statistic, DF, and p-value.
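To see where that test statistic comes from, we can compute it by hand from the observed counts and our expectation of 50 births per day - a quick sketch:

```r
# observed births per day, from the data above
obs <- c(33, 41, 63, 63, 47, 56, 47)
expected <- sum(obs) / length(obs)  # 50

# chi-squared statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((obs - expected)^2 / expected)
x2
## [1] 15.24

# p-value from a chi-squared distribution with k - 1 = 6 degrees of freedom
pchisq(x2, df = length(obs) - 1, lower.tail = FALSE)
```

This matches the chisq.test() output above.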

2.2 Mass Extinctions and Non-Even Probabilities

We don’t have to use an even probability distribution. We might instead have an idea of how values should be distributed between cells. For example, let’s look at the number of families going extinct in major extinction events over geological time.

extinct <- read_csv("data/05/08e6MassExtinctions.csv")
## Parsed with column specification:
## cols(
##   `Number of extinctions` = col_integer(),
##   Frequency = col_integer()
## )
ggplot(extinct, mapping=aes(x=`Number of extinctions`, y=Frequency)) +
  geom_line()

So, few events where many families went extinct, but many where only a few families went extinct. And 0 where 0 went extinct (hey, it wouldn’t be an extinction event, otherwise).

One distribution that matches this kind of shape is the negative binomial, describing the number of successes (families going extinct) before a given number of failures (the extinction event ending). In our data, we see 3.6 extinctions on average, and, for fun, let’s assume an overdispersion parameter of 3.
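As a quick check on that parameterization, the mu argument of dnbinom() really is the distribution’s mean:

```r
# E[X] = sum of x * P(X = x); truncating the sum at 1000 loses a negligible tail
sum(0:1000 * dnbinom(0:1000, mu = 3.6, size = 3))
## [1] 3.6
```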

From this, we can generate a vector of expectations by first getting the proportion of data that should be in each cell from the probability mass function, and then multiplying it by the total number of observations.

#get raw frequencies
freqs <- dnbinom(0:20, mu = 3.6, size = 3)

#make sure they sum to one
freqs <- freqs/sum(freqs)

extinct$expected <- freqs*sum(extinct$Frequency)

We can even plot the match/mismatch

ggplot(data=extinct, mapping=aes(x=expected, y=Frequency)) +
  geom_point()

Eh, not bad.

And then put it to the test, using the p argument to incorporate our new expected frequencies.

chisq.test(x = extinct$Frequency,
           p = freqs)
## Warning in chisq.test(x = extinct$Frequency, p = freqs): Chi-squared
## approximation may be incorrect
## 
##  Chi-squared test for given probabilities
## 
## data:  extinct$Frequency
## X-squared = 183.32, df = 20, p-value < 2.2e-16

Note the warning message. This indicates that there might be something fishy here - and, indeed, we know from looking at the data that many cells have <5 observations in them, suggesting we might want to rethink whether this is the right approach.
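One hedge against small expected counts is to skip the asymptotic approximation entirely: chisq.test() has a simulate.p.value argument that computes a Monte Carlo p-value instead. A sketch (with made-up counts standing in for extinct$Frequency so it runs on its own):

```r
# expected proportions, as above
freqs <- dnbinom(0:20, mu = 3.6, size = 3)
freqs <- freqs / sum(freqs)

# made-up observed counts standing in for extinct$Frequency
obs <- c(0, 13, 15, 16, 7, 10, 4, 2, 1, 2, 1, 1, 0, 0, 1, 0, 2, 0, 0, 1, 0)

# B Monte Carlo replicates instead of the chi-squared approximation
chisq.test(x = obs, p = freqs, simulate.p.value = TRUE, B = 10000)
```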

2.3 Faded Examples

So, to review, here’s the births example, where we’ll load, visualize, and test.

#Load
births <- read_csv("data/05/08e1DayOfBirth.csv")

#calculate expected values
expected <- sum(births$`Number of births`)/nrow(births)

#visualize
ggplot(data=births, mapping = aes(x=Day, y=`Number of births`)) +
  geom_point() +
  geom_hline(yintercept = expected)

#test
chisq.test(births$`Number of births`)

Let’s try this for two other data sets.

First, days on which people buy Powerball tickets.

#load
powerball <- ____("data/05/08q02Powerball.csv")

#calculate expected values
expected_powerball <- sum(powerball$`Millions of tickets sold`)/nrow(powerball)

#visualize
ggplot(data=___, mapping = aes(x=Day, y=`Millions of tickets sold`)) +
  geom_point() +
  geom_hline(yintercept = ___)

##test
chisq.test(____$`Millions of tickets sold`)

Last, number of boys in 2-child families. We’d expect a distribution of 25:50:25 for 0, 1, or 2 boys. So, an uneven frequency.

#load
boys <- ____("data/05/08e5NumberOfBoys.csv")

#calculate expected values
freq_boys <- c(0.25, 0.5, 0.25)
boys$expected_boys <- freq_boys * sum(boys$Frequency)

#visualize
ggplot(data=___, mapping = ____(x=expected_boys, y=_____)) +
  geom_point()

##test
chisq.test(boys$_____,
           p = _______)

2.4 Contingency Tables

Briefly, let’s consider an extension of the \(\chi^2\) test, the contingency table. The first difference between contingency table data and what we’ve already considered is that there are multiple columns with categories instead of just one.

parasite <- read_csv("./data/09e3ParasiteBrainWarp.csv")

parasite
## # A tibble: 6 x 3
##   `infection status` eaten frequency
##   <chr>              <chr>     <int>
## 1 uninfected         yes           1
## 2 uninfected         no           49
## 3 light              yes          10
## 4 light              no           35
## 5 high               yes          37
## 6 high               no            9

How can we visualize this? Well, we have two options. The first is a stacked bar plot, which lets us see whether whatever we put on the x axis is roughly evenly distributed - and we can stack by either variable. Let’s look at it both ways, using patchwork to put the plots into one figure.

#devtools::install_github("thomasp85/patchwork") #if you don't have it.

library(patchwork)

a <- ggplot(parasite, 
       mapping = aes(fill=`infection status`, y = frequency, 
                     x=eaten)) +
  geom_col() 

b <- ggplot(parasite, 
       mapping = aes(x=`infection status`, y = frequency, 
                     fill=eaten)) +
  geom_col() 

a + b

Often, we want to view the contingency table as is. For that, we can use geom_raster to make colored blocks - and we can even include the numbers as a form of redundant coding.

ggplot(parasite, 
       mapping = aes(x=`infection status`, y = eaten, 
                     fill=frequency, label = frequency)) +
  geom_raster() +
  scale_fill_gradient(low = "white", high = "orange") +
  geom_text()

Our data, on the other hand, doesn’t look like the contingency table presented here. It needs a bit of reshaping for chisq.test to treat it as one. For that, we use the xtabs function - which, incidentally, generalizes to an array for higher-dimensional contingency tables.

cont_tab <- xtabs(frequency ~ eaten + `infection status`, data = parasite)

cont_tab
##      infection status
## eaten high light uninfected
##   no     9    35         49
##   yes   37    10          1

Perfect - now we can just plug it into chisq.test and get our answer.

chisq.test(cont_tab)
## 
##  Pearson's Chi-squared test
## 
## data:  cont_tab
## X-squared = 69.756, df = 2, p-value = 7.124e-16
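Under the hood, the expected counts for a contingency table come from the row and column margins, and chisq.test() stores them in its result. A quick sketch (rebuilding the table by hand so it stands alone):

```r
# rebuild the contingency table from the parasite data above
cont_tab <- matrix(c(9, 37, 35, 10, 49, 1), nrow = 2,
                   dimnames = list(eaten = c("no", "yes"),
                                   status = c("high", "light", "uninfected")))

res <- chisq.test(cont_tab)

# expected counts are (row total * column total) / grand total
res$expected
all.equal(unname(res$expected),
          unname(outer(rowSums(cont_tab), colSums(cont_tab)) / sum(cont_tab)))
## [1] TRUE

# and df = (rows - 1) * (columns - 1) = 1 * 2 = 2
res$parameter
```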

3. T-Tests

T-tests are among the most frequently used tests in data analysis. They’re delightfully simple, and provide a robust example of how to examine the entire workflow of a data analysis. These are steps you’ll take with any analysis you do in the future, no matter how complex the model!

3.1 One Sample T-Test

For a one sample t-test, we’re merely testing whether the mean of a vector of observations is different from zero, our null hypothesis. This can be a sample of observed values, or a sample of differences between paired treatments - anything!
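Formally, for \(n\) observations with sample mean \(\bar{x}\) and standard deviation \(s\), we compute

\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]

and compare it to a t distribution with \(n - 1\) degrees of freedom, where \(\mu_0\) is the null value (here, 0).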

Let’s look at the W&S data on blackbird immunoresponse before and after testosterone implants. So, first, load the data and visualize the change in immunoresponse.

blackbird <- read_csv("data/05/12e2BlackbirdTestosterone.csv")

ggplot(data = blackbird, mapping=aes(x=dif)) + 
  geom_density() +
  geom_vline(xintercept=0, lty=2)

So, right away we can see the problem with the distribution of the data.

Let’s proceed and ignore it for the moment.

The t.test() function gives us a lot of options.

t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)

We can feed it different alternatives (it defaults to two tails), specify a hypothesis we’re testing against (it defaults to a null hypothesis with mu=0), and we can give it either just a sample of observations, or also a vector with groups. We can also tell it whether we’re worried about equal variances, if we’re giving it paired data, and more.
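For example, a one-tailed test against a null of \(\mu = 5\) might look like this (the data here are simulated purely for illustration):

```r
set.seed(607)  # hypothetical data, just for illustration
x <- rnorm(20, mean = 6, sd = 2)

# one-tailed test against a null hypothesis of mu = 5
t.test(x, mu = 5, alternative = "greater")
```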

In this case, for a one-sample t-test, we just want to feed it our differences.

t.test(blackbird$dif)
## 
##  One Sample t-test
## 
## data:  blackbird$dif
## t = -1.5079, df = 12, p-value = 0.1575
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -15.98626   2.90934
## sample estimates:
## mean of x 
## -6.538462

That’s a fairly wonky output table, but we see all of the critical information - the value of t, our DF, our p-value, and both a mean estimate and confidence interval. Here we see a p-value of 0.157455 that suggests we fail to reject the null.

If you find this output ugly, there’s a wonderful package called broom that produces a standardized set of model output tidiers in a data frame format.

library(broom)
tidy(t.test(blackbird$dif))
## # A tibble: 1 x 8
##   estimate statistic p.value parameter conf.low conf.high method
##      <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr> 
## 1    -6.54     -1.51   0.157        12    -16.0      2.91 One S…
## # ... with 1 more variable: alternative <chr>

More on broom next week.

3.2 Two Sample T-Test

For a two sample t-test, we can feed in a vector of groups - or use a formula; t.test() is the first function we’ve seen that takes a formula for statistical tests. Let’s look at a data set examining the survival of juvenile chinook salmon (yum) in rivers with versus without brook trout.

First, we’ll load and plot the data.

chinook <- read_csv("data/05/12e4BrookTrout.csv")

ggplot(data=chinook, mapping=aes(x=`brook trout?`, y=`Mean chinook survival`)) +
  geom_boxplot() 

We have some unequal variances in the data here, which is of note. We also have two groups.

As with xtabs above, R has a formula syntax that we can use to specify a relationship where we have a predictor and response. Broadly, it looks like:

\[response \sim predictor\]

We can use this syntax with our t.test function, as it also accepts formulae. So, here’s our two-sample unpaired t-test with unequal variances:

chin_test <- t.test(`Mean chinook survival` ~ `brook trout?`, data = chinook,
       var.equal = FALSE)

chin_test
## 
##  Welch Two Sample t-test
## 
## data:  Mean chinook survival by brook trout?
## t = 1.8718, df = 5.7256, p-value = 0.1127
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02298513  0.16548513
## sample estimates:
## mean in group - mean in group + 
##       0.2340833       0.1628333

Great, we have our means per group, the difference between them, and we see we’re using a Welch’s t-test for unequal variance. Here we’d again fail to reject the null hypothesis.

If you want prettier output:

tidy(chin_test)
## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
## 1   0.0713     0.234     0.163      1.87   0.113      5.73  -0.0230
## # ... with 3 more variables: conf.high <dbl>, method <chr>,
## #   alternative <chr>

OK, not that much prettier, but you can put it in a table.

3.3 Evaluating Residuals

So, we’ve ignored the assumption of normality up until now. Broadly, in any statistical model, error comes in as a residual value. So, often our data may not be normally distributed, but after accounting for predictors, we find that the residuals are. To test whether residuals are normal, we need to, well, create residuals!

For a t-test this is easy, as residuals are just deviations from means - either from a single column’s mean, or from group means. We can thus use dplyr here. So for our one-sample t-test of blackbirds

library(dplyr)

blackbird <- blackbird %>%
  mutate(resid = dif - mean(dif))

For our two-sample t-test, we use group_by for group means.

chinook <- chinook %>%
  group_by( `brook trout?`) %>%
  mutate(mean_survival_resid = 
           `Mean chinook survival` - mean(`Mean chinook survival`)) %>%
  ungroup()
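As a sanity check, residuals computed this way should average to zero within each group. Here’s a sketch with made-up numbers standing in for the chinook data:

```r
library(dplyr)

# made-up survival values in two groups, standing in for the real data
dat <- data.frame(trout = rep(c("-", "+"), each = 4),
                  survival = c(0.21, 0.25, 0.30, 0.18, 0.12, 0.17, 0.20, 0.15))

dat <- dat %>%
  group_by(trout) %>%
  mutate(resid = survival - mean(survival)) %>%
  ungroup()

# group means of the residuals should be (numerically) zero
dat %>%
  group_by(trout) %>%
  summarise(mean_resid = mean(resid))
```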

We can then evaluate for normality. Let’s use the blackbird example. First, we’d look at the distribution of residuals.

ggplot(data = blackbird, mapping=aes(x=resid)) +
  geom_density()

Again, not looking good. But, who knows, this is a density estimated off of not many data points. Maybe we need something more accurate - like a qqplot. For a qqplot, we invoke two functions in base plot (if we want the fit line - ggplot2 still doesn’t do this, but give it time).

The functions are qqnorm and qqline. We use them sequentially.

qqnorm(blackbird$resid)
qqline(blackbird$resid)

Now we can see that systematic behavior in the lower tail.

We may still want to put it to the test, as it were, with a Shapiro-Wilk test. R provides the shapiro.test() function for this.

shapiro.test(blackbird$resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  blackbird$resid
## W = 0.93148, p-value = 0.3563

OK, so, what does that p-value mean? In this case, it means we would fail to reject the null hypothesis that these data come from a normal distribution. So, we should actually be OK going forward! This is one case where we don’t want a p-value smaller than our alpha.

Exercise - Repeat this normality analysis for the chinook salmon!

3.4 Plotting Results

For a one-sample t-test, plotting a result - a mean and SE - might not be necessary. But for a two-sample test, it’s highly informative! It should be the final step in any analysis in order to aid interpretation. Here, ggplot2’s stat_summary function is invaluable, as it defaults to plotting mean and standard errors.

salmon_means <- ggplot(data=chinook, 
                       mapping=aes(x=`brook trout?`, 
                                   y=`Mean chinook survival`)) +
  stat_summary(size=1.5)

salmon_means

Nice. If you want to see this relative to the data, you can still include it.

salmon_means+
  geom_jitter(color="red")
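Under the hood, mean_se() is just the group mean plus or minus one standard error of the mean; we can reproduce the summary ourselves with dplyr (again with made-up stand-in data):

```r
library(dplyr)

# made-up survival values in two groups, standing in for the real data
dat <- data.frame(trout = rep(c("-", "+"), each = 4),
                  survival = c(0.21, 0.25, 0.30, 0.18, 0.12, 0.17, 0.20, 0.15))

dat %>%
  group_by(trout) %>%
  summarise(mean_surv = mean(survival),
            se_surv = sd(survival) / sqrt(n()),
            ymin = mean_surv - se_surv,
            ymax = mean_surv + se_surv)
```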

3.5 Workflow and Faded Examples

As we’ve talked about, our general workflow for an analysis is

  1. Build a Test
  2. Evaluate Assumptions of Test
  3. Evaluate Results
  4. Visualize Results

If we’ve decided on a t-test, we’ve satisfied #1. So let’s go through a few examples where we load up our data, evaluate assumptions, evaluate the results of our test, and visualize the results.

We’ll start with the salmon example, all in one place.

#Load and visualize data
chinook <- read_csv("data/05/12e4BrookTrout.csv")
## Parsed with column specification:
## cols(
##   `brook trout?` = col_character(),
##   `chinook survival 1998` = col_double(),
##   `chinook survival 1993` = col_double(),
##   `chinook survival 1999` = col_character(),
##   `Mean chinook survival` = col_double(),
##   `chinook survival 1992` = col_double(),
##   arcsin = col_double(),
##   `Number survivors` = col_integer(),
##   `Number released` = col_integer()
## )
ggplot(data=chinook, mapping=aes(x=`brook trout?`, y=`Mean chinook survival`)) +
  geom_boxplot() 

## test assumptions
chinook <- chinook %>%
  group_by( `brook trout?`) %>%
  mutate(resid = 
           `Mean chinook survival` - mean(`Mean chinook survival`)) %>%
  ungroup()

#qq
qqnorm(chinook$resid)
qqline(chinook$resid)

shapiro.test(chinook$resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  chinook$resid
## W = 0.97827, p-value = 0.9758
#put it to the test!
t.test(`Mean chinook survival` ~ `brook trout?`, data = chinook,
       var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  Mean chinook survival by brook trout?
## t = 1.8718, df = 5.7256, p-value = 0.1127
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02298513  0.16548513
## sample estimates:
## mean in group - mean in group + 
##       0.2340833       0.1628333
ggplot(data=chinook, 
                       mapping=aes(x=`brook trout?`, 
                                   y=`Mean chinook survival`)) +
  stat_summary(size=1.5)
## No summary function supplied, defaulting to `mean_se()`

OK, now that we have this, let’s apply the same strategy to Cichlid habitat preferences that vary by genotypes.

#Load and visualize data
cichlid <- read_csv("data/05/12q09Cichlids.csv")

ggplot(data=cichlid, mapping=aes(x=Genotype, y=preference)) +
  ____() 

## test assumptions
cichlid <- cichlid %>%
  group_by(____) %>%
  mutate(resid = 
          preference - mean(preference)) %>%
  ungroup()

#qq
qqnorm(cichlid$____)
qqline(cichlid$____)

shapiro.test(cichlid$____)

#put it to the test!
t.test(____ ~ ____, data = ____,
       var.equal = FALSE)
 
ggplot(data=cichlid, mapping=aes(x=____,y=____)) +
  stat_summary(size=1.5)

And now for how monogamy influences testes size

#Load and visualize data
monogomy <- ____("data/05/12q05MonogamousTestes.csv")

ggplot(data=____, mapping=aes(x=`Column 1` , y=`Testes area`)) +
  ____() 

## test assumptions
monogomy <- monogomy %>%
  group_by(____) %>%
  ____(resid = 
          `Testes area` - ____(____)) %>%
  ungroup()

#qq
____(____)
____(____)

shapiro.test(____)

#put it to the test!
t.test(____ ~ ____ , data = monogomy,
       var.equal = ____)
 
ggplot(data=____, mapping=aes(x=____ ,y=____)) +
  ____(size=1.5)

4. T and Power

Trying to think about power analysis by simulation can be a daunting prospect. However, for any type of modeling endeavour, you can think about the various steps you’ll need to execute in order to come up with a straightforward workflow. You know that for any set of parameters you are monkeying with, you’ll need to:

  1. Simulate a data set with those parameters.
  2. Fit a model to that simulated dataset and extract a p-value.
  3. Execute steps 1-2 some number of times.
  4. From those p-values, calculate power (1 - the proportion of p-values above a prespecified \(\alpha\)).

For each of the above steps, we can write a nice simple function! And then use some dplyr magic to put the whole effort together. Let’s walk through this, piece by piece.

4.1 Simulating Data with a Function

The basic underlying model of a t-test is \(y_{ij} \sim N(\mu_i, \sigma)\), where \(i\) is the group identity. We know there are only two means. So, we can whip up a simple function that returns a data frame with two groups, and data for each at some pre-specified sample size.

make_t_data <- function(m1, m2, s, n){
  #make a data frame, repeating treatments n number of times
  #and use rnorm to get values
  data.frame(treatment = c(rep("A", n), rep("B", n)),
             value = rnorm(n*2, mean = c(rep(m1,n), rep(m2, n)), s))
}

To test if it works, you can assign values to each of the parameters and run the guts of the function line by line. We can also just call it.

make_t_data(m1 = 1, m2 = 2, s = 1, n = 3)
##   treatment      value
## 1         A 0.09843274
## 2         A 1.13514442
## 3         A 0.36102575
## 4         B 2.54367201
## 5         B 2.07403517
## 6         B 1.20845344

4.2 Fit a model and extract a p-value

OK, our next function needs to take a data set, and instead of all of the t-output, return the p-value of a t-test. If we look at the t.test helpfile, we can see that the output of any t.test is a list with an entry called p.value. So, we can write a simple extraction function, given the dataset from above.

get_p_from_t_test <- function(sim_data){
  #run the t test on the data
  test <- t.test(value ~ treatment, data = sim_data)
  test$p.value
}

Let’s test that out.

get_p_from_t_test(make_t_data(m1 = 1, m2 = 2, s = 1, n = 3))
## [1] 0.4498394

4.3 Repeat some number of times

This is great! Now, we know we want some huge vector of p-values - say, 100 simulated values. For that, we’ve already introduced the replicate() function. Indeed, we can take the above statement and show how it works with replicate (here, with just 10 replicates).

replicate(10,
          get_p_from_t_test(make_t_data(m1 = 1, m2 = 2, s = 1, n = 3)))
##  [1] 0.11928459 0.26585015 0.74215331 0.97767122 0.37594727 0.64794243
##  [7] 0.31948863 0.11328838 0.22658170 0.05680045

The nice thing about this approach is that, rather than generating a lot of data sets and then testing each one individually - which would eat a lot of memory with a large number of simulations - we do things one at a time, so we’re only limited by processor power (and could later multithread if we wanted).

We can also eyeball the above statement and see what the power is visually: what fraction of the p-values are wrong? For a tiny sample size, and a SD the same size as the difference between the means, we should see that our test has low power (i.e., most p-values will be larger than some pre-specified low value of alpha, even though we should be rejecting our null). If that wasn’t the case, you’d want to go back and see what went wrong.

Now, the above statement is, indeed, something we could wrap into a function quite nicely. A function for calculating power, which brings us to….

4.4 Calculating Power from a vector of p-values.

OK, we have the parameters we need for simulations. We also need some number of simulations - let’s call that nsims - and some pre-specified alpha in order to calculate power as 1 - (# of p > alpha)/nsims. Let’s sketch out a brief function in comments, giving default values for nsims and alpha.

get_t_power <- function(m1, m2, s, n, nsims = 100, alpha = 0.07){
  #get a vector of p values
  
  #calculate the number of p values that are incorrect given
  #that we should be rejecting the null
  
  #return power
}

Put that way, we can fill in the above function with things we have already discussed. Remember, for booleans, TRUE = 1 and FALSE = 0.

get_t_power <- function(m1, m2, s, n, nsims = 100, alpha = 0.07){
  #get a vector of p values
  p <- replicate(nsims,
          get_p_from_t_test(make_t_data(m1, m2, s, n)))
  
  #calculate the number of p values that are incorrect given
  #that we should be rejecting the null
  num_wrong <- sum(p > alpha)
  
  #return power
  1 - num_wrong/nsims
}

We can again test and make sure this produces sensible results.

get_t_power(m1 = 1, m2 = 2, s = 1, n = 3, nsims = 100, alpha=0.07)
## [1] 0.13

Yeah, that’s not great power - nor should it be! We can test again, just to make sure, by putting in something that should have high power.

get_t_power(m1 = 1, m2 = 5, s = 1, n = 3, nsims = 100, alpha=0.07)
## [1] 0.94

Much better. Now… let’s put the whole shebang together.
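Before we do, one more sanity check: base R’s power.t.test() computes two-sample t-test power analytically, so we can compare it against our simulated values (note its default sig.level is 0.05, versus the alpha = 0.07 we used above, so the numbers will differ a bit):

```r
# analytic power for the low-power scenario: difference of 1, sd of 1, n = 3 per group
power.t.test(n = 3, delta = 1, sd = 1)$power

# and for the high-power scenario: difference of 4
power.t.test(n = 3, delta = 4, sd = 1)$power
```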

Power analysis!

Let’s make a data frame where we look at a few effect sizes (difference between means), standard deviations, and sample sizes. We’ll use the tidyr function crossing() which gets all possible combinations of parameter values, and then dplyr::rowwise() to iterate through our set of parameters, one by one. Note, this might take a little while.

#power!
library(tidyr)
library(dplyr)

pow_df <- crossing(diff = 1:5, s = 1:5, n = 5:10) %>%
  rowwise() %>%
  mutate(power = get_t_power(m1 = 0, m2 = diff, s = s, n = n, nsims=100, alpha=0.05)) %>%
  ungroup()

Now we can visualize this using ggplot2. For funsies, let’s use the beyonce set of color palettes. YAS KWEEN!

library(ggplot2)

#devtools::install_github("dill/beyonce")
library(beyonce)


ggplot(pow_df, aes(x=n, y = power, color = factor(diff))) +
  geom_point() +
  geom_line() +
  facet_wrap(~s) +
  scale_color_manual(values = beyonce_palette(79)) 

NICE!

EXERCISE: Try analyses that vary alpha, n, and the effect size. Or get crazy and vary those three and the SD.