Null Hypothesis Significance Testing & Power

Review

  • Statistical Distributions give us probabilities of observations

  • P-Values allow us to ask \(P(X \le Data | H)\)

  • We often apply p-values to a null hypothesis to assess support for rejecting the null

Today

  1. Null Hypothesis Tests & Other Distributions

  2. Test Statistics: Z-Tests

  3. Null Hypothesis Significance Testing

  4. Statistical Power in a NHST framework

  5. Power via Simulation

R.A. Fisher and The P-Value For Null Hypotheses

P-value: The probability of making an observation, or a more extreme observation, given that the null hypothesis is true.

We’ve Thought About P-Values with Normal Distributions…

The world ain’t normal

The Lognormal Distribution

The Gamma Distribution

Waiting for more events

Longer average time per event

The Poisson Distribution

  • Discrete distribution for count data
  • Think number of mutations in meiosis, or number of flowers in a plot
  • Defined by \(\lambda\), which is both the mean and the variance
  • \(Y \sim P(\lambda)\)

Note how Variance Scales with Mean

When \(\lambda\) is Large, Approximately Normal

The Binomial Distribution

Increasing Probability Shifts Distribution

All of these distributions can be used for Hypothesis Tests

Which distribution when?

| Distribution | Use |
| --- | --- |
| Normal | Continuous variable with many small sources of variation |
| Log-Normal | Continuous variable where sources of variation multiply |
| Gamma | Positive continuous variable with a long-tailed distribution |
| Poisson | Count data, e.g., # of mutations |
| Binomial | Outcome of multiple yes/no trials |

We don’t need no stinkin’ distribution!

  • You can always simulate a reference distribution based on arbitrary rules

  • Then ask what fraction of that reference distribution is as extreme as, or more extreme than, your observation (see the sketch below)
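
A minimal sketch of this idea in Python. The "rule" here (a standard normal null) and the observed value are stand-ins, not from the slides:

```python
# Simulate a reference distribution, then compute an empirical p-value
import numpy as np

rng = np.random.default_rng(42)
reference = rng.normal(loc=0, scale=1, size=10_000)  # simulated null draws

observed = 2.3  # hypothetical observation

# Fraction of the reference distribution at least as extreme (upper tail)
p_value = np.mean(reference >= observed)
print(p_value)  # roughly 0.01
```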

Tapir Snouts

Simulated Distribution

Testing a “Tapir”

p = 0.01042

Today

  1. Null Hypothesis Tests & Other Distibrutions

  2. Test Statistics: Z-Tests

  3. Null Hypothesis Sigificance Testing

  4. Statistical Power in a NHST framework

  5. Power via Simulation

Evaluation of a Test Statistic

We use our data to calculate a test statistic that maps to a value of the null distribution.

We can then calculate the probability of observing our data, or of observing data even more extreme, given that the null hypothesis is true.



\(P(X \leq Data | H_{0})\)

Evaluation of a Test Statistic

Let’s say we know the distribution of chest hair lengths on Welsh Corgis

A short-haired Corgi

You have a Corgi with a chest hair that is 5 cm

A Corgi’s P

What is the probability of this or a shorter-haired Corgi?

p = 0.0228 
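
In code, this is just the lower tail of the population distribution. A minimal sketch, assuming chest hair lengths are Normal with mean 7 cm and SD 1 cm (the values used in the Z-test example below):

```python
# P(chest hair <= 5 cm) under a Normal(mean = 7, SD = 1) population
from scipy.stats import norm

p = norm.cdf(5, loc=7, scale=1)
print(round(p, 4))  # 0.0228
```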

The Z-Test

  • Let’s assume I have a BOX of 15 Corgis. I suspect fraud.

  • I get the average chest hair length of each Corgi - but is my sample different from the population?

  • So, we’re interested in the difference between my sample mean and the population mean: \(\bar{Y} - \mu\)

    \[\Large H_0: \bar{Y} - \mu = 0\]

When & How to use Z-Test

We want to calculate a test statistic (z) & compare it to the standard normal curve (\(\mu = 0, \sigma=1\))

1. I have a known population mean (\(\mu\)) and standard deviation (\(\sigma\))

2. I can calculate the population SE of the mean for a sample of size n, \(\sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}}\)

3. Now calculate a test statistic

\[\Large z = \frac{\bar{Y} - \mu}{\sigma_{\bar{Y}}}\]

Side-Note: The Z Transformation

  • For any variable, subtract the mean and divide by the SD

  • The variable is now in units of SDs from its mean

  • Useful for rescaling variables with very different units (see the sketch below)
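
A minimal sketch of the transformation (the data are hypothetical):

```python
# Z-transform: subtract the mean, divide by the SD
import numpy as np

x = np.array([3.1, 5.0, 4.2, 6.8, 5.5])  # hypothetical measurements
z = (x - x.mean()) / x.std(ddof=1)       # now in SD units from the mean
print(z.round(2))
```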

Our Corgis and Z

  • Corgis: \(\mu = 7, \sigma=1\), \(n=15\), so \(\sigma_{\bar{Y}}=\) 0.258
  • Sample: \(\bar{Y} = 10\), so \(\bar{Y} - \mu = 10 - 7 = 3\)
  • Z = 3/0.258 = 11.619

Our Corgis and Z

  • Z = 3/0.258 = 11.619
  • p = \(3.2997589 \times 10^{-31}\) (reproduced in the sketch below)
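
A minimal sketch reproducing this calculation, assuming a two-tailed test (which matches the p-value above):

```python
# Z-test for the box of Corgis: mu = 7, sigma = 1, n = 15, sample mean = 10
import numpy as np
from scipy.stats import norm

mu, sigma, n, ybar = 7, 1, 15, 10
se = sigma / np.sqrt(n)    # 0.258
z = (ybar - mu) / se       # 11.619
p = 2 * norm.sf(abs(z))    # two-tailed p, ~3.3e-31
print(z, p)
```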

Today

  1. Null Hypothesis Tests & Other Distributions

  2. Test Statistics: Z-Tests

  3. Null Hypothesis Significance Testing

  4. Statistical Power in a NHST framework

  5. Power via Simulation

How do we Use P-Values

Confirmationist: You gather data and look for evidence in support of your research hypothesis. This could be done in various ways, but one standard approach is via statistical significance testing: the goal is to reject a null hypothesis, and then this rejection will supply evidence in favor of your preferred research hypothesis.

Falsificationist: You use your research hypothesis to make specific (probabilistic) predictions and then gather data and perform analyses with the goal of rejecting your hypothesis.

Neyman-Pearson Hypothesis Testing and Decision Making

[Image: Jerzy Neyman]

Neyman-Pearson Hypothesis Testing

1. Establish a critical threshold below which one rejects a null hypothesis - \(\alpha\). A priori reasoning sets this threshold.

2. Neyman and Pearson state that if \(p \le \alpha\) then we reject the null.

3. Fisher pushed for a typical \(\alpha\) of 0.05 as indicating statistical significance, although he eschewed the idea of a decision procedure where a null is abandoned.

Statistical Significance is NOT Biological Significance.


Today

  1. Null Hypothesis Tests & Other Distributions

  2. Test Statistics: Z-Tests

  3. Null Hypothesis Significance Testing

  4. Statistical Power in a NHST framework

  5. Power via Simulation

Types of Errors in a NHST framework

|  | Ho is True | Ho is False |
| --- | --- | --- |
| Reject Ho | Type I Error | Correct OR Type S error |
| Fail to Reject Ho | Correct | Type II Error |


  • Probability of Type I error is regulated by the choice of \(\alpha\)

  • Probability of Type II error is regulated by the choice of \(\beta\)

  • Probability of Type S error is called \(\delta\)

Type S (Sign) Error

Correctly rejecting the null hypothesis for the wrong reason

This is a Type S, or Type III, error - a mistake of sign. It is often inherent in an experiment’s design, or possible by chance.

It can be detected by mechanistic simulation or a redesigned study.

Power of a Test

  • If \(\beta\) is the probability of committing a Type II error, then \(1-\beta\) is the power of a test.
  • The higher the power, the lower the chance of committing a Type II error.
  • We typically want a power of 0.8 or higher (i.e., a 20% chance of failing to reject a false null).

\(\alpha = 0.05\) & \(\beta = 0.20\)

5% Chance of Falsely Rejecting the Null, 20% Chance of Falsely Accepting the Null

Are you comfortable with this? Why or why not?

What is Power, Anyway?

Given that we often begin by setting our acceptable \(\alpha\), how do we then determine \(\beta\) given our sample design?

  • Formula for a specific test, using sample size, effect size, etc. (see the example below)

  • Simulate many samples, and see how often we get the wrong answer assuming a given \(\alpha\)!
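
As an example of the first approach (a standard result, not derived on these slides): for a one-tailed Z-test with known \(\sigma\), power has a closed form,

\[\Large 1 - \beta = \Phi\left(\frac{|\mu_1 - \mu_0|\sqrt{n}}{\sigma} - z_{1-\alpha}\right)\]

where \(\Phi\) is the standard normal CDF, \(\mu_1 - \mu_0\) is the effect size, and \(z_{1-\alpha}\) is the critical value corresponding to your chosen \(\alpha\).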

Today

  1. Null Hypothesis Tests & Other Distributions

  2. Test Statistics: Z-Tests

  3. Null Hypothesis Significance Testing

  4. Statistical Power in a NHST framework

  5. Power via Simulation

Power via Simulation Is a Process

  1. Determine your null hypothesis and a model that describes how the world really works

  2. Draw a simulated sample based on your model of the world and sampling choices (e.g., sample size, additional confounding factors, sample spatial arrangement, etc.)

  3. Calculate the p-value for your one simulated sample
     - Repeat simulated p-value generation thousands of times

  4. Determine the fraction of times you failed to reject the null based on your choice of \(\alpha\). This is \(\beta\).
     - \(1-\beta\) is your power!

Example of Sample Size and Power via Simulation

We want to test the effects of a drug on heart rate. We know that the average heart rate of our patients is 80 beats per minute and the SD is 6 BPM.

How large must our sample size be, assuming that we have an \(\alpha\) of 0.05 and want our power to be at least 0.8?

First! What is \(H_0\)? How would I turn this into a Z-test?

Simulating the Real World

Second! Let’s say we assume that the drug actually speeds up heart rate by 13 beats per minute, on average. This is our effect size. We can generate simulated means from trials with different sample sizes (1-10) with 500 replicate draws for each sample size

Obtain P-values from your Simulations

Third! Get the p-value of each simulation with a Z-test:

\[\Large z = \frac{\bar{Y} - 80}{6 / \sqrt{n}}\]

Turn Many P-Values into Power!

Fourth! Power is 1 minus the fraction of those tests for which \(p > \alpha\), i.e., the fraction for which \(p \le \alpha\). So, we group by each sample size to get…
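
A minimal sketch putting all the steps together in Python (the variable names, and the choice of an upper-tailed test since the drug is expected to raise heart rate, are assumptions):

```python
# Power via simulation for the heart rate example
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

mu_null, sigma = 80, 6   # known population mean and SD (BPM)
effect = 13              # assumed true drug effect (BPM)
alpha = 0.05
n_sims = 500             # replicate draws per sample size

for n in range(1, 11):   # sample sizes 1-10
    se = sigma / np.sqrt(n)
    # Second! Simulated sample means drawn from the "drugged" world
    sample_means = rng.normal(mu_null + effect, se, n_sims)
    # Third! Z-test each simulated mean against the null mean of 80
    z = (sample_means - mu_null) / se
    p = norm.sf(z)       # upper-tail p-values
    # Fourth! Power = fraction of simulations that reject the null
    power = np.mean(p <= alpha)
    print(f"n = {n:2d}: power = {power:.2f}")
```

With an effect this large relative to the SD, the simulated power should exceed 0.8 even at very small sample sizes.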

Tradeoff between \(\alpha\) and \(\beta\)