Linear Regression and Frequentist Hypothesis Testing

https://xkcd.com/882/

Etherpad

https://etherpad.wikimedia.org/p/607-lm-eval-2020

Putting Linear Regression Into Practice with Pufferfish

Pufferfish are toxic/harmful to predators
Batesian mimics gain protection from predation - why?
Evolved response to appearance?
Researchers tested with mimics varying in toxic pufferfish resemblance

Question of the day: Does Resembling a Pufferfish Reduce Predator Visits?

Testing Our Models

How do we Know
Evaluating a Null Hypothesis.
Null Hypothesis Significance Testing: Friend or Foe of Science?
Testing Linear Models

So... how do you draw conclusions from an experiment or observation?

Inductive v. Deductive Reasoning

Deductive Inference: A larger theory is used to devise many small tests.

Inductive Inference: Small pieces of evidence are used to shape a larger theory and degree of belief.

Applying Different Styles of Inference

Null Hypothesis Testing: What's the probability that things are not influencing our data?
- Deductive
Cross-Validation: How good are you at predicting new data?
- Deductive
Model Comparison: Comparison of alternate hypotheses
- Deductive or Inductive
Probabilistic Inference: What's our degree of belief in a hypothesis, given the data?
- Inductive

Null Hypothesis Testing is a Form of Deductive Inference

Falsification of hypotheses is key!

A theory should be considered scientific if, and only if, it is falsifiable.

Look at a whole research program and falsify auxiliary hypotheses

A Bigger View of Deductive Inference

https://plato.stanford.edu/entries/lakatos/#ImprPoppScie

Reifying Refutation - What is the probability something is false?

Suppose our hypothesis were that the resemblance-predator relationship is 2:1 (a slope of 2). The SE of our slope estimate is 0.57, so we have a distribution of the slopes we could observe if that hypothesis were true.

BUT - our estimated slope is 3.

To falsify the 2:1 hypothesis, we need to know the probability of observing 3, or something GREATER than 3.

If we ran this experiment again and again, what is the probability of observing what we saw, or something more extreme? (This is the frequentist view!)

Probability = 0.04
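
The 0.04 can be roughly reproduced with a short sketch. It assumes the sampling distribution of the slope is approximately normal, centered on the hypothesized slope of 2 with the SE of 0.57 quoted earlier. The choice of Python and the variable names are illustrative, not part of the original analysis.

```python
from statistics import NormalDist

# Values from the slides: hypothesized slope, its standard error, observed slope.
hypothesized_slope = 2.0
se_slope = 0.57
observed_slope = 3.0

# Assumption: the estimate is approximately normal around the hypothesized value.
null_dist = NormalDist(mu=hypothesized_slope, sigma=se_slope)

# Upper-tail probability: P(estimate >= 3 | true slope is 2).
p_upper = 1 - null_dist.cdf(observed_slope)
print(round(p_upper, 2))
```

This is a one-tailed calculation, matching the "3, or something GREATER than 3" wording.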

Null hypothesis testing asks: what is the probability of our observation, or a more extreme observation, given that some null expectation is true?

(It is NOT the probability of any particular alternate hypothesis being true.)

R.A. Fisher and The P-Value For Null Hypotheses

P-value: The Probability of making an observation or more extreme observation given that the null hypothesis is true.

Applying Fisher: Evaluation of a Test Statistic

We use our data to calculate a test statistic that maps to a value of the null distribution.

We can then calculate the probability of observing our data, or of observing data even more extreme, given that the null hypothesis is true.

$\large P(X \leq Data | H_{0})$

Problems with P

Most people don't understand it.
- See the American Statistical Association's recent statements

We don't know how to talk about it
Interpretation of P-values as confirmation of an alternative hypothesis
Like SE, it gets smaller with sample size!
Misuse of setting a threshold for rejecting a hypothesis
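
The sample-size point can be illustrated with a minimal sketch. It assumes a two-sided z-test with a known SD, and a hypothetical fixed difference of 0.2 SD from the null value; the function name and all numbers are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(effect, sd, n):
    """Two-sided z-test p-value when the sample mean equals `effect` exactly."""
    z = effect / (sd / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: the same small difference, observed at three sample sizes.
p_values = {n: two_sided_p(effect=0.2, sd=1.0, n=n) for n in (10, 100, 1000)}
for n, p in p_values.items():
    print(n, p)
```

The effect is identical in every row; only the sample size changes, and the p-value shrinks with it.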

Neyman-Pearson Hypothesis Testing and Decision Making: What if you have to make a choice?

Jerzy Neyman

Egon Pearson

Neyman-Pearson Null Hypothesis Significance Testing

For industrial quality control, NHST was introduced to establish cutoffs of reasonable p, called $\alpha$
This corresponds to confidence intervals: 1 - $\alpha$ = CI of interest
This has become weaponized so that $\alpha = 0.05$ has become a norm... and often determines whether something is deemed worthy of publication
- Chilling effect on science

NHST in a nutshell

Establish a critical threshold below which one rejects a null hypothesis - $\alpha$. A priori reasoning sets this threshold.
Neyman and Pearson state that if p $\le$ $\alpha$ then we reject the null.
- Think about this in a quality control setting - it's GREAT!
Fisher suggested a typical $\alpha$ of 0.05 as indicating statistical significance, although he eschewed the idea of a decision procedure where a null is abandoned.
- Codified by the FDA for testing!

But... Statistical Significance is NOT Biological Significance.

Types of Errors in a NHST framework

| | $H_0$ is True | $H_0$ is False |
|---|---|---|
| Reject $H_0$ | Type I Error | Correct or Type S Error |
| Fail to Reject $H_0$ | - | Type II Error |

Probability of Type I error regulated by choice of $\alpha$
Probability of Type II error regulated by choice of $\beta$
Probability of Type S error is called $\delta$

Type I & II Error

Power of a Test

If $\beta$ is the probability of committing a type II error, 1 - $\beta$ is the power of a test.
The higher the power, the less of a chance of committing a type II error.
We often want a power of 0.8 or higher. (20% chance of failing to reject a false null)

$\alpha = 0.05$ & $\beta = 0.20$

5% Chance of Falsely Rejecting the Null, 20% Chance of Failing to Reject a False Null

Are you comfortable with this? Why or why not?

What is Power, Anyway?

Given that we often begin by setting our acceptable $\alpha$ , how do we then determine $\beta$ given our sample design?

Formula for a specific test, using sample size, effect size, etc.
Simulate many samples, and see how often we get the wrong answer assuming a given $\alpha$ !
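
The simulation route can be sketched as follows. This is a simplified stand-in, not the course's own code: it assumes normal data with a known SD of 1 and a two-sided z-test of a mean rather than a regression slope, and the effect size, sample size, seed, and function name are all hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def rejection_rate(true_mean, n, alpha=0.05, n_sims=2000, seed=607):
    """Fraction of simulated samples in which a two-sided z-test rejects H0: mean = 0.

    Assumes normal data with a known SD of 1 (a simplification for illustration).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = fmean(sample) / (1.0 / sqrt(n))
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        if p <= alpha:
            rejections += 1
    return rejections / n_sims

power = rejection_rate(true_mean=0.5, n=30)        # null is false: rate estimates power
type_i_rate = rejection_rate(true_mean=0.0, n=30)  # null is true: rate should sit near alpha
beta = 1 - power
print(power, beta, type_i_rate)
```

Changing `n` or `true_mean` and re-running shows how the sample design moves $\beta$ while $\alpha$ stays fixed.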

How do you talk about results from a p-value?

Based on your experimental design, what is a reasonable range of p-values to expect if the null is false?
Smaller p values indicate stronger support for rejection, larger ones weaker. Use that language! Not significance!
Accumulate multiple lines of evidence so that the entire edifice of your research does not rest on a single p-value!!!!

For example, what does p = 0.061 mean?

There is a 6.1% chance of obtaining the observed data or more extreme data given that the null hypothesis is true.
If you reject the null whenever p is this small, you will falsely reject a true null about 1 time in 16
Are you comfortable with that?
OR - What other evidence would you need to make you more or less comfortable?

How I talk about p-values

At different p-values, based on your study design, you will have different levels of confidence about rejecting your null. For example, based on the design of one study...
A p-value of less than 0.0001 means I have very high confidence in rejecting the null
A p-value of 0.01 means I have high confidence in rejecting the null
A p-value between 0.05 and 0.1 means I have some confidence in rejecting the null
A p-value of > 0.1 means I have low confidence in rejecting the null

My Guiding Light

Why we need to be careful (because power!)

In the search for the Higgs Boson, scientists studied billions of particles
They used a "five-sigma" threshold (i.e., an observation more than 5 SD from the null expectation)
This corresponds to an $\alpha$ of about 0.0000003
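
The five-sigma figure can be checked with a small sketch, assuming a standard normal null distribution and a one-sided tail (the helper name is illustrative):

```python
from math import erfc, sqrt

def upper_tail(n_sigma):
    """P(Z >= n_sigma) for a standard normal Z, via the complementary error function."""
    return 0.5 * erfc(n_sigma / sqrt(2))

alpha_five_sigma = upper_tail(5)
print(alpha_five_sigma)  # about 2.9e-07, i.e. roughly 0.0000003
```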

Problems with NHST When Used by Humans

Take home: p-values are great for some kinds of inference. But, they only provide one answer at a time. Use carefully, and never depend on one p-value alone!

Common Regression Test Statistics

Does my model explain variability in the data?
- Null Hypothesis: The ratio of variability from your predictors versus noise is 1
- Test Statistic: F (the F distribution describes the ratio of two variances)
Are my coefficients not 0?
- Null Hypothesis: Coefficients are 0
- Test Statistic: t (the T distribution is a normal distribution modified for low sample size)

Does my model explain variability in the data?

Ho = The model predicts no variation in the data.

Ha = The model predicts variation in the data.

To evaluate these hypotheses, we need to have a measure of variation explained by the model versus error - the sums of squares!

$SS_{Total} = SS_{Regression} + SS_{Error}$

Sums of Squares of Error, Visually

Sums of Squares of Regression, Visually

Distance from $\hat{y}$ to $\bar{y}$

Components of the Total Sums of Squares

$SS_{R} = \sum(\hat{Y}_{i} - \bar{Y})^{2}$ , df=1

$SS_{E} = \sum(Y_{i} - \hat{Y}_{i})^2$ , df=n-2

To compare them, we need to correct for different DF. This is the Mean Square.

MS = SS/DF

e.g., $MS_{E} = \frac{SS_{E}}{n-2}$
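
A minimal sketch of these quantities, computed by hand for a least-squares line. The data are hypothetical and invented purely for illustration; they are not the pufferfish data.

```python
from statistics import fmean

# Hypothetical data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(x)

# Ordinary least-squares fit of y on x.
x_bar, y_bar = fmean(x), fmean(y)
ss_x = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

ss_r = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression, df = 1
ss_e = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error, df = n - 2
ss_total = sum((yi - y_bar) ** 2 for yi in y)

ms_r = ss_r / 1        # mean square for regression
ms_e = ss_e / (n - 2)  # mean square for error
print(ss_r, ss_e, ss_total, ms_r, ms_e)
```

Up to floating-point rounding, `ss_total` equals `ss_r + ss_e`, which is the partition the sums of squares rely on.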

The F Distribution and Ratios of Variances

$F = \frac{MS_{R}}{MS_{E}}$ with DF=1,n-2

F-Test and Pufferfish

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| resemblance | 1 | 255.1532 | 255.153152 | 27.37094 | 5.64e-05 |
| Residuals | 18 | 167.7968 | 9.322047 | NA | NA |

We reject the null hypothesis that resemblance does not explain variability in predator approaches.
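
As a quick check on how the table's columns relate, the F value can be recomputed from the printed sums of squares and degrees of freedom. The variable names are illustrative, and the inputs are the rounded values shown in the table, so the last digits may differ slightly.

```python
# Values copied from the ANOVA table above.
ss_regression, df_regression = 255.1532, 1
ss_error, df_error = 167.7968, 18

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

# F is the ratio of the two mean squares.
f_value = ms_regression / ms_error
print(round(ms_error, 3), round(f_value, 2))
```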

Testing the Coefficients

F-Tests evaluate whether elements of the model contribute to variability in the data
- Are modeled predictors just noise?
- What's the difference between a model with only an intercept and one with an intercept and a slope?

T-tests evaluate whether coefficients are different from 0
Often, F and T agree - but not always
- T can be more sensitive with multiple predictors

xkcd

T-Distributions are What You'd Expect Sampling a Standard Normal Population with a Small Sample Size

t = mean/SE, DF = n-1
It assumes a normal population with mean of 0 and SD of 1
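
A small simulation sketch of this idea: draw many small samples from a standard normal population, compute t = mean/SE for each, and see how often t lands beyond the normal's usual 1.96 cutoff. The sample size, seed, and number of simulations are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import fmean, stdev

rng = random.Random(607)  # arbitrary seed so the sketch is repeatable
n = 5                     # deliberately small sample size (DF = n - 1 = 4)
n_sims = 20000

t_values = []
for _ in range(n_sims):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    se = stdev(sample) / sqrt(n)
    t_values.append(fmean(sample) / se)

# A standard normal puts about 5% of its mass beyond +/-1.96.
frac_beyond = sum(abs(t) > 1.96 for t in t_values) / n_sims
print(frac_beyond)
```

With only 4 DF, noticeably more than 5% of the simulated t values fall beyond 1.96, which is why the t distribution's wider tails matter at low sample size.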

Error in the Slope Estimate

$\Large SE_{b} = \sqrt{\frac{MS_{E}}{SS_{X}}}$

95% CI = $b \pm t_{\alpha/2,df}SE_{b}$

($t_{\alpha/2,df} \approx 1.96$ when N is large)

Assessing the Slope with a T-Test

$\Large t_{b} = \frac{b - \beta_{0}}{SE_{b}}$

DF=n-2

$H_0: \beta_{0} = 0$ , but we can test other hypotheses

Slope of Puffer Relationship (each parameter test uses 1 DF; residual DF = n-2 = 18)

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1.925 | 1.506 | 1.278 | 0.218 |
| resemblance | 2.989 | 0.571 | 5.232 | < 0.001 |

We reject the hypothesis of no slope for resemblance, but fail to reject the hypothesis that the intercept is 0.
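
The slope's t value and a 95% CI can be roughly reconstructed from the printed estimate and standard error. This is a sketch under stated assumptions: the critical value 2.101 is the standard tabled two-sided 95% t quantile for 18 DF, entered by hand rather than computed, and the function and variable names are illustrative.

```python
# Values from the coefficient table above (rounded as printed there).
slope_estimate = 2.989
slope_se = 0.571
df_residual = 18

def t_statistic(estimate, hypothesized, se):
    """t = (b - beta_0) / SE_b for a chosen null value beta_0."""
    return (estimate - hypothesized) / se

t_vs_zero = t_statistic(slope_estimate, 0.0, slope_se)  # the usual H0: slope = 0
t_vs_two = t_statistic(slope_estimate, 2.0, slope_se)   # the earlier 2:1 hypothesis

# Assumption: 2.101 is the tabled two-sided 95% critical t for 18 DF.
t_critical = 2.101
ci_low = slope_estimate - t_critical * slope_se
ci_high = slope_estimate + t_critical * slope_se
print(round(t_vs_zero, 2), round(t_vs_two, 2), round(ci_low, 2), round(ci_high, 2))
```

The t value against zero differs from the table's 5.232 in the third decimal because the inputs here are already rounded.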

So, what can we say in a null hypothesis testing framework?

We reject the null hypothesis that there is no relationship between resemblance and predator visits in our experiment.
About 60% of the variability in predator visits ($R^2 \approx 0.6$) is associated with resemblance.
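
The 0.6 figure is consistent with the sums of squares in the F-test table, taking the proportion of variability explained to be $SS_{R} / SS_{Total}$. A minimal check (variable names are illustrative):

```python
# Sums of squares from the F-test table for the pufferfish model.
ss_regression = 255.1532
ss_error = 167.7968

# Proportion of total variability associated with the predictor.
r_squared = ss_regression / (ss_regression + ss_error)
print(round(r_squared, 3))
```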
