How do we Know
Evaluating a Null Hypothesis.
Null Hypothesis Significance Testing: Friend or Foe of Science?
Testing Linear Models
Deductive Inference: A larger theory is used to devise many small tests.
Inductive Inference: Small pieces of evidence are used to shape a larger theory and degree of belief.
Null Hypothesis Testing: What's the probability that things are not influencing our data?
Cross-Validation: How good are you at predicting new data?
Model Comparison: Comparison of alternate hypotheses
Probabilistic Inference: What's our degree of belief in a hypothesis, given our data?
Falsification of hypotheses is key!
A theory should be considered scientific if, and only if, it is falsifiable.
Look at a whole research program and falsify auxiliary hypotheses
https://plato.stanford.edu/entries/lakatos/#ImprPoppScie
How do we Know
Evaluating a Null Hypothesis.
Null Hypothesis Significance Testing: Friend or Foe of Science?
Testing Linear Models
What if our hypothesis were that the resemblance-predator relationship was 2:1? The SE of our estimate is 0.57, so we have a distribution of what we could observe.
BUT - our estimated slope is 3.
We want to know: if we did this experiment again and again, what's the probability of observing what we saw, or something more extreme? (frequentist!)
Probability = 0.04
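A minimal sketch of that calculation in R, assuming a normal sampling distribution for the slope estimate (hypothesized slope of 2, observed slope of 3, and SE of 0.57 from above):

```r
# Upper-tail probability of estimating a slope of 3 or greater
# when the hypothesized slope is 2 and the SE of the estimate is 0.57
# (assumes a normal sampling distribution for the slope estimate)
pnorm(3, mean = 2, sd = 0.57, lower.tail = FALSE)
#> ~ 0.04
```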
P-value: the probability of obtaining the observed result, or a more extreme one, given that the null hypothesis is true.
We use our data to calculate a test statistic that maps to a value of the null distribution.
We can then calculate the probability of observing our data, or of observing data even more extreme, given that the null hypothesis is true.
$P(X \geq \mathrm{Data} \mid H_0)$
Interpretation of P-values as confirmation of an alternative hypothesis
Like SE, it gets smaller with sample size!
Misuse of setting a threshold for rejecting a hypothesis
How do we Know
Evaluating a Null Hypothesis.
Null Hypothesis Significance Testing: Friend or Foe of Science?
Testing Linear Models
Jerzy Neyman
Egon Pearson
For industrial quality control, NHST was introduced to establish a cutoff for a reasonable p, called α
This corresponds to confidence intervals: $1 - \alpha$ = the CI of interest
This has become weaponized, so that α = 0.05 has become the norm... and often determines whether something is deemed worthy of publication
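As a quick illustration in R (using the conventional α = 0.05, which is an assumption here, not a recommendation) of the correspondence between α and the CI of interest:

```r
alpha <- 0.05
1 - alpha             # the confidence level of interest: 0.95
qnorm(1 - alpha / 2)  # two-tailed normal critical value: ~1.96
```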
Establish a critical threshold below which one rejects a null hypothesis - α. A priori reasoning sets this threshold.
Neyman and Pearson state that if p ≤ α then we reject the null.
Fisher suggested a typical α of 0.05 as indicating statistical significance, although he eschewed the idea of a decision procedure in which a null is abandoned.
|  | Ho is True | Ho is False |
|---|---|---|
| Reject Ho | Type I Error | Correct or Type S Error |
| Fail to Reject Ho | - | Type II Error |
Probability of Type I error regulated by choice of α
Probability of Type II error regulated by choice of β
Probability of Type S error is called δ
If β is the probability of committing a type II error, 1 - β is the power of a test.
The higher the power, the less of a chance of committing a type II error.
We often want a power of 0.8 or higher. (20% chance of failing to reject a false null)
5% Chance of Falsely Rejecting the Null, 20% Chance of Falsely Accepting the Null
Are you comfortable with this? Why or why not?
Given that we often begin by setting our acceptable α, how do we then determine β given our sample design?
Formula for a specific test, using sample size, effect size, etc.
Simulate many samples, and see how often we get the wrong answer assuming a given α!
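A minimal sketch of that simulation approach in R; the sample size, true slope, and residual standard deviation below are made-up illustrative values, not the lecture's data:

```r
set.seed(607)

# Power by simulation: how often do we (correctly) reject the null
# at a given alpha when the true slope is nonzero?
sim_power <- function(n = 20, slope = 1, sigma = 3, alpha = 0.05, nsims = 1000) {
  p_vals <- replicate(nsims, {
    x <- runif(n, 0, 5)
    y <- slope * x + rnorm(n, sd = sigma)
    summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]  # p-value for the slope
  })
  mean(p_vals < alpha)  # power = 1 - beta
}

sim_power()
```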
Based on your experimental design, what is a reasonable range of p-values to expect if the null is false?
Smaller p-values indicate stronger support for rejection, larger ones weaker. Use that language, not "significance"!
Accumulate multiple lines of evidence so that the entire edifice of your research does not rest on a single p-value!
There is a 6.1% chance of obtaining the observed data or more extreme data given that the null hypothesis is true.
If you choose to reject the null, you have a ~ 1 in 16 chance of being wrong
Are you comfortable with that?
OR - What other evidence would you need to make you more or less comfortable?
At different p-values, based on your study design, you will have different levels of confidence about rejecting your null. For example, based on the design of one study...
A p-value of less than 0.0001 means I have very high confidence in rejecting the null
A p-value of 0.01 means I have high confidence in rejecting the null
A p-value between 0.05 and 0.1 means I have some confidence in rejecting the null
A p-value greater than 0.1 means I have low confidence in rejecting the null
In the search for the Higgs Boson, scientists studied billions of particles
They used a "five-sigma" threshold (i.e., an observation more than 5 SD beyond the null expectation)
This corresponds to an α of 0.0000003
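A one-line check in R of where that α comes from: the one-tailed probability of a standard normal observation beyond 5 SD.

```r
# One-tailed probability of exceeding 5 standard deviations on a standard normal
pnorm(5, lower.tail = FALSE)
#> 2.866516e-07, i.e. about 0.0000003
```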
How do we Know
Evaluating a Null Hypothesis.
Null Hypothesis Significance Testing: Friend or Foe of Science?
Testing Linear Models
Does my model explain variability in the data?
Are my coefficients not 0?
Ho: The model predicts no variation in the data.
Ha: The model predicts variation in the data.
To evaluate these hypotheses, we need a measure of variation explained by the model versus error: the sums of squares. $SS_{Total} = SS_{Regression} + SS_{Error}$
Distance from $\hat{y}$ to $\bar{y}$
$SS_R = \sum(\hat{Y}_i - \bar{Y})^2$, df = 1
$SS_E = \sum(Y_i - \hat{Y}_i)^2$, df = n - 2
To compare them, we need to correct for the different df. This is the Mean Square.
$MS = \frac{SS}{df}$, e.g., $MS_E = \frac{SS_E}{n-2}$
$F = \frac{MS_R}{MS_E}$ with df = 1, n-2
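As an illustration, a minimal R sketch of that decomposition using made-up toy data (the variable names and values are not the lecture's):

```r
# Made-up toy data, just to show the sums-of-squares decomposition by hand
set.seed(1)
x <- runif(20, 1, 5)
y <- 2 + 3 * x + rnorm(20, sd = 3)
fit <- lm(y ~ x)

SSR <- sum((fitted(fit) - mean(y))^2)  # regression SS, df = 1
SSE <- sum(residuals(fit)^2)           # error SS, df = n - 2
MSR <- SSR / 1
MSE <- SSE / (length(y) - 2)
c(F_by_hand = MSR / MSE, F_from_anova = anova(fit)[1, "F value"])
```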
|  | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| resemblance | 1 | 255.1532 | 255.153152 | 27.37094 | 5.64e-05 |
| Residuals | 18 | 167.7968 | 9.322047 | NA | NA |
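The F value and p-value in the table follow directly from the mean squares; a quick check in R using the numbers shown in the table:

```r
F_val <- 255.153152 / 9.322047                    # MS_regression / MS_error ~ 27.37
pf(F_val, df1 = 1, df2 = 18, lower.tail = FALSE)  # ~ 5.64e-05
```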
We reject the null hypothesis that resemblance does not explain variability in predator approaches
T-tests evaluate whether coefficients are different from 0
Often, F and T agree - but not always
[figure: xkcd comic]
$SE_b = \sqrt{\frac{MS_E}{SS_X}}$
(critical value ~ 1.96 when N is large)
$t_b = \frac{b - \beta_0}{SE_b}$
$H_0: \beta_0 = 0$, but we can test other hypotheses
|  | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1.925 | 1.506 | 1.278 | 0.218 |
| resemblance | 2.989 | 0.571 | 5.232 | 0.000 |
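The t value and p-value for the slope can be reproduced from the estimate and its SE in the table above (with the residual df of 18 from the ANOVA table):

```r
t_val <- (2.989 - 0) / 0.571                     # (b - beta_0) / SE_b ~ 5.23
2 * pt(abs(t_val), df = 18, lower.tail = FALSE)  # two-tailed p ~ 5.6e-05
```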
We reject the hypothesis of no slope for resemblance, but fail to reject it for the intercept.
We reject that there is no relationship between resemblance and predator visits in our experiment.
About 60% of the variability in predator visits is associated with resemblance ($R^2 \approx 0.6$).
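That 0.6 is the $R^2$, recoverable from the sums of squares in the ANOVA table above:

```r
SS_regression <- 255.1532
SS_error      <- 167.7968
SS_regression / (SS_regression + SS_error)  # R^2 ~ 0.60
```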