class: middle, center background-position: center background-size: cover # Sample Distributions <br> ![](images/03/barbie_sample_distribution_mean.jpg) --- # Outline 1. What is a Good Sample? 2. Sample Distributions and Parameters 3. Non-parameteric Properties of Samples and Populations --- class:center # What is a population? <img src="03_sampling_lecture_x_files/figure-html/population-1.png" style="display: block; margin: auto;" /> **Population** = All Individuals --- class: center, middle # Population .pull-left[ <img src="03_sampling_lecture_x_files/figure-html/population-1.png" style="display: block; margin: auto;" /> ] .pull-right[ <img src="03_sampling_lecture_x_files/figure-html/normplot3-1.png" style="display: block; margin: auto;" /> ] --- class: center # What is a sample? <img src="03_sampling_lecture_x_files/figure-html/sampleSpread-1.png" style="display: block; margin: auto;" /> -- A **sample** of individuals in a randomly distributed population. --- class: center, middle # Population .pull-left[ <img src="03_sampling_lecture_x_files/figure-html/sampleSpread-1.png" style="display: block; margin: auto;" /> ] .pull-right[ <img src="03_sampling_lecture_x_files/figure-html/normplotsamp-1.png" style="display: block; margin: auto;" /> ] --- # Properties of a good sample 1. Validity - Yes, this is a measure of what I am interested in 2. Reliability - If I sample again, I'll get something similar 3. Representative - Sample reflects the population - Unbiased --- # Validity: Is it measuring what I think it's measuring? .center[![:scale 75%](./images/03/23andme_health_ancestry_kit.jpg)] --- # Reliability: Is my sample/measure repeatable? .center[![](./images/03/painscale.jpg)] --- # Are You Representative? Or Biased from Unequal Representation <img src="03_sampling_lecture_x_files/figure-html/spatialBias-1.png" style="display: block; margin: auto;" /> If you only chose individuals from the bottom, you would only get one range of sizes. --- class: center, middle ## The key for most statistical models is that a replicates in a sample are i.i.d. <br> ## Independent and Identically Distributed --- # Exercise: <br><br> ### 1. What is a population you sample? <br><br> ### 2. How do you ensure validity, reliability, and representativeness of a sample from your population? --- # Outline 1. What is a Good Sample? 2. .red[Sample Distributions and Parameters] 3. Non-parameteric Properties of Samples and Populations --- # Taking a Descriptive <img src="03_sampling_lecture_x_files/figure-html/sampleSpread-1.png" style="display: block; margin: auto;" /> <center>How big are individuals in this population? --- # Our 'Sample' <table class="table" style="margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:right;"> 41.11041 </td> <td style="text-align:right;"> 41.85113 </td> <td style="text-align:right;"> 46.56152 </td> <td style="text-align:right;"> 60.69390 </td> <td style="text-align:right;"> 47.13263 </td> </tr> <tr> <td style="text-align:right;"> 42.11062 </td> <td style="text-align:right;"> 43.02022 </td> <td style="text-align:right;"> 51.24369 </td> <td style="text-align:right;"> 40.91309 </td> <td style="text-align:right;"> 47.22189 </td> </tr> <tr> <td style="text-align:right;"> 48.31628 </td> <td style="text-align:right;"> 46.86495 </td> <td style="text-align:right;"> 40.79575 </td> <td style="text-align:right;"> 34.11458 </td> <td style="text-align:right;"> 48.15706 </td> </tr> <tr> <td style="text-align:right;"> 45.44011 </td> <td style="text-align:right;"> 47.03553 </td> <td style="text-align:right;"> 46.86209 </td> <td style="text-align:right;"> 55.74212 </td> <td style="text-align:right;"> 44.90765 </td> </tr> <tr> <td style="text-align:right;"> 51.28539 </td> <td style="text-align:right;"> 38.87441 </td> <td style="text-align:right;"> 31.41510 </td> <td style="text-align:right;"> 49.46008 </td> <td style="text-align:right;"> 34.67087 </td> </tr> </tbody> </table> --- # Visualizing Our Sample as Counts with 20 Bins <img src="03_sampling_lecture_x_files/figure-html/sampPlot-1.png" style="display: block; margin: auto;" /> --- # Visualizing Our Sample as Frequencies <img src="03_sampling_lecture_x_files/figure-html/sampPlotFreq-1.png" style="display: block; margin: auto;" /> -- Frequency = % of Sample with that Value --- # Visualizing Our Sample as Frequencies <img src="03_sampling_lecture_x_files/figure-html/sampPlotFreq-1.png" style="display: block; margin: auto;" /> Frequency = Probability of Drawing that Value from a Sample --- # Visualizing Our Sample as Continuous Probability Densities <img src="03_sampling_lecture_x_files/figure-html/sampPlotDens-1.png" style="display: block; margin: auto;" /> Probability Density = Relative likleihood of encountering a value. Sums to One. --- class: center, middle # Our goal is to get samples representative of a population, and estimate population parameters. We assume a **distribution** of values to the population from which we then estimate parameters. Makes it easier to describe nature! --- # A Population versus Sample Distribution: Normal <img src="03_sampling_lecture_x_files/figure-html/normplot-1.png" style="display: block; margin: auto;" /> --- # The Normal (Gaussian) Distribution <img src="03_sampling_lecture_x_files/figure-html/normplot2-1.png" style="display: block; margin: auto;" /> - Arises from some deterministic value and many small additive deviations. - Variability is additive over time. - VERY common. --- class: center # Understanding Gaussian Distributions with a Galton Board (Quinqunx) <video controls loop><source src="03_sampling_lecture_x_files/figure-html/quincunx.webm" /></video> --- # We see this pattern everywhere - the Random or Drunkard's Walk aka Brownian Motion <img src="03_sampling_lecture_x_files/figure-html/walk-1.png" style="display: block; margin: auto;" /> --- # A Normal Result for Final Position <img src="03_sampling_lecture_x_files/figure-html/final_walk-1.png" style="display: block; margin: auto;" /> --- # Other Distributions Possible: Lognormal and Multiplicative error <img src="03_sampling_lecture_x_files/figure-html/lnormplot-1.png" style="display: block; margin: auto;" /> - e.g., `\(N_t = \lambda N_{t-1}, \lambda \sim \mathcal{N}(\mu, \sigma)\)` - Error compounds over time --- # Other Distributions: Binomial from Binary Yes/Nos <img src="03_sampling_lecture_x_files/figure-html/binomplot-1.png" style="display: block; margin: auto;" /> - Discrete, not continuous. - From underlying 1/0 trials, e.g., drug success, probability of predation --- # This is just a sample <img src="03_sampling_lecture_x_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" /> --- class: center, middle # We chose it to be representative of the population. --- # But - how does my sample compare to a population? <img src="03_sampling_lecture_x_files/figure-html/samp_pop_plot-1.png" style="display: block; margin: auto;" /> --- # Describing a Gaussian population using a representative sample -- - **The mean**: If I were to draw from this sample, or it's like, what value would I most likely get? -- - **The standard deviation**: If I assume a normal distribution, what's the range of 66% or 95% of the variability? -- - **Skew and kurtosis**: Is this sample peaked, flat, shifted one way or another? --- # Expected Value: the Mean <img src="03_sampling_lecture_x_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- # Sample Estimate of Population Properties: **Mean** `$$\bar{x} = \frac{ \displaystyle \sum_{i=1}^{n}{x_{i}} }{n}$$` `\(\large \bar{x}\)` - The average value of a sample `\(x_{i}\)` - The value of a measurement for a single individual n - The number of individuals in a sample `\(\mu\)` - The average value of a population (Greek = population, Latin = Sample) --- # Sample versus Population - Latin characters (e.g., `\(\bar{x}\)`) for **sample ** - Greek chracters (e.g., `\(\mu\)`) for **population** <!-- .center[https://istats.shinyapps.io/sampdist_cont/] --> --- # How Variable is the Population <img src="03_sampling_lecture_x_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> --- # Sample Estimate of Population Properties: **Variance** How variable was that population? `$$\large s^2= \frac{\displaystyle \sum_{i=1}^{n}{(X_i - \bar{X})^2}} {n-1}$$` * Sums of Squares over n-1 * n-1 corrects for biase from estimating from a sample. * `\(\sigma^2\)` if describing the population * Units in square of measurement... --- # Sample Estimate of Population Properties: Standard Deviation $$ \large s = \sqrt{s^2}$$ * Units the same as the measurement * If distribution is normal, 67% of data within 1 SD * 95% within 2 SD * `\(\sigma\)` if describing the population --- # Sample Estimate of Population Properties: **Skew** How centrally distributed is your population? `$$\large s^2= \frac{\displaystyle \sum_{i=1}^{n}{(X_i - \bar{X})^3}} {(n-1)s^3}$$` * Positive, peak is shifted right * Negative, peak is shifted left * Zero, it's nice and normal --- # Sample Estimate of Population Properties: **Kurtosis** How much of your population is close to the mode versus way out in the tails? `$$\large s^2= \frac{\displaystyle \sum_{i=1}^{n}{(X_i - \bar{X})^4}} {(n-1)s^4}$$` * Larger than 3, sharply peaked distribution, narrow tails - leptokurtic * Less than 3, flat distribution, wide tails - platykurtic * 3 - normal --- # Always Remember - What is your Sample? What is your Population? Is one representative of the other? <img src="03_sampling_lecture_x_files/figure-html/samp_pop_plot-1.png" style="display: block; margin: auto;" /> --- # Outline 1. What is a Good Sample? 2. Sample Distributions and Parameters 3. .red[Non-parameteric Properties of Samples and Populations] --- # What if we don't want to assume Gaussian? Other ways to describe a population from a sample: -- - **Median**: What is the value smack in the middle? -- - **Upper and Lower Quantiles/Percentiles**: What are large and small values like? -- - **Interquartile Range**: Is our sample clustered, or waaaay spread out? --- # The Empirical Cummulative Distribution Plot of our Sample <img src="03_sampling_lecture_x_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- # The Box Plot (or Box-and-Whisker Plot) ![](images/03/Box-Plot-and-Whisker-Plot-1.png) --- # The Box Plot (or Box-and-Whisker Plot) <img src="03_sampling_lecture_x_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> --- # Two ways of seeing the same data <img src="03_sampling_lecture_x_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- # Median: Middle of the Sample <img src="03_sampling_lecture_x_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- # Mean versus Median .center[https://istats.shinyapps.io/MeanvsMedian/] --- # Quartiles: Upper and Lower Quarter <img src="03_sampling_lecture_x_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> 1<sup>st</sup> and 3<sup>rd</sup> **Quartile** = 25<sup>th</sup> and 75<sup>th</sup> **Percentile** --- # Reasonable Range of Data (Beyond which are Outliers) <img src="03_sampling_lecture_x_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- # Where's the action in R? ```r # Quantiles quantile(samp) ``` ``` 0% 25% 50% 75% 100% 31.41510 41.11041 46.56152 48.15706 60.69390 ``` ```r # Interquartile range IQR(samp) ``` ``` [1] 7.046652 ```