library(tidyverse) # to manipulate data and make plots
library(palmerpenguins) # loads the penguins data
library(pwr) # for power analysis
Tutorial 3: Statistical testing
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This practical is heavily inspired by the work of Yury Zablotski and Benjamin Yakir.
\[ \def\BB{{\mathcal{B}}} \def\EE{{\mathbb{E}}} \def\NN{{\mathcal{N}}} \def\poi{{\mathcal{P}}} \def\PP{{\mathbb{P}}} \def\TT{{\mathcal{T}}} \def\UU{{\mathcal{U}}} \def\VV{{\mathbb{V}}} \]
Introduction
Hypothesis testing tries to answer the question: Is there a phenomenon at all? The basic approach is to determine whether the observed data can or cannot be reasonably explained by a model of randomness that does not involve the phenomenon.
After the statistical model has been set, one may split the process of testing a statistical hypothesis into three steps:
- formulation of the hypotheses,
- \(H_0\): range of parameter values where the phenomenon is absent the null hypothesis
- \(H_1\): alternative hypothesis
- specification of the test,
- The decision rule is composed of a statistic and a subset of values of the statistic that correspond to the rejection of the null hypothesis
- the statistic is called the test statistic
- the subset of values is called the rejection region.
- reaching the final conclusion.
- Computing the observed value of the test statistic and checking whether or not the observed value belongs to the rejection region.
When we are building a statistical test, we make a trade off between two types of error:
- Type I error: rejecting a correct null hypothesis
- Type II error: not rejecting an incorrect null hypothesis
The test’s decision rule is designed so as to assure an acceptable probability of making a Type I error. Reducing the probability of a Type II error is desirable, but is of secondary importance.
The statistical power is the probability of rejecting the null hypothesis when the state of nature is the alternative hypothesis.
\[\text{Statistical Power} = 1 − \text{Probability of Type II Error}\]
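As a minimal illustration of this definition, the power of a test can be estimated by simulation: draw many samples under \(H_1\) and count how often \(H_0\) is rejected. This is a sketch in base R; the sample size, true probability under \(H_1\), and threshold are made-up illustration values.

```r
# Estimate statistical power by simulation (made-up illustration values).
set.seed(42)
n <- 100       # sample size
p_h1 <- 0.65   # true probability under the alternative hypothesis
alpha <- 0.05  # accepted probability of a Type I error
# draw 1000 samples under H1 and test H0: p = 0.5 on each of them
p_values <- replicate(
  1000,
  binom.test(rbinom(1, size = n, prob = p_h1), n = n, p = 0.5)$p.value
)
power_hat <- mean(p_values < alpha) # proportion of correct rejections
power_hat                           # roughly 0.8 with these values
```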
The \(p\)-value is a function of the data. It is equal to the significance level of the test in which the observed value of the statistic serves as the threshold. This allows us to control the Type I error.
\[p\text{-value} = P(|T| > |t|)\]
A statistical power of \(80\%\) (though higher is certainly better) and a significance level of \(5\%\) are often considered to be acceptable. This means
- an \(80\%\) chance of detecting a true effect (true positive)
- a \(20\%\) chance of missing a true effect (false negative)
- a \(95\%\) chance of detecting the absence of an effect (true negative)
- \(5\%\) chance of detecting an effect which isn’t there (false positive).
In this practical you will practice all these notions.
Packages
For this practical you will need the following packages:
Some custom functions are available in this file. You can load them in R with the following command:
source("https://gitbio.ens-lyon.fr/LBMC/hub/formations/ens_l3_stats/-/raw/main/tutorial_3_tests/utils.R?inline=false")
Loading the data
We are going to work on the famous Palmer penguins dataset. This dataset is an integrative study of the breeding ecology and population structure of Pygoscelis penguins along the western Antarctic Peninsula. These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network.
The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
The palmerpenguins library loads the penguins dataset into your R environment. If you are not familiar with tibbles, you just have to know that they are equivalent to data.frames.
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
You can use the summary() function to get the relevant moments of each variable (column):
summary(penguins)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
What are the continuous and categorical variables in this table? What is going to be the problem with the raw penguins table?
Solution
We are first going to drop the missing values (NA):

data <- penguins %>% drop_na()
dim(data) # display the number of rows and columns
[1] 333 8
The %>% operator, or pipe, in R takes the output of the function on the left and passes it as the first argument of the function on the right.
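As a minimal illustration (using the native |> pipe available in base R >= 4.1, which behaves like %>% for simple calls):

```r
# The pipe passes the left-hand value as the first argument of the
# function on the right: these two expressions are equivalent.
x <- c(1, 4, 9, 16)
a <- x |> sqrt() |> sum()  # piped version
b <- sum(sqrt(x))          # nested version
identical(a, b)            # TRUE: both compute 1 + 2 + 3 + 4 = 10
```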
Empirical \(p\)-value
What is the observed sex ratio in the Gentoo penguins, regardless of their location?
data %>%
  dplyr::filter(species == "Gentoo") %>%
  summarize(
    n_female = sum(sex == "female"),
    size = n(),
    ratio = n_female / size,
  )
# A tibble: 1 × 3
n_female size ratio
<int> <int> <dbl>
1 58 119 0.487
We wonder if the sex ratio of the Gentoo penguins, regardless of their location, is different from \(0.5\).
How would you formulate this question in terms of a null and an alternative hypothesis? Which parameter(s) of the model are you going to study?
Solution
- \(H_0\): The sex ratio in the Gentoo penguins is \(p = 0.5\)
- \(H_1\): The sex ratio in the Gentoo penguins is \(p \neq 0.5\)
Is this a two-sided or one-sided hypothesis test?
Solution
A two-sided hypothesis test: we can reject \(H_0\) for \(p < 0.5\) or \(p > 0.5\), i.e. \(|p - 0.5| > 0\).
How would you simulate this sample (with which probability distribution) under \(H_0\) (with which parameter values)?
Solution
With the binomial distribution rbinom with prob = 0.5 and size = 119:
rbinom(1, size=119, prob=0.5)
[1] 56
Compute an empirical sex-ratio distribution with this tool: generate \(1000\) values under the null hypothesis.
Solution
empirical_h0 <- tibble(
  n_female = rbinom(1000, size = 119, prob = 0.5)
) %>%
  mutate(
    ratio = n_female / 119
  )
With this simulation, how could you compute the probability of observing a sample with a sex ratio as low as \(0.487395\) under \(H_0\)?
Solution
We can compute the proportion of simulations where empirical_h0$ratio is lower than \(0.487395\):

empirical_h0 <- empirical_h0 %>%
  mutate(
    lower = ratio < 0.487395
  )
mean(empirical_h0$lower)
[1] 0.406
empirical_h0 %>%
  ggplot(aes(x = ratio)) +
stat_ecdf(geom = "step") +
geom_vline(xintercept = 0.487395, col = "green") +
geom_hline(yintercept = mean(empirical_h0$lower), col = "red") +
theme_bw()
Are we missing something to quantify our confidence in the null hypothesis?
Solution
We have a two-sided test: we also need to take into account the probability of the ratio being as extreme on the upper side, i.e. greater than \(1 - 0.487395\).

empirical_h0 <- empirical_h0 %>%
  mutate(
    lower = ratio < 0.487395,
    upper = ratio > 1 - 0.487395
  )
mean(empirical_h0$lower | empirical_h0$upper)
[1] 0.833
How would you find the \(p\)-value (and not the empirical \(p\)-value) associated with this test?
Solution
2*pbinom(58, size = 119, prob=0.5)
[1] 0.85463
or
pbinom(58, size=119, prob=0.5) + pbinom(119 - 58 - 1, size=119, prob=0.5, lower.tail = F)
[1] 0.85463
(you can check ?pbinom to understand the 119 - 58 - 1)
What can you do to make your empirical \(p\)-value closer to the \(p\)-value?
Solution
Increase the size of the simulation
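A sketch of this convergence, reusing the tutorial's numbers (58 females out of 119): the empirical \(p\)-value gets closer to the exact one as the number of simulated samples grows.

```r
# Empirical vs exact p-value for the two-sided sex-ratio test
# (58 females out of 119 Gentoo penguins, as above).
set.seed(1)
exact_p <- pbinom(58, size = 119, prob = 0.5) +
  pbinom(119 - 58 - 1, size = 119, prob = 0.5, lower.tail = FALSE)
empirical_p <- function(n_sim) {
  ratio <- rbinom(n_sim, size = 119, prob = 0.5) / 119
  mean(ratio < 0.487395 | ratio > 1 - 0.487395)
}
sapply(c(1e3, 1e4, 1e6), empirical_p) # approaches exact_p as n_sim grows
exact_p                               # 0.85463
```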
Sensitivity and specificity of a test
Sensitivity and specificity mathematically describe the accuracy of a test which reports the presence or absence of a condition. Individuals for which the condition is satisfied are considered positive, and those for which it is not are considered negative.
- Sensitivity (true positive rate) refers to the probability of a positive test, conditioned on truly being positive.
- Specificity (true negative rate) refers to the probability of a negative test, conditioned on truly being negative.
\[\text{sensitivity} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false negatives}}\] \[\text{specificity} = \frac{\text{number of true negatives}}{\text{number of true negatives} + \text{number of false positives}}\]
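The two formulas above can be sketched as small helper functions (the confusion-matrix counts below are made-up illustration values):

```r
# Sensitivity and specificity from confusion-matrix counts
# (made-up illustration values).
sensitivity <- function(TP, FN) TP / (TP + FN)  # true positive rate
specificity <- function(TN, FP) TN / (TN + FP)  # true negative rate
sensitivity(TP = 80, FN = 20)  # 0.8
specificity(TN = 95, FP = 5)   # 0.95
```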
You can use the following function to generate binomial samples for which we know whether they follow \(H_0\) or \(H_1\):
simulation <- sim_binom(
  replicate = 1000, # number of simulations
  size = 119, # size of the simulated sample
  p_h0 = 0.5, # p under H_0
  p_h1 = 0.6, # p under H_1
  threshold = seq(from = 0.001, to = 0.999, by = 0.001) # alpha values
)
simulation
# A tibble: 999,000 × 9
truth sample statistic alpha rejection TP FP FN TN
<lgl> <list> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 TRUE <int [1]> 0.0437 0.001 FALSE FALSE FALSE TRUE FALSE
2 TRUE <int [1]> 0.0437 0.002 FALSE FALSE FALSE TRUE FALSE
3 TRUE <int [1]> 0.0437 0.003 FALSE FALSE FALSE TRUE FALSE
4 TRUE <int [1]> 0.0437 0.004 FALSE FALSE FALSE TRUE FALSE
5 TRUE <int [1]> 0.0437 0.005 FALSE FALSE FALSE TRUE FALSE
6 TRUE <int [1]> 0.0437 0.006 FALSE FALSE FALSE TRUE FALSE
7 TRUE <int [1]> 0.0437 0.007 FALSE FALSE FALSE TRUE FALSE
8 TRUE <int [1]> 0.0437 0.008 FALSE FALSE FALSE TRUE FALSE
9 TRUE <int [1]> 0.0437 0.009 FALSE FALSE FALSE TRUE FALSE
10 TRUE <int [1]> 0.0437 0.01 FALSE FALSE FALSE TRUE FALSE
# ℹ 998,990 more rows
You can use the following function to compute the sensitivity and specificity depending on the risk \(\alpha\).
sens_spec <- compute_sens_spec(simulation)
You can then plot the sensitivity and specificity for your test as a function of the risk \(\alpha\).
sens_spec %>%
  ggplot() +
geom_line(aes(x = alpha, y = sensitivity)) +
xlim(c(0,1)) + ylim(c(0,1)) + coord_fixed() +
theme_bw()
sens_spec %>%
  ggplot() +
geom_line(aes(x = alpha, y = specificity)) +
xlim(c(0,1)) + ylim(c(0,1)) + coord_fixed() +
theme_bw()
The ROC (receiver operating characteristic) curve plots the sensitivity (True Positive Rate) as a function of the False Positive Rate (FPR).
sens_spec %>%
  ggplot() +
geom_line(aes(x = FPR, y = sensitivity)) +
xlim(c(0,1)) + ylim(c(0,1)) + coord_fixed() +
theme_bw()
What shape of ROC curve do you expect if the conclusion of your test were the result of flipping a coin?
Solution
The ROC curve would follow the diagonal of the plot.
One sample Student test
Male and female emperor penguins are similar in plumage and size, reaching 100 cm in length and weighing from 22 to 45 kg (\(\mu_0 = 33500\) g) (source: Wikipedia).
We can ask ourselves if the Gentoo penguins, which are the heaviest of the 3 species, could be mistaken for emperor penguins based on their weight.
data %>%
  group_by(species) %>%
summarise(
mean(body_mass_g)
)
# A tibble: 3 × 2
species `mean(body_mass_g)`
<fct> <dbl>
1 Adelie 3706.
2 Chinstrap 3733.
3 Gentoo 5092.
How would you formulate this question in terms of a null and an alternative hypothesis?
Solution
- \(H_0\): The average weight of the Gentoo penguins is equal to the average weight of the emperor penguins \(E(X) = \mu_0\)
- \(H_1\): The average weight of the Gentoo penguins is different from the average weight of the emperor penguins \(E(X) \neq \mu_0\)
where \(X\) is the random variable giving the weight of a Gentoo penguin.
Or equivalently:
- \(H_0\): \(\mu = \mu_0\)
- \(H_1\): \(\mu \neq \mu_0\)
when noting \(\mu\) the Gentoo penguins population average weight.
Is this a two-sided or one-sided hypothesis test? How many variables are in this model?
Solution
A two-sided hypothesis test: we can reject \(H_0\) for \(E(X) < \mu_0\) or \(E(X) > \mu_0\), i.e. \(|E(X) - \mu_0| > 0\).
Can we consider the weight distribution of the Gentoo penguins to follow a Normal distribution?
data %>%
  filter(species == "Gentoo") %>%
ggplot() +
geom_histogram(aes(x = body_mass_g), binwidth = 100) +
theme_bw()
Solution
We can use a QQ-plot
data %>%
  filter(species == "Gentoo") %>%
ggplot(aes(sample=body_mass_g)) +
stat_qq() +
stat_qq_line()
It’s not perfect but we will continue with the Normal distribution hypothesis.
We will see later how to quantitatively answer this question.
Which test statistic would you propose to challenge the null hypothesis?
Solution
We are going to study the statistic of the difference between a mean estimator and a population average.
To take into account the sampling variability, we need to scale this statistic by the standard error.
\(t\)-value
The \(t\)-value measures the difference between the compared means, scaled by the standard error.
\[t\text{-value} = \frac{\text{mean} - \text{expected mean}}{\text{standard error}}\] with \(\text{standard error} = \frac{S(X)}{\sqrt{n}}\)
What is the distribution of the numerator and of the denominator?
Solution
Under the Gaussian assumption on the data:
\[\widehat{\mu}(X)-\mu_0 \underset{H_0}{\sim} \mathcal{N}(0, \sigma^2/n)\] \[(n-1)\,S^2(X) \sim \sigma^2\chi^2(n-1)\]
The \(t\)-value follows a known distribution: the Student distribution with \(n-1\) degrees of freedom.
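Before applying it to the penguins, the formula can be checked by hand on a made-up sample against R's t.test():

```r
# Check the t-value and p-value formulas on a made-up sample.
x <- c(4.8, 5.2, 5.6, 4.9, 5.3, 5.1, 5.4, 5.0)  # toy measurements
mu0 <- 5                                         # mean under H0
t_val <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))  # observed t-value
p_val <- 2 * pt(-abs(t_val), df = length(x) - 1)      # two-sided p-value
fit <- t.test(x, mu = mu0)
all.equal(unname(fit$statistic), t_val)  # TRUE
all.equal(fit$p.value, p_val)            # TRUE
```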
You can use the t.test() command to test \(H_0\).
gentoo_weight <- data %>%
  filter(species == "Gentoo") %>%
  pull(body_mass_g)
t.test(gentoo_weight, mu = 33500)
One Sample t-test
data: gentoo_weight
t = -617.96, df = 118, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 33500
95 percent confidence interval:
5001.403 5183.471
sample estimates:
mean of x
5092.437
The output of the t.test() function gives you:
- the \(t\)-value: \(-617.96\)
- the degrees of freedom: \(118\)
- the \(p\)-value: \(< 2.2\times 10^{-16}\)
- \(\widehat{\mu}(X)\): \(5092.437\)
- a \(0.95\) confidence interval on \(\widehat{\mu}(X)\): \([5001.403, 5183.471]\)
Do you understand all these outputs?
We can decompose this computation. For the \(t\)-value:
t_stat <- (mean(gentoo_weight) - 33500) / (sd(gentoo_weight) / sqrt(length(gentoo_weight)))
t_stat
[1] -617.9555
and for the \(p\)-value (two different ways to compute it from the cumulative distribution function in the two-sided case):
pt(- abs(t_stat), length(gentoo_weight) - 1) * 2
[1] 5.825327e-209
2 * (1 - pt(abs(t_stat), length(gentoo_weight) - 1))
[1] 0
Depending on your computer, the t.test() \(p\)-value and the pt() one may not be equal. To check our formula, test the following null hypothesis:
\(H_0\) : \(E(X) = 5000\)
Solution
For the t.test:
t.test(gentoo_weight, mu = 5000)$p.value
[1] 0.04662625
for the pt:

t_stat <- (mean(gentoo_weight) - 5000) / (sd(gentoo_weight) / sqrt(length(gentoo_weight)))
pt(-abs(t_stat), length(gentoo_weight) - 1) * 2
[1] 0.04662625
2 * (1 - pt(abs(t_stat), length(gentoo_weight) - 1))
[1] 0.04662625
The two \(p\)-values are now exactly the same. Be mindful of the numerical limitations of your computer when working with tiny numbers.
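A minimal demonstration of this precision issue, using the same order of magnitude of \(t\) and degrees of freedom as above:

```r
# With a large t-value, the upper tail is so small that subtracting it
# from 1 underflows, while the direct lower-tail computation does not.
1 - pt(618, df = 118)  # 0: pt(618, 118) rounds to exactly 1
pt(-618, df = 118)     # tiny but strictly positive
```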
The following function plots the density of the \(T\) statistic, the red area corresponding to the \(p\)-value for a given \(\mu\) and sample \(x\), with the observed \(t\)-value (and its opposite) as vertical bars, in the two-sided case.
plot_student(x = gentoo_weight, mu = 33500)
plot_student(x = gentoo_weight, mu = 5000)
With a threshold of \(0.05\), would you take the risk of rejecting \(H_0\) “the average weight of Gentoo penguins is equal to the average weight of the emperor penguins”?
Solution
Yes, except maybe for a baby emperor penguin.
Two samples Student test
Instead of comparing our Gentoo penguins to a reference, we now want to compare the weights of Adelie and Chinstrap penguins.
data %>%
  group_by(species) %>%
summarise(
mean(body_mass_g)
)
# A tibble: 3 × 2
species `mean(body_mass_g)`
<fct> <dbl>
1 Adelie 3706.
2 Chinstrap 3733.
3 Gentoo 5092.
Based on the observed means, which hypothesis would you want to test?
Solution
We note \(\hat\mu_a\) and \(\hat\mu_c\) the observed average weights respectively for the Adelie and Chinstrap penguins, and \(\mu_a\) and \(\mu_c\) the (unknown) population average weights respectively for the Adelie and Chinstrap penguins:
Either the two-sided test:
- \(H_0\): \(\mu_a = \mu_c\)
- \(H_1\): \(\mu_a \neq \mu_c\)
Or one of the one-sided tests:
- \(H_0\): \(\mu_a \geq \mu_c\)
- \(H_1\): \(\mu_a < \mu_c\)
or
- \(H_0\): \(\mu_a \leq \mu_c\)
- \(H_1\): \(\mu_a > \mu_c\)
We are going to extract the sampled penguins weights per species.
adelie_weight <- data %>%
  filter(species == "Adelie") %>%
  pull(body_mass_g)
chinstrap_weight <- data %>%
  filter(species == "Chinstrap") %>%
  pull(body_mass_g)
Like for the one sample Student test, we are going to use the \(t\)-statistic. However, there are two versions of this statistic for the two sample case:
- A statistic for the case where \(\sigma_a^2 = \sigma_c^2\), with \(\sigma_a^2\) the Adelie penguins weight variance and \(\sigma_c^2\) the Chinstrap penguins weight variance
- A statistic for the case where \(\sigma_a^2 \neq \sigma_c^2\)
data %>%
  group_by(species) %>%
summarise(
mean(body_mass_g),
var(body_mass_g),
sd(body_mass_g),
)
# A tibble: 3 × 4
species `mean(body_mass_g)` `var(body_mass_g)` `sd(body_mass_g)`
<fct> <dbl> <dbl> <dbl>
1 Adelie 3706. 210332. 459.
2 Chinstrap 3733. 147713. 384.
3 Gentoo 5092. 251478. 501.
We will see later how to test \(H_0\) : \(\sigma_a^2 = \sigma_c^2\).
In the meantime, we will work under the hypothesis that the variances are not equal between the two species, which is the default for the t.test() function.
Two-sided hypothesis
t.test(adelie_weight, chinstrap_weight)
Welch Two Sample t-test
data: adelie_weight and chinstrap_weight
t = -0.44793, df = 154.03, p-value = 0.6548
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-145.66494 91.81724
sample estimates:
mean of x mean of y
3706.164 3733.088
What are the differences compared to the one sample Student test output?
Solution
We now have two sample estimates of the mean, one per group.
What can you conclude for \(H_0\): \(\mu_a = \mu_c\)?
The following plot displays the observed \(t\)-value as a vertical black line and the rejection area for a given \(\alpha\) in red. You can play with the alpha parameter.
plot_student_2_alpha_twosided(adelie_weight, chinstrap_weight, alpha = 0.05)
Solution
We don’t have enough evidence to reject \(H_0\).
One-sided hypothesis
We now want to test if the Adelie penguins are lighter than the Chinstrap penguins.
What is our null hypothesis?
Solution
The null hypothesis corresponds to the absence of effect. Here it would mean that Adelie penguins are heavier than or of equal weight compared to the Chinstrap penguins.
- \(H_0\): \(\mu_a \geq \mu_c\), with \(\mu_a\) the average weight of the Adelie penguins and \(\mu_c\) the average weight of the Chinstrap penguins
- \(H_1\): \(\mu_a < \mu_c\)
We can use the alternative = c("two.sided", "less", "greater") option. The default is "two.sided". Which option between "less" and "greater" should you use?
Solution
You test the difference between the means:
\(H_0\): \(\mu_a \geq \mu_c\) is equivalent to
\(H_0\): \(\mu_a - \mu_c \geq 0\)
Therefore the alternative is \(\mu_a - \mu_c < 0\) and we should use "less"
t.test(adelie_weight, chinstrap_weight, alternative = "less")
Welch Two Sample t-test
data: adelie_weight and chinstrap_weight
t = -0.44793, df = 154.03, p-value = 0.3274
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 72.54212
sample estimates:
mean of x mean of y
3706.164 3733.088
Why is the \(p\)-value different from the one computed for the two-sided test?
Solution
We are only considering the probability of the true difference in means being less than zero, while ignoring the greater-than-zero possibility.
For these 3 plots the red area under the curve is the same; you can play with the alpha parameter.
plot_student_2_alpha_twosided(adelie_weight, chinstrap_weight, alpha=0.05)
plot_student_2_alpha_less(adelie_weight, chinstrap_weight, alpha=0.05)
plot_student_2_alpha_greater(adelie_weight, chinstrap_weight, alpha=0.05)
Homogeneous variance
We made the assumption that \(\sigma_a^2 \neq \sigma_c^2\); what happens to the two-sided test if \(\sigma_a^2 = \sigma_c^2\)?
Solution
For \(\sigma_a^2 \neq \sigma_c^2\) we have
\[T(X) = \frac{\overline{X_a} -\overline{X_c} }{\sqrt{S^2(X_a)/n_a+S^2(X_c)/n_c}}\]
For \(\sigma_a^2 = \sigma_c^2\) we have
\[T(X) = \frac{\overline{X_a} -\overline{X_c} }{S(X_{ac}) \sqrt{ 1/n_a + 1/n_c}}\]
How will the \(p\)-value change with the homogeneous variance hypothesis?
You can use the var.equal = T option of the t.test() function to explore this question.
Solution
t.test(adelie_weight, chinstrap_weight)
Welch Two Sample t-test
data: adelie_weight and chinstrap_weight
t = -0.44793, df = 154.03, p-value = 0.6548
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-145.66494 91.81724
sample estimates:
mean of x mean of y
3706.164 3733.088
t.test(adelie_weight, chinstrap_weight, var.equal = T)
Two Sample t-test
data: adelie_weight and chinstrap_weight
t = -0.42011, df = 212, p-value = 0.6748
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-153.2538 99.4061
sample estimates:
mean of x mean of y
3706.164 3733.088
Note the difference in degrees of freedom between the two tests.
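The pooled (equal-variance) statistic above can also be computed by hand and checked against t.test(var.equal = TRUE); a sketch on made-up samples:

```r
# Pooled (equal-variance) t statistic by hand on made-up samples,
# checked against t.test(var.equal = TRUE).
x <- c(3.1, 2.8, 3.5, 3.0, 3.3, 2.9)
y <- c(3.4, 3.6, 3.2, 3.8, 3.5)
n_x <- length(x); n_y <- length(y)
# pooled variance: weighted average of the two sample variances
s2_pooled <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)
t_val <- (mean(x) - mean(y)) / sqrt(s2_pooled * (1 / n_x + 1 / n_y))
fit <- t.test(x, y, var.equal = TRUE)    # df = n_x + n_y - 2 = 9
all.equal(unname(fit$statistic), t_val)  # TRUE
```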
You can use the following function to see the effect of the degrees of freedom on the Student distribution shape. Play with the df parameter.
plot_student_df(df = 3, alpha = 0.05)
Power analysis
The statistical power is the probability of rejecting the null hypothesis when the state of nature is the alternative hypothesis.
\[\text{Statistical Power} = 1 − \text{Probability of Type II Error}\]
For statistical testing, the power of a test depends on 4 criteria:
- the sample size: a larger sample size makes it easier to detect the effect and so increases the statistical power of the experiment
- the significance level: increasing the threshold \(\alpha\) increases the power of the test and the chance of getting a statistically significant result, and decreases the number of observations needed, at the cost of more false positives
- the effect size: the formula describing the effect size differs between tests. The larger the effect size, the higher the power to detect it
- the statistical power: the probability of finding a real effect (true positive). Your goal is to maximize the statistical power of the experiment while minimizing the sample size.
Knowing any three of them allows you to estimate the fourth.
You can use a power analysis before an experiment to calculate the optimal sample size for a given power. You can also do a power analysis after an experiment or sampling to determine the power of your experiment, indicating the plausibility of your results.
library(pwr)
The pwr package uses Cohen’s effect size definition as a parameter. For the two-sample Student test, the effect size formula is the following:
\[\text{effect size} = \frac{E(X_a) - E(X_b)}{\sqrt{\frac{S^2(X_a) + S^2(X_b)}{2}}}\]
What happens to the effect size when the pooled standard deviation increases?
Solution
The effect size decreases. The more variability in the data, the larger the difference between the means must be to be detected.
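The effect size formula above can be sketched as a small function; note that this is a hypothetical stand-in for the effect_size_t() helper from utils.R, assuming that helper implements Cohen's d:

```r
# Cohen's d for two samples, following the formula above
# (a hypothetical stand-in for the effect_size_t() helper from utils.R).
cohen_d <- function(a, b) {
  (mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2)
}
# made-up samples: a mean shift of 1 with unit standard deviations
set.seed(3)
a <- rnorm(50, mean = 10, sd = 1)
b <- rnorm(50, mean = 11, sd = 1)
cohen_d(a, b)  # close to -1
```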
Power
You can use the pwr.t2n.test() function to study the relation between:
- the sample sizes: n1 and n2
- the significance level: sig.level
- the effect size: d
- the statistical power: power
Setting one of these parameters to NULL tells the function to compute it from the 3 others.
pwr.t2n.test(
n1 = length(adelie_weight),
n2 = length(chinstrap_weight),
d = effect_size_t(adelie_weight, chinstrap_weight),
sig.level = 0.05,
power = NULL
)
t test power calculation
n1 = 146
n2 = 68
d = 0.06363307
sig.level = 0.05
power = 0.07158508
alternative = two.sided
What do you think of the power of our previous analysis?
Solution
\(0.07158508\) doesn’t seem like a lot of power.
\[\text{Statistical Power} = 1 − \text{Probability of Type II Error}\]
It corresponds to a \(0.9284149\) probability of Type II error, the error we make when we do not reject an incorrect null hypothesis.
How many penguins would have been needed to detect a weight difference between the two populations? For this you have to choose a good level of power and use the pwr.t.test() function, which has a unique n parameter for groups of equal size.
Solution
pwr.t.test(
n = NULL,
d = effect_size_t(adelie_weight, chinstrap_weight),
sig.level = 0.05,
power = 0.8
)
Two-sample t test power calculation
n = 3877.738
d = 0.06363307
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
Effect size
When planning an experiment, we often know neither the exact power nor the effect size beforehand. A power analysis can help you determine the size of your sample.
Play with the parameters of the following functions to see their impact on the sample size required to detect the effect. The dotted line corresponds to \(0.5\) power.
plot_effect_size(
effect_sizes = seq(from = 0.1, to = 1.5, length.out = 20),
power = 0.8,
sig_level = 0.05
)
plot_effect_size_power(
effect_sizes = seq(from = 0.3, to = 1.5, length.out = 20),
powers = seq(from = 0.5, to = 0.95, by = 0.1),
sig_level = 0.05
)
Paired data for two sample Student tests
When data are paired, we can use another version of the two sample Student test. Two samples are paired if values in both groups are somehow connected. Thus, for a paired \(t\)-test, the two sample sizes have to be equal (they can be different in the unpaired case that we saw previously).
For example, in the penguins data the bill length and the bill depth are paired: they belong to the same individual.
adelie_bill_length <- data %>%
  dplyr::filter(species == "Adelie") %>%
  pull(bill_length_mm)
adelie_bill_depth <- data %>%
  dplyr::filter(species == "Adelie") %>%
  pull(bill_depth_mm)
Since the samples are connected and we want to know the difference between them, we need to calculate this difference by simply subtracting one value from the other.
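A sketch of this equivalence on made-up paired measurements: a paired \(t\)-test gives exactly the same result as a one-sample \(t\)-test on the differences.

```r
# A paired t-test is a one-sample t-test on the differences.
# The measurements below are made-up paired values (e.g. before/after).
before <- c(12.1, 11.8, 12.5, 12.0, 11.9, 12.3)
after  <- c(12.6, 12.1, 12.9, 12.2, 12.4, 12.8)
paired_fit <- t.test(before, after, paired = TRUE)
diff_fit   <- t.test(before - after, mu = 0)
all.equal(paired_fit$p.value, diff_fit$p.value)                      # TRUE
all.equal(unname(paired_fit$statistic), unname(diff_fit$statistic))  # TRUE
```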
Compare the effect of the paired = T option between the paired and unpaired Student test:
t.test(adelie_bill_depth, adelie_bill_length, paired = T)
Paired t-test
data: adelie_bill_depth and adelie_bill_length
t = -100.42, df = 145, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-20.87975 -20.07368
sample estimates:
mean difference
-20.47671
t.test(adelie_bill_depth, adelie_bill_length, paired = F)
Welch Two Sample t-test
data: adelie_bill_depth and adelie_bill_length
t = -84.487, df = 203.26, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-20.95459 -19.99884
sample estimates:
mean of x mean of y
18.34726 38.82397
Solution
We can see that even if the \(p\)-values are very small for the two tests, the paired test has a larger absolute statistic. The sample estimate is only the mean difference instead of mean of x and mean of y.
For the same data, taking into account the nature of the data (paired) increases the power of the test.
The same is true for the var.equal = T parameter of t.test().
Fisher variance test
In the “Student test” section, we encountered the following question: Are the variances of the weight of the Adelie and the Chinstrap penguins equal?
You have seen that the estimator for the variance follows a \(\chi^2\) distribution.
\[(n_a-1)\,S_a^2(X_a) \sim \sigma^2\chi^2(n_a-1), \qquad (n_c-1)\,S_c^2(X_c) \sim \sigma^2\chi^2(n_c-1).\] And the ratio of two independent \(\chi^2\) (each divided by its degrees of freedom) follows a Fisher distribution:
\[T(X) = \frac{S_a^2(X_a)}{S_c^2(X_c)} \underset{H_0}{\sim} \mathcal{F}(n_a-1,n_c-1)\]
From this statistic, which hypothesis would you want to test?
Solution
The two-sided test:
- \(H_0\): \(\frac{\sigma_a^2}{\sigma_c^2} = 1\)
- \(H_1\): \(\frac{\sigma_a^2}{\sigma_c^2} \neq 1\)
The var.test() function allows us to test this hypothesis. Explore the output of this function.
Solution
var.test(adelie_weight, chinstrap_weight)
F test to compare two variances
data: adelie_weight and chinstrap_weight
F = 1.4239, num df = 145, denom df = 67, p-value = 0.1047
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.9280647 2.1167534
sample estimates:
ratio of variances
1.423922
- the estimate of the ratio of variances: \(1.423922\)
- a \(0.95\) confidence interval: \([0.9280647, 2.1167534]\)
- the numerator degrees of freedom: \(145\)
- the denominator degrees of freedom: \(67\)
- the \(p\)-value: \(0.1047\)
- the statistic: \(1.4239\)
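The statistic and \(p\)-value above can be reproduced by hand from the Fisher cumulative distribution function; a sketch on made-up samples:

```r
# Check the F statistic and p-value by hand on made-up samples.
set.seed(7)
x <- rnorm(30, sd = 2)
y <- rnorm(25, sd = 2)
f_val <- var(x) / var(y)  # ratio of the two sample variances
fit <- var.test(x, y)     # F ~ Fisher(29, 24) under H0
all.equal(unname(fit$statistic), f_val)  # TRUE
# two-sided p-value: twice the smaller tail of the Fisher distribution
p_val <- 2 * min(pf(f_val, 29, 24), 1 - pf(f_val, 29, 24))
all.equal(fit$p.value, p_val)            # TRUE
```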
You can display the Fisher density function under the null hypothesis with the following function. Play with the alpha parameter.
plot_fisher_var_test_alpha(adelie_weight, chinstrap_weight, alpha=0.05)
\(\chi^2\) (Chi-square)
\(\chi^2\) goodness of fit
With the \(\chi^2\) goodness of fit test, we can find out whether observed counts (frequencies) differ from expected counts (if the data were distributed according to a given probability distribution).
From the island variable, we can wonder whether the Adelie penguins are uniformly distributed across the three islands (Biscoe, Dream and Torgersen).
table(data$species, data$island)
Biscoe Dream Torgersen
Adelie 44 55 47
Chinstrap 0 68 0
Gentoo 119 0 0
What is our null hypothesis?
Solution
- \(H_0\): the Adelie penguin counts (frequencies) are equal across islands
- \(H_1\): the Adelie penguin counts (frequencies) differ across islands
The statistic of this test is the following:
\[ T(N) = \sum_{j=1}^{k}\frac{\left(N_j-np_j\right)^2}{np_j} \underset{n \rightarrow \infty}{\sim} \chi^2(k-1) \]
With:
- \(j\) the island
- \(N_j\) the number of penguins on island \(j\)
- \(p_j\) the expected proportion of penguins on island \(j\)
- \(n\) the total number of penguins
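The statistic can be computed by hand on the Adelie counts from the table above:

```r
# Chi-square goodness-of-fit statistic by hand (Adelie counts per island).
obs <- c(44, 55, 47)                     # observed counts
expected <- sum(obs) * c(1/3, 1/3, 1/3)  # n * p_j under H0
t_stat <- sum((obs - expected)^2 / expected)
t_stat                                   # 1.3288, as chisq.test() reports
pchisq(t_stat, df = length(obs) - 1, lower.tail = FALSE)  # p-value 0.5146
```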
We can use the chisq.test() function to perform this test.
chisq.test(x = c(44, 55, 47), p = c(1/3, 1/3, 1/3))
Chi-squared test for given probabilities
data: c(44, 55, 47)
X-squared = 1.3288, df = 2, p-value = 0.5146
How do you interpret the \(p\)-value of this test?
Solution
From the data, with a threshold of \(0.05\), we cannot reject the null hypothesis \(H_0\).
You can use the following function to plot the density of the \(\chi^2\) under \(H_0\).
plot_chi2_fit_alpha(x = c(44, 55, 47), p = c(1/3, 1/3, 1/3), alpha=0.05)
Why is the rejection area only on the upper tail of the density function?
Solution
The \(\chi^2\) statistic is a scaled squared distance. We cannot encounter the "less" one-sided hypothesis.
We can compute the \(p\)-value with the following command
1-pchisq(1.3288, 2)
[1] 0.5145822
Increase the size of your sample with the rep() function. What happens to the density of the \(\chi^2\) under \(H_0\)?
Solution
chisq.test(x = rep(c(44, 55, 47), 3), p = rep(1/9, 9))
Chi-squared test for given probabilities
data: rep(c(44, 55, 47), 3)
X-squared = 3.9863, df = 8, p-value = 0.8584
plot_chi2_fit_alpha(x = rep(c(44, 55, 47), 3), p = rep(1/9, 9), alpha=0.05)
\(\chi^2\) independence of categorical variables
The independence test checks whether there is a relationship between categorical variables. If variables are dependent, we can find (predict) values of one variable knowing only values of the other.
You can use the pairs() function to display the pairwise relations between continuous variables:

data %>% select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>% pairs()
You are going to learn how to study these relations in practical 5.
Is the pairs() plot informative for categorical variables (species, island, sex, year)?
Solution
You can make a mosaic plot (mosaicplot()) of a contingency table (table()) to display the repartition.
data %>%
  select(species, island, sex, year) %>%
  table() %>%
  mosaicplot(color = T)
We are going to ask ourselves whether the sampling effort across the three species was the same between 2007 and 2009.
data %>%
  select(year, species) %>%
  table()
species
year Adelie Chinstrap Gentoo
2007 44 26 33
2008 50 18 45
2009 52 24 41
What is our null hypothesis?
Solution
- \(H_0\): the categorical variables year and species are independent
- \(H_1\): the categorical variables year and species are not independent (some species are more or less sampled depending on the year)
You can use the chisq.test() function to test this hypothesis
chisq.test(x = data$year, data$species)
Pearson's Chi-squared test
data: data$year and data$species
X-squared = 3.2711, df = 4, p-value = 0.5135
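As a sketch, the Pearson statistic can be recomputed by hand from the contingency table above (counts copied from the output): under independence, the expected count in each cell is row total × column total / grand total.

```r
# Contingency table copied from the output above
counts <- matrix(
  c(44, 26, 33,
    50, 18, 45,
    52, 24, 41),
  nrow = 3, byrow = TRUE,
  dimnames = list(year = as.character(2007:2009),
                  species = c("Adelie", "Chinstrap", "Gentoo"))
)
# Expected counts under independence: outer product of the margins / total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
t_stat <- sum((counts - expected)^2 / expected)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
c(t_stat, df, 1 - pchisq(t_stat, df)) # matches chisq.test(): 3.2711, 4, 0.5135
```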
Would you change the sampling strategy?
Solution
If it doesn't seem broken, don't fix it: with an \(\alpha\) threshold of 0.05, we cannot reject independence, so the sampling strategy can stay as it is.
To study the \(\chi^2\) independence test, we are going to generate fake sampling years with the fake_year() function.
In 2010, you sent a student on the field to sample 200 penguins. However, for some reason, this student seems to prefer the Chinstrap penguins.
fake_data <- data %>%
  select(year, species) %>%
  bind_rows(
    fake_year(2010, size = 200, p = c(1/10, 8/10, 1/10))
  )
table(fake_data)
species
year Adelie Chinstrap Gentoo
2007 44 26 33
2008 50 18 45
2009 52 24 41
2010 14 168 18
Can you detect this sampling bias in your data?
Solution
chisq.test(fake_data$year, fake_data$species)
Pearson's Chi-squared test
data: fake_data$year and fake_data$species
X-squared = 207.75, df = 6, p-value < 2.2e-16
We reject \(H_0\) with the additional sampling of 2010.
When using the \(\chi^2\) independence test, we don't know beforehand which part of the data drives the dependence. Here, for most of the data (3 of the 4 sampling years), the two variables show no sign of dependence.
Sadly for the penguin field work, there were severe cuts in the sub-antarctic island trip budget. The number of penguins that you were able to sample (after a serious discussion with your student) dropped dramatically.
poor_fake_data <- data %>%
  select(year, species) %>%
  bind_rows(
    fake_year(2010, size = 200, p = c(1/10, 8/10, 1/10)),
    fake_year(2011, size = 100, p = c(1/10, 8/10, 1/10)),
    fake_year(2012, size = 50, p = c(1/3, 1/3, 1/3)),
    fake_year(2013, size = 10, p = c(1/3, 1/3, 1/3))
  )
table(poor_fake_data)
species
year Adelie Chinstrap Gentoo
2007 44 26 33
2008 50 18 45
2009 52 24 41
2010 28 149 23
2011 9 77 14
2012 16 13 21
2013 1 4 5
What is different in the chisq.test() output?
Solution
chisq.test(poor_fake_data$year, poor_fake_data$species)
Pearson's Chi-squared test
data: poor_fake_data$year and poor_fake_data$species
X-squared = 209.73, df = 12, p-value < 2.2e-16
We get a warning because one of the categories (2013) doesn't contain enough observations. Instead of relying on the asymptotic \(\chi^2\) distribution of the test statistic, you can use the simulate.p.value = T parameter.
chisq.test(poor_fake_data$year, poor_fake_data$species, simulate.p.value = T)
Pearson's Chi-squared test with simulated p-value (based on 2000
replicates)
data: poor_fake_data$year and poor_fake_data$species
X-squared = 209.73, df = NA, p-value = 0.0004998
The B = 2000 parameter implies a minimum p-value of about \(0.0005\) (1/(B+1)).
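A sketch of what simulate.p.value = T does under the hood: draw B tables with the same margins (using base R's r2dtable()) and count how often the simulated statistic is at least as large as the observed one. The table is the 2007-2009 one from above; the seed is an arbitrary choice and the exact Monte Carlo p-value will vary with it.

```r
set.seed(42) # arbitrary seed, for reproducibility
obs <- matrix(c(44, 26, 33,
                50, 18, 45,
                52, 24, 41), nrow = 3, byrow = TRUE)
# Pearson statistic of a table
stat <- function(m) {
  e <- outer(rowSums(m), colSums(m)) / sum(m)
  sum((m - e)^2 / e)
}
B <- 2000
# Random tables with the same row and column margins as obs
sims <- r2dtable(B, rowSums(obs), colSums(obs))
# Monte Carlo p-value, bounded below by 1 / (B + 1)
p_sim <- (1 + sum(sapply(sims, stat) >= stat(obs))) / (B + 1)
p_sim
```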
(optional) Asymptotic vs exact independence tests
For unknown reasons, the penguins have recently fled the sampling locations and it has become difficult to sample penguins.
Here are the new data sampled from 2014 to 2018:
new_data <- tibble(
  year = c(
2014, 2014, 2014, 2014, 2014, 2014, 2014,
2014, 2014, 2014, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018),
species = factor(c(
"Adelie", "Chinstrap", "Chinstrap", "Adelie", "Adelie", "Chinstrap",
"Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo",
"Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo",
"Gentoo", "Chinstrap", "Gentoo", "Adelie", "Chinstrap", "Chinstrap",
"Gentoo", "Adelie", "Adelie", "Gentoo", "Adelie", "Gentoo", "Gentoo",
"Gentoo", "Chinstrap", "Chinstrap", "Gentoo", "Chinstrap", "Gentoo",
"Adelie", "Chinstrap", "Gentoo", "Adelie", "Gentoo", "Adelie",
"Adelie", "Gentoo", "Chinstrap", "Chinstrap", "Adelie"))
)
table(new_data$year, new_data$species)
Adelie Chinstrap Gentoo
2014 3 3 4
2015 0 0 10
2016 3 3 4
2017 2 3 5
2018 4 3 3
Again, we want to know if the individuals from the different species are uniformly sampled across the years, or in other words, if the variables year and species are independent.
What test could we use, and what would be the hypotheses in this case?
Solution
We could use a \(\chi^2\) independence test with the following hypotheses:
- \(H_0\): the categorical variables year and species are independent
- \(H_1\): the categorical variables year and species are not independent (some species are more or less sampled depending on the year)
Run the test. What is the problem here?
Solution
chisq.test(new_data$year, new_data$species)
Pearson's Chi-squared test
data: new_data$year and new_data$species
X-squared = 12.756, df = 8, p-value = 0.1205
The sample size is too small, so the test statistic might not follow the asymptotic \(\chi^2\) distribution.
In this specific case, since we are working with a contingency table (i.e. crossed headcounts between conditions recorded by two categorical variables), we can build an exact test (i.e. a test that does not rely on an asymptotic approximation), e.g. the Fisher’s exact test here.
“As pointed out by Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the cells of the table.” (wikipedia)
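To see the hypergeometric connection, here is a sketch on a small illustrative 2×2 table (hypothetical counts, not the penguin data): conditionally on the margins, one cell follows a hypergeometric law, and the two-sided Fisher p-value is the total probability of all tables with those margins that are no more likely than the observed one.

```r
# Illustrative 2x2 table (hypothetical counts)
m <- matrix(c(3, 1,
              1, 3), nrow = 2, byrow = TRUE)
p_fisher <- fisher.test(m)$p.value
# With fixed margins, the top-left cell follows a hypergeometric law;
# here it can take the values 0..4
probs <- dhyper(0:4, m = 4, n = 4, k = 4)
# Sum the probabilities of tables no more likely than the observed one
p_hand <- sum(probs[probs <= dhyper(3, 4, 4, 4) * (1 + 1e-7)])
c(p_fisher, p_hand) # both equal 34/70
```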
We can run this test with:
fisher.test(new_data$year, new_data$species)
Fisher's Exact Test for Count Data
data: new_data$year and new_data$species
p-value = 0.08728
alternative hypothesis: two.sided
What can you conclude (at a risk \(\alpha=10\%\))? Compare the result with the \(\chi^2\) independence test.
Solution
Here the \(p\)-value equals 0.0873, which is lower than \(\alpha=10\%\), so we reject the null hypothesis of independence between the year and the species.
With the \(\chi^2\) independence test, the \(p\)-value equals 0.1205 and we would not have rejected the null hypothesis of independence between the year and the species at a level \(\alpha=10\%\).
Nonparametric tests (optional)
Nonparametric tests, like rank tests, belong to a branch of statistics that is not based solely on parametrized families of probability distributions. For example, rank tests rely only on the ranks of the observations.
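To illustrate the "ranks only" point, a small sketch (illustrative exponential samples, not the penguin data): a rank test is unchanged by any strictly increasing transformation of the data, while a t-test is not.

```r
set.seed(3) # arbitrary seed
x <- rexp(30)             # illustrative positive-valued samples
y <- rexp(30, rate = 0.5)
# log() is strictly increasing, so it leaves the ranks untouched
p_raw <- wilcox.test(x, y)$p.value
p_log <- wilcox.test(log(x), log(y))$p.value
p_raw == p_log # TRUE: the rank test only sees the ranks
# whereas t.test(x, y) and t.test(log(x), log(y)) give different p-values
```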
Shapiro–Wilk test
The Shapiro–Wilk test is a non-parametric test of the normality of a distribution. You can check the Wikipedia page of the test for the formula of the statistic.
shapiro.test(gentoo_weight)
Shapiro-Wilk normality test
data: gentoo_weight
W = 0.98606, p-value = 0.2605
Was our choice of using the \(t\)-test the right one?
Solution
We don’t have enough evidence to reject \(H_0\) (normality) with a threshold of \(\alpha = 0.05\), so the \(t\)-test was a good choice.
Kolmogorov–Smirnov test
The Kolmogorov–Smirnov test is also a non-parametric test; it tests whether two samples are drawn from the same distribution. Intuitively, the statistic is the largest absolute difference between the two empirical cumulative distribution functions across all values.
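As a sketch, the two-sample statistic can be recomputed by hand with ecdf() (illustrative normal samples, not the penguin data): evaluate both empirical CDFs at every observed point and take the largest absolute gap.

```r
set.seed(1) # arbitrary seed
x <- rnorm(100)
y <- rnorm(100, mean = 0.5)
# The largest gap is attained at one of the observed points,
# so it is enough to evaluate both ecdfs there
pooled <- c(x, y)
d_hand <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))
d_ks <- unname(ks.test(x, y)$statistic)
all.equal(d_hand, d_ks) # TRUE
```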
Why does the following command return a p-value < 2.2e-16?
ks.test(gentoo_weight, rnorm(length(gentoo_weight)))
Asymptotic two-sample Kolmogorov-Smirnov test
data: gentoo_weight and rnorm(length(gentoo_weight))
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
Solution
The two empirical CDFs are extremely different
plot_2_ecdf(
x = gentoo_weight,
y = rnorm(length(gentoo_weight))
)
We need to center and scale our gentoo_weight.
plot_2_ecdf(
x = scale(gentoo_weight),
y = rnorm(length(gentoo_weight))
)
ks.test(scale(gentoo_weight), rnorm(length(gentoo_weight)))
Asymptotic two-sample Kolmogorov-Smirnov test
data: scale(gentoo_weight) and rnorm(length(gentoo_weight))
D = 0.07563, p-value = 0.8855
alternative hypothesis: two-sided
Why use rnorm() when we know the CDF of the \(\mathcal{N}(0,1)\) distribution?
ks.test(scale(gentoo_weight), pnorm)
Asymptotic one-sample Kolmogorov-Smirnov test
data: scale(gentoo_weight)
D = 0.069124, p-value = 0.6204
alternative hypothesis: two-sided
Wilcoxon test
The two-sample paired Wilcoxon test (signed rank test) is a non-parametric equivalent of the two-sample paired t-test; it compares medians (instead of means) of the two groups.
wilcox.test(
  adelie_bill_depth,
  adelie_bill_length,
  paired = T
)
Wilcoxon signed rank test with continuity correction
data: adelie_bill_depth and adelie_bill_length
V = 0, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
Compare the results of this test to the Two samples Student test
Solution
t.test(
  adelie_bill_depth,
  adelie_bill_length,
  paired = T
)
Paired t-test
data: adelie_bill_depth and adelie_bill_length
t = -100.42, df = 145, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-20.87975 -20.07368
sample estimates:
mean difference
-20.47671
Wilcoxon-Mann-Whitney test
The two-sample Wilcoxon-Mann-Whitney test (WMW test) is a non-parametric equivalent of the two-sample unpaired Student test; it compares medians (instead of means) of the two groups.
wilcox.test(adelie_weight, chinstrap_weight)
Wilcoxon rank sum test with continuity correction
data: adelie_weight and chinstrap_weight
W = 4710, p-value = 0.5476
alternative hypothesis: true location shift is not equal to 0
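The W statistic reported by wilcox.test() has a direct interpretation: with no ties, it counts the pairs \((x_i, y_j)\) with \(x_i > y_j\). A sketch on illustrative normal samples (hypothetical data, not the penguin measurements):

```r
set.seed(2) # arbitrary seed
x <- rnorm(20, mean = 1)
y <- rnorm(25)
# Count the pairs where x beats y (no ties with continuous data)
w_hand <- sum(outer(x, y, ">"))
w_r <- unname(wilcox.test(x, y)$statistic)
w_hand == w_r # TRUE
```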
Compare the results of this test to the Two samples Student test
Solution
t.test(adelie_weight, chinstrap_weight)
Welch Two Sample t-test
data: adelie_weight and chinstrap_weight
t = -0.44793, df = 154.03, p-value = 0.6548
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-145.66494 91.81724
sample estimates:
mean of x mean of y
3706.164 3733.088
If the data follow the distribution assumption of the parametric test, the parametric test is generally more powerful than the nonparametric one.
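This power difference can be checked by simulation (a sketch under normality; the sample sizes, effect size, and number of replicates are arbitrary choices):

```r
set.seed(4) # arbitrary seed
reps <- 500
rej <- replicate(reps, {
  x <- rnorm(20)             # both groups normal: the t-test assumption holds
  y <- rnorm(20, mean = 0.8) # true mean shift of 0.8
  c(t = t.test(x, y)$p.value < 0.05,
    w = wilcox.test(x, y)$p.value < 0.05)
})
rowMeans(rej) # rejection rates (power); the t-test's is typically slightly higher
```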
Kruskal–Wallis test
The Kruskal–Wallis test extends the Wilcoxon-Mann-Whitney test to more than 2 groups. If there is at least one group following a different distribution than the others, the \(p\)-value will be small.
kruskal.test(list(gentoo_weight, adelie_weight, chinstrap_weight))
Kruskal-Wallis rank sum test
data: list(gentoo_weight, adelie_weight, chinstrap_weight)
Kruskal-Wallis chi-squared = 212.09, df = 2, p-value < 2.2e-16