Tutorial 3: Statistical testing

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This practical is heavily inspired by the work of Yury Zablotski and Benjamin Yakir.

Introduction

Hypothesis testing tries to answer the question: Is there a phenomenon at all? The basic approach is to determine whether the observed data can or cannot be reasonably explained by a model of randomness that does not involve the phenomenon.

After the statistical model has been set, one may split the process of testing a statistical hypothesis into three steps:

  1. formulation of the hypotheses,
  • \(H_0\): the null hypothesis, the range of parameter values where the phenomenon is absent
  • \(H_1\): the alternative hypothesis
  2. specification of the test,
  • The decision rule is composed of a statistic and a subset of values of the statistic that correspond to the rejection of the null hypothesis
  • the statistic is called the test statistic
  • the subset of values is called the rejection region.
  3. reaching the final conclusion.
  • Computing the observed value of the test statistic and checking whether or not the observed value belongs to the rejection region.

When we are building a statistical test, we make a trade-off between two types of error:

  • Type I error: rejecting a correct null hypothesis
  • Type II error: not rejecting an incorrect null hypothesis

The test’s decision rule is designed so as to assure an acceptable probability of making a Type I error. Reducing the probability of a Type II error is desirable, but is of secondary importance.

The statistical power is the probability of rejecting the null hypothesis when the state of nature is the alternative hypothesis.

\[\text{Statistical Power} = 1 − \text{Probability of Type II Error}\]

The \(p\)-value is a function of the data: it is equal to the significance level of the test in which the observed value of the statistic serves as the rejection threshold. This allows us to control the Type I error.

\[p\text{-value} = P(|T| > |t|)\]

A statistical power of \(80\%\) (though higher is certainly better) and a significance level of \(5\%\) are often considered to be acceptable. This means:

  1. an \(80\%\) chance of detecting a true effect (true positive)
  2. a \(20\%\) chance of missing a true effect (false negative)
  3. a \(95\%\) chance of detecting the absence of an effect (true negative)
  4. a \(5\%\) chance of detecting an effect which isn’t there (false positive).
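You can check the last of these rates by simulation. A minimal sketch, assuming a binomial experiment of size 100 analyzed with binom.test(): under \(H_0\), the test should wrongly reject in at most about \(5\%\) of the replicates.

set.seed(42)
p_values <- replicate(10000, {
    x <- rbinom(1, size = 100, prob = 0.5) # data simulated under H_0
    binom.test(x, n = 100, p = 0.5)$p.value
})
mean(p_values < 0.05) # Type I error rate, at most about 0.05 (the test is discrete)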

In this practical you will practice all these notions.

Packages

For this practical you will need the following packages:

library(tidyverse) # to manipulate data and make plots
library(palmerpenguins) # to load the penguins dataset
library(pwr) # for power analysis

Some custom functions are available in this file. You can load them in R with the following command:

source("https://gitbio.ens-lyon.fr/LBMC/hub/formations/ens_l3_stats/-/raw/main/tutorial_3_tests/utils.R?inline=false")

Loading the data

We are going to work on the famous Palmer penguins dataset. This dataset comes from an integrative study of the breeding ecology and population structure of Pygoscelis penguins along the western Antarctic Peninsula. These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network.

The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

The palmerpenguins library loads the penguins dataset into your R environment. If you are not familiar with tibbles, you just have to know that they behave like data.frames.

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

You can use the function summary() to get relevant summary statistics for each variable (column):

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 

What are the continuous and categorical variables in this table? What is going to be the problem with the raw penguins table?

Solution

We are first going to drop the missing values (NA):

data <- penguins %>% drop_na()
dim(data) # display the number of rows and columns
[1] 333   8
If you are not familiar with the %>% operator or pipe in R: it takes the output of the expression on its left and passes it as the first argument of the function on its right.
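For example, the two following commands are equivalent:

drop_na(penguins)
penguins %>% drop_na()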

Empirical \(p\)-value

What is the observed sex ratio in the Gentoo penguins, regardless of their location?

data %>% 
    dplyr::filter(species == "Gentoo") %>% 
    summarize(
        n_female = sum(sex == "female"),
        size = n(),
        ratio = n_female / size,
    )
# A tibble: 1 × 3
  n_female  size ratio
     <int> <int> <dbl>
1       58   119 0.487

We wonder if the sex ratio of the Gentoo penguins, regardless of their location, is different from \(0.5\).

How would you formulate this question in terms of a null and an alternative hypothesis? Which parameter(s) of the model are you going to study?

Solution

  • \(H_0\): The sex ratio in the Gentoo penguins is \(p = 0.5\)
  • \(H_1\): The sex ratio in the Gentoo penguins is \(p \neq 0.5\)

Is this a two-sided or one-sided hypothesis test?

Solution

A two-sided hypothesis test: we can reject \(H_0\) with \(p < 0.5\) or \(p > 0.5\), i.e. \(|p - 0.5| > 0\).

How would you simulate this sample (with which probability distribution) under \(H_0\) (with which parameter values)?

Solution

With the binomial distribution (rbinom) with prob=0.5 and size=119:

rbinom(1, size=119, prob=0.5)
[1] 56

Compute an empirical sex ratio distribution with this tool: generate \(1000\) values under the null hypothesis.

Solution

empirical_h0 <- tibble(
        n_female = rbinom(1000, size=119, prob=0.5)
    ) %>% 
    mutate(
        ratio = n_female / 119
    )

With this simulation, how could you compute the probability of observing a sample with a sex ratio as low as \(0.487395\) under \(H_0\)?

Solution

We can compute the proportion of simulations where empirical_h0$ratio is less than or equal to \(0.487395\):

empirical_h0 <- empirical_h0 %>% 
    mutate(
        lower = ratio < 0.487395
    )
mean(empirical_h0$lower)
[1] 0.406
empirical_h0 %>% 
    ggplot(aes(x = ratio)) +
    stat_ecdf(geom = "step") +
    geom_vline(xintercept = 0.487395, col = "green") +
    geom_hline(yintercept = mean(empirical_h0$lower), col = "red") +
    theme_bw()

Are we missing something to quantify our confidence in the null hypothesis?

Solution

We have a two-sided test: we also need to take into account the probability of the ratio being as far above \(0.5\) as the observed value is below it.

empirical_h0 <- empirical_h0 %>% 
    mutate(
        lower = ratio < 0.487395,
        upper = ratio > 1 - 0.487395
    )
mean(empirical_h0$lower | empirical_h0$upper)
[1] 0.833
This value is called the empirical \(p\)-value associated with the rejection of \(H_0\): under \(H_0\), a deviation at least as extreme as the observed one occurs with a probability greater than \(0.8\), so we have no reason to reject \(H_0\).

How would you find the \(p\)-value (and not the empirical \(p\)-value) associated with this test?

Solution

2*pbinom(58, size = 119, prob=0.5)
[1] 0.85463

or

pbinom(58, size=119, prob=0.5) + pbinom(119 - 58 - 1, size=119, prob=0.5, lower.tail = F)
[1] 0.85463

(you can check ?pbinom to understand the 119 - 58 - 1)

What can you do to make your empirical \(p\)-value closer to the \(p\)-value?

Solution

Increase the number of simulations.
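A minimal sketch of this convergence, reusing the two-sided rejection rule above: as the number of simulations grows, the empirical \(p\)-value approaches the exact value of \(0.85463\).

empirical_p <- function(n_sim) {
    ratio <- rbinom(n_sim, size = 119, prob = 0.5) / 119
    mean(ratio < 0.487395 | ratio > 1 - 0.487395) # two-sided empirical p-value
}
sapply(c(1e3, 1e5, 1e7), empirical_p) # approaches 0.85463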

Sensitivity and specificity of a test

Sensitivity and specificity mathematically describe the accuracy of a test which reports the presence or absence of a condition. Individuals for which the condition is satisfied are considered positive, and those for which it is not are considered negative.

  • Sensitivity (true positive rate) refers to the probability of a positive test, conditioned on truly being positive.
  • Specificity (true negative rate) refers to the probability of a negative test, conditioned on truly being negative.

\[\text{sensitivity} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false negatives}}\] \[\text{specificity} = \frac{\text{number of true negatives}}{\text{number of true negatives} + \text{number of false positives}}\]
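A minimal sketch of these two formulas as R functions (the confusion counts used in the example calls are hypothetical, not taken from the penguins data):

sensitivity <- function(TP, FN) TP / (TP + FN) # true positive rate
specificity <- function(TN, FP) TN / (TN + FP) # true negative rate
sensitivity(TP = 80, FN = 20) # 0.8
specificity(TN = 95, FP = 5) # 0.95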

You can use the following function to generate binomial samples for which we know whether they follow \(H_0\) or \(H_1\):

simulation <- sim_binom(
    replicate = 1000, # number of simulations
    size = 119, # size of each simulated sample
    p_h0 = 0.5, # p under H_0
    p_h1 = 0.6, # p under H_1
    threshold = seq(from = 0.001, to = 0.999, by = 0.001) # alpha values to test
)
simulation
# A tibble: 999,000 × 9
   truth sample    statistic alpha rejection TP    FP    FN    TN   
   <lgl> <list>        <dbl> <dbl> <lgl>     <lgl> <lgl> <lgl> <lgl>
 1 TRUE  <int [1]>    0.0437 0.001 FALSE     FALSE FALSE TRUE  FALSE
 2 TRUE  <int [1]>    0.0437 0.002 FALSE     FALSE FALSE TRUE  FALSE
 3 TRUE  <int [1]>    0.0437 0.003 FALSE     FALSE FALSE TRUE  FALSE
 4 TRUE  <int [1]>    0.0437 0.004 FALSE     FALSE FALSE TRUE  FALSE
 5 TRUE  <int [1]>    0.0437 0.005 FALSE     FALSE FALSE TRUE  FALSE
 6 TRUE  <int [1]>    0.0437 0.006 FALSE     FALSE FALSE TRUE  FALSE
 7 TRUE  <int [1]>    0.0437 0.007 FALSE     FALSE FALSE TRUE  FALSE
 8 TRUE  <int [1]>    0.0437 0.008 FALSE     FALSE FALSE TRUE  FALSE
 9 TRUE  <int [1]>    0.0437 0.009 FALSE     FALSE FALSE TRUE  FALSE
10 TRUE  <int [1]>    0.0437 0.01  FALSE     FALSE FALSE TRUE  FALSE
# ℹ 998,990 more rows

You can use the following function to compute the sensitivity and specificity depending on the risk \(\alpha\).

sens_spec <- compute_sens_spec(simulation)

You can then plot the sensitivity and specificity for your test as a function of the risk \(\alpha\).

sens_spec %>% 
    ggplot()  +
    geom_line(aes(x = alpha, y = sensitivity)) + 
    xlim(c(0,1)) + ylim(c(0,1)) + coord_fixed() +
    theme_bw()

sens_spec %>% 
    ggplot()  +
    geom_line(aes(x = alpha, y = specificity)) +
    xlim(c(0,1)) + ylim(c(0,1)) + coord_fixed() +
    theme_bw()

The ROC (receiver operating characteristic) curve plots the sensitivity (True Positive Rate) against the False Positive Rate (FPR).

sens_spec %>% 
    ggplot()  +
    geom_line(aes(x = FPR, y = sensitivity)) +
    xlim(c(0,1)) + ylim(c(0,1)) + coord_fixed() +
    theme_bw()

What shape of ROC curve do you expect if the conclusion of your test were the result of flipping a coin?

Solution

The ROC curve would follow the diagonal of the plot.

One sample Student test

For the emperor penguin, males and females are similar in plumage and size, reaching 100 cm in length and weighing from 22 to 45 kg (\(\mu_0 = 33500\) g) (source: Wikipedia).

We can ask ourselves whether the Gentoo penguins, the heaviest of the 3 species, could be mistaken for emperor penguins based on their weight.

data %>% 
    group_by(species) %>% 
    summarise(
        mean(body_mass_g)
    )
# A tibble: 3 × 2
  species   `mean(body_mass_g)`
  <fct>                   <dbl>
1 Adelie                  3706.
2 Chinstrap               3733.
3 Gentoo                  5092.

How would you formulate this question in terms of a null and an alternative hypothesis?

Solution

  • \(H_0\): The average weight of the Gentoo penguins is equal to the average weight of the emperor penguins \(E(X) = \mu_0\)
  • \(H_1\): The average weight of the Gentoo penguins is different from the average weight of the emperor penguins \(E(X) \neq \mu_0\)

where \(X\) is the random variable giving the weight of a Gentoo penguin.

Or equivalently:

  • \(H_0\): \(\mu = \mu_0\)
  • \(H_1\): \(\mu \neq \mu_0\)

when noting \(\mu\) the Gentoo penguins population average weight.

Is this a two-sided or one-sided hypothesis test? How many variables are in this model?

Solution

A two-sided hypothesis test: we can reject \(H_0\) with \(E(X) < \mu_0\) or \(E(X) > \mu_0\), i.e. \(|E(X) - \mu_0| > 0\).

Can we consider the weight distribution of the Gentoo penguins to follow a Normal distribution?

data %>% 
    filter(species == "Gentoo") %>% 
    ggplot() +
    geom_histogram(aes(x = body_mass_g), binwidth = 100) +
    theme_bw()

Solution

We can use a QQ-plot

data %>% 
    filter(species == "Gentoo") %>% 
    ggplot(aes(sample=body_mass_g)) + 
    stat_qq() +
    stat_qq_line()

It’s not perfect but we will continue with the Normal distribution hypothesis.

We will see later how to quantitatively answer this question.

Which test statistic would you propose to challenge the null hypothesis?

Solution

We are going to study the difference between the estimated mean and the hypothesized population mean.

To take into account the sampling variability, we need to scale this statistic by the standard error.

\(t\)-value

The \(t\)-value measures the difference between the compared means, scaled by the standard error.

\[t\text{-value} = \frac{\text{mean} - \text{expected mean}}{\text{standard error}}\] with \(\text{standard error} = \frac{S(X)}{\sqrt{n}}\), where \(S(X)\) is the sample standard deviation and \(n\) the sample size.

What is the distribution of the numerator and of the denominator?

Solution

Under the Gaussian assumption on the data:

\[\widehat{\mu}(X)-\mu_0 \underset{H_0}{\sim} \mathcal{N}(0, \sigma^2/n)\] \[(n-1)\,S^2(X) \sim \sigma^2\chi^2(n-1)\]

The \(t\)-value then follows a known distribution: the Student distribution with \(n-1\) degrees of freedom.
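You can check this distributional claim by simulation; here is a minimal sketch, assuming Gaussian data generated under \(H_0\) (the mean and standard deviation below are arbitrary choices):

set.seed(1)
n <- 119
t_sim <- replicate(10000, {
    x <- rnorm(n, mean = 33500, sd = 500) # one sample drawn under H_0
    (mean(x) - 33500) / (sd(x) / sqrt(n)) # its t-value
})
quantile(t_sim, c(0.025, 0.975)) # empirical quantiles of the t-values
qt(c(0.025, 0.975), df = n - 1) # close to the Student(n - 1) quantiles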

You can use the t.test() command to test \(H_0\).

gentoo_weight <- data %>% filter(species == "Gentoo") %>% 
    pull(body_mass_g)

t.test(gentoo_weight, mu = 33500)

    One Sample t-test

data:  gentoo_weight
t = -617.96, df = 118, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 33500
95 percent confidence interval:
 5001.403 5183.471
sample estimates:
mean of x 
 5092.437 

The output of the t.test() function gives you:

  • the \(t\)-value: \(-617.96\)
  • the degrees of freedom: \(118\)
  • the \(p\)-value: \(< 2.2\times 10^{-16}\)
  • \(\widehat{\mu}(X)\): \(5092.437\)
  • a \(0.95\) confidence interval on \(\widehat{\mu}(X)\): \([5001.403, 5183.471]\)

Do you understand all these outputs?

We can decompose this computation. For the \(t\)-value:

t_stat <- (mean(gentoo_weight) - 33500) / (sd(gentoo_weight) / sqrt(length(gentoo_weight)))
t_stat
[1] -617.9555

and for the \(p\)-value (two different ways to compute it from the cumulative distribution function in the two-sided case):

pt(- abs(t_stat), length(gentoo_weight) - 1) * 2
[1] 5.825327e-209
2 * (1 - pt(abs(t_stat), length(gentoo_weight) - 1))
[1] 0

Depending on your computer, the t.test() \(p\)-value and the pt() ones may not be equal. To check our formula, test the following null hypothesis:

\(H_0\) : \(E(X) = 5000\)

Solution

For the t.test :

t.test(gentoo_weight, mu = 5000)$p.value
[1] 0.04662625

for the pt:

t_stat <- (mean(gentoo_weight) - 5000) / (sd(gentoo_weight) / sqrt(length(gentoo_weight)))
pt(-abs(t_stat), length(gentoo_weight) - 1) * 2
[1] 0.04662625
2 * (1 - pt(abs(t_stat), length(gentoo_weight) - 1))
[1] 0.04662625

The two \(p\)-values are now exactly the same. Be mindful of the numerical limitations of your computer when working with tiny numbers.

The following function plots the density of the \(T\) statistic, the red area corresponding to the \(p\)-value for a given \(\mu\) and sample \(x\), with the observed \(t\)-value (and its opposite) as vertical bars, in the two-sided case.

plot_student(x = gentoo_weight, mu = 33500)

plot_student(x = gentoo_weight, mu = 5000)

With a threshold of \(0.05\), would you take the risk of rejecting \(H_0\): “the average weight of Gentoo penguins is equal to the average weight of the emperor penguins”?

Solution

Yes, except maybe for a baby emperor penguin.

Two samples Student test

Instead of comparing our Gentoo penguins to a reference, we now want to compare the weights of Adelie and Chinstrap penguins.

data %>% 
    group_by(species) %>% 
    summarise(
        mean(body_mass_g)
    )
# A tibble: 3 × 2
  species   `mean(body_mass_g)`
  <fct>                   <dbl>
1 Adelie                  3706.
2 Chinstrap               3733.
3 Gentoo                  5092.

Based on the observed means, which hypothesis could you want to test?

Solution

We note \(\hat\mu_a\) and \(\hat\mu_c\) the observed average weights respectively for the Adelie and Chinstrap penguins, and \(\mu_a\) and \(\mu_c\) the (unknown) population average weights respectively for the Adelie and Chinstrap penguins:

Either the two-sided test:

  • \(H_0\): \(\mu_a = \mu_c\)
  • \(H_1\): \(\mu_a \neq \mu_c\)

Or one of the one-sided tests:

  • \(H_0\): \(\mu_a \geq \mu_c\)
  • \(H_1\): \(\mu_a < \mu_c\)

or

  • \(H_0\): \(\mu_a \leq \mu_c\)
  • \(H_1\): \(\mu_a > \mu_c\)

We are going to extract the sampled penguin weights per species.

adelie_weight <- data %>% filter(species == "Adelie") %>% 
    pull(body_mass_g)
chinstrap_weight <- data %>% filter(species == "Chinstrap") %>% 
    pull(body_mass_g)

Like for the one sample Student test, we are going to use the \(t\)-statistic. However, there are two versions of this statistic for the two sample case:

  • A statistic for the case where \(\sigma_a^2 = \sigma_c^2\), with \(\sigma_a^2\) the Adelie penguins weight variance and \(\sigma_c^2\) the Chinstrap penguins weight variance
  • A statistic for the case where \(\sigma_a^2 \neq \sigma_c^2\)
data %>% 
    group_by(species) %>% 
    summarise(
        mean(body_mass_g),
        var(body_mass_g),
        sd(body_mass_g),
    )
# A tibble: 3 × 4
  species   `mean(body_mass_g)` `var(body_mass_g)` `sd(body_mass_g)`
  <fct>                   <dbl>              <dbl>             <dbl>
1 Adelie                  3706.            210332.              459.
2 Chinstrap               3733.            147713.              384.
3 Gentoo                  5092.            251478.              501.

We will see later how to test \(H_0\): \(\sigma_a^2 = \sigma_c^2\).

In the meantime, we will work under the hypothesis that the variances are not equal between the two species, which is the default for the t.test() function.

Two-sided hypothesis

t.test(adelie_weight, chinstrap_weight)

    Welch Two Sample t-test

data:  adelie_weight and chinstrap_weight
t = -0.44793, df = 154.03, p-value = 0.6548
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -145.66494   91.81724
sample estimates:
mean of x mean of y 
 3706.164  3733.088 

What are the differences compared to the one sample Student test output?

Solution

We have two sample estimates of the mean (one per group), and the confidence interval now concerns the difference between the two means.

What can you conclude for \(H_0\): \(\mu_a = \mu_c\)?

The following plot displays the observed \(t\)-value as a vertical black line and the rejection area for a given \(\alpha\) in red. You can play with the alpha parameter.

plot_student_2_alpha_twosided(adelie_weight, chinstrap_weight, alpha = 0.05)

Solution

We don’t have enough evidence to reject \(H_0\).

One-sided hypothesis

We now want to test if the Adelie penguins are lighter than the Chinstrap penguins.

What is our null hypothesis?

Solution

The null hypothesis corresponds to the absence of an effect. Here, it would mean that the Adelie penguins are heavier than or of equal weight to the Chinstrap penguins.

  • \(H_0\): \(\mu_a \geq \mu_c\), with \(\mu_a\) the average weight of the Adelie penguins and \(\mu_c\) the average weight of the Chinstrap penguins
  • \(H_1\): \(\mu_a < \mu_c\)

We can use the alternative = c("two.sided", "less", "greater") option.

The default is "two.sided". Which option between "less" and "greater" should you use?

Solution

You test the difference between the means:

\(H_0\): \(\mu_a \geq \mu_c\) is equivalent to

\(H_0\): \(\mu_a - \mu_c \geq 0\)

Therefore the alternative is \(\mu_a - \mu_c < 0\) and we should use "less"

t.test(adelie_weight, chinstrap_weight, alternative = "less")

    Welch Two Sample t-test

data:  adelie_weight and chinstrap_weight
t = -0.44793, df = 154.03, p-value = 0.3274
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 72.54212
sample estimates:
mean of x mean of y 
 3706.164  3733.088 

Why is the \(p\)-value different from the one computed for the two-sided test?

Solution

We only consider the probability of the true difference in means being less than zero, ignoring the greater-than-zero side.

For these 3 plots, the red area under the curve is the same; you can play with the alpha parameter.

plot_student_2_alpha_twosided(adelie_weight, chinstrap_weight, alpha=0.05)

plot_student_2_alpha_less(adelie_weight, chinstrap_weight, alpha=0.05)

plot_student_2_alpha_greater(adelie_weight, chinstrap_weight, alpha=0.05)

Homogeneous variance

We made the assumption that \(\sigma_a^2 \neq \sigma_c^2\). What happens to the two-sided test if \(\sigma_a^2 = \sigma_c^2\)?

Solution

For \(\sigma_a^2 \neq \sigma_c^2\) we have

\[T(X) = \frac{\overline{X_a} -\overline{X_c} }{\sqrt{S^2(X_a)/n_a+S^2(X_c)/n_c}}\]

For \(\sigma_a^2 = \sigma_c^2\) we have

\[T(X) = \frac{\overline{X_a} -\overline{X_c} }{S(X_{ac}) \sqrt{ 1/n_a + 1/n_c}}\]
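Here is a sketch of this second statistic computed by hand, assuming the pooled variance \(S^2(X_{ac})\) is the usual weighted average of the two sample variances; it should match the var.equal = T output shown below.

n_a <- length(adelie_weight)
n_c <- length(chinstrap_weight)
s2_pooled <- ((n_a - 1) * var(adelie_weight) + (n_c - 1) * var(chinstrap_weight)) /
    (n_a + n_c - 2) # pooled variance estimate
(mean(adelie_weight) - mean(chinstrap_weight)) /
    (sqrt(s2_pooled) * sqrt(1 / n_a + 1 / n_c)) # t-value with n_a + n_c - 2 df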

How will the \(p\)-value change with the homogeneous variance hypothesis?

You can use the option var.equal = T of the t.test() function to explore this question.

Solution

t.test(adelie_weight, chinstrap_weight)

    Welch Two Sample t-test

data:  adelie_weight and chinstrap_weight
t = -0.44793, df = 154.03, p-value = 0.6548
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -145.66494   91.81724
sample estimates:
mean of x mean of y 
 3706.164  3733.088 
t.test(adelie_weight, chinstrap_weight, var.equal = T)

    Two Sample t-test

data:  adelie_weight and chinstrap_weight
t = -0.42011, df = 212, p-value = 0.6748
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -153.2538   99.4061
sample estimates:
mean of x mean of y 
 3706.164  3733.088 

Note the difference in degrees of freedom between the two tests.

You can use the following function to see the effect of the degrees of freedom on the Student distribution shape. Play with the df parameter.

plot_student_df(df = 3, alpha = 0.05)

Power analysis

The statistical power is the probability of rejecting the null hypothesis when the state of nature is the alternative hypothesis.

\[\text{Statistical Power} = 1 − \text{Probability of Type II Error}\]

For statistical testing, the power of a test depends on 4 criteria:

  1. the sample size: a larger sample size makes it easier to detect the effect and so increases the statistical power of the experiment
  2. the significance level: increasing the threshold \(\alpha\) increases the power of the test and the chance of getting a statistically significant result (and decreases the number of observations needed), at the cost of more false positives
  3. the effect size: the formula describing the effect size differs between tests. The larger the effect size, the higher the power to detect it
  4. the statistical power: the probability of finding a real effect (true positive). Your goal is to maximize the statistical power of the experiment while minimizing the sample size.

Knowing any three of them allows you to estimate the fourth.

You can use a power analysis before an experiment to calculate the optimal sample size for a specific power. You can also run a power analysis after an experiment or sampling to determine the power of your experiment, which helps assess the plausibility of the results.

library(pwr)

The pwr package uses Cohen's effect size definition as a parameter. For the two-sample Student test, the effect size formula is the following:

\[\text{effect size} = \frac{E(X_a) - E(X_b)}{\sqrt{\frac{S^2(X_a) + S^2(X_b)}{2}}}\]
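A sketch of this formula computed by hand for our two samples (the custom effect_size_t() function used below presumably implements something similar):

(mean(adelie_weight) - mean(chinstrap_weight)) /
    sqrt((var(adelie_weight) + var(chinstrap_weight)) / 2)
# about -0.064; compare with d = 0.06363307 (same magnitude) in the pwr output below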

What happens to the effect size when the pooled standard deviation increases?

Solution

The effect size decreases: the more variability in the data, the larger the difference between the means must be to be detected.

Power

You can use the pwr.t2n.test() function to study the relation between :

  1. the sample size: n1 and n2
  2. the significance level: sig.level
  3. the effect size: d
  4. the statistical power: power

Setting one of these parameters to NULL tells the function to compute it from the 3 others.

pwr.t2n.test(
    n1 = length(adelie_weight),
    n2 = length(chinstrap_weight),
    d = effect_size_t(adelie_weight, chinstrap_weight),
    sig.level = 0.05,
    power = NULL
)

     t test power calculation 

             n1 = 146
             n2 = 68
              d = 0.06363307
      sig.level = 0.05
          power = 0.07158508
    alternative = two.sided

What do you think of the power of our previous analysis?

Solution

\(0.07158508\) doesn’t seem like a lot of power.

\[\text{Statistical Power} = 1 − \text{Probability of Type II Error}\]

It corresponds to a \(0.9284149\) probability of Type II error, the error we make when we fail to reject an incorrect null hypothesis.

How many penguins would have been needed to detect a weight difference between the two populations? For this, you have to choose a good level of power and use the pwr.t.test() function, which has a single n parameter for groups of equal size.

Solution

pwr.t.test(
    n = NULL,
    d = effect_size_t(adelie_weight, chinstrap_weight),
    sig.level = 0.05,
    power = 0.8
)

     Two-sample t test power calculation 

              n = 3877.738
              d = 0.06363307
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
We need at least 3878 penguins in each group.

Effect size

When we plan an experiment, we often don’t know the exact power nor the effect size beforehand. A power analysis can help you determine the size of your sample.

Play with the parameters of the following functions to see their impact on the sample size required to detect the effect. The dotted line corresponds to a power of \(0.5\).

plot_effect_size(
    effect_sizes = seq(from = 0.1, to = 1.5, length.out = 20),
    power = 0.8,
    sig_level = 0.05
)

plot_effect_size_power(
    effect_sizes = seq(from = 0.3, to = 1.5, length.out = 20),
    powers = seq(from = 0.5, to = 0.95, by = 0.1),
    sig_level = 0.05
)

Paired data for two sample student tests

When data are paired, we can use another version of the two samples Student test. Two samples are paired if values in both groups are somehow connected. Thus, for the paired \(t\)-test, the two sample sizes must be equal (they can differ in the unpaired case that we saw previously).

For example, in the penguins data, the bill length and the bill depth are paired: they belong to the same individual.

adelie_bill_length <- data %>%
    dplyr::filter(species == "Adelie") %>% 
    pull(bill_length_mm)
adelie_bill_depth <- data %>%
    dplyr::filter(species == "Adelie") %>% 
    pull(bill_depth_mm)

Since the samples are connected and we want to know the difference between them, we need to calculate this difference by simply subtracting one value from the other.
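A minimal sketch of this equivalence: the paired test shown below should give the same \(t\)-value, degrees of freedom and \(p\)-value as a one-sample \(t\)-test on the differences.

t.test(adelie_bill_depth - adelie_bill_length, mu = 0)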

Compare the effect of the paired = T option between the paired and unpaired Student tests:

t.test(adelie_bill_depth, adelie_bill_length, paired = T)

    Paired t-test

data:  adelie_bill_depth and adelie_bill_length
t = -100.42, df = 145, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -20.87975 -20.07368
sample estimates:
mean difference 
      -20.47671 
t.test(adelie_bill_depth, adelie_bill_length, paired = F)

    Welch Two Sample t-test

data:  adelie_bill_depth and adelie_bill_length
t = -84.487, df = 203.26, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -20.95459 -19.99884
sample estimates:
mean of x mean of y 
 18.34726  38.82397 
Solution

We can see that even if the \(p\)-values are very small for the two tests, the paired test has a larger statistic (in absolute value). The sample estimate is now the mean difference instead of mean of x and mean of y.

For the same data, taking into account the nature of the data (paired) increases the power of the test.

The same is true for the var.equal = T parameter of t.test().

Fisher variance test

In the “Student test” section, we encountered the following question: Are the variances of the weight of the Adelie and the Chinstrap penguins equal?

You have seen that, under the Gaussian assumption, the variance estimators follow \(\chi^2\) distributions.

\[(n_a-1)\,S_a^2(X_a) \sim \sigma_a^2\chi^2(n_a-1), \qquad (n_c-1)\,S_c^2(X_c) \sim \sigma_c^2\chi^2(n_c-1).\] And the ratio of two independent \(\chi^2\) variables, each divided by its degrees of freedom, follows a Fisher distribution. Under \(H_0\) (\(\sigma_a^2 = \sigma_c^2\)):

\[T(X) = \frac{S_a^2(X_a)}{S_c^2(X_c)} \underset{H_0}{\sim} \mathcal{F}(n_a-1,n_c-1)\]

From this statistic, which hypothesis could you want to test?

Solution

The two-sided test:

  • \(H_0\): \(\sigma_a^2 / \sigma_c^2 = 1\)
  • \(H_1\): \(\sigma_a^2 / \sigma_c^2 \neq 1\)

The var.test() function allows us to test this hypothesis.

Explore the output of this function.

Solution

var.test(adelie_weight, chinstrap_weight)

    F test to compare two variances

data:  adelie_weight and chinstrap_weight
F = 1.4239, num df = 145, denom df = 67, p-value = 0.1047
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.9280647 2.1167534
sample estimates:
ratio of variances 
          1.423922 
  • the estimated ratio of variances: \(1.423922\)
  • the \(0.95\) confidence interval: \([0.9280647, 2.1167534]\)
  • the numerator degrees of freedom: \(145\)
  • the denominator degrees of freedom: \(67\)
  • the \(p\)-value: \(0.1047\)
  • the statistic: \(1.4239\)
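A sketch verifying these outputs by hand (assuming that, for the two-sided alternative, var.test() doubles the smaller tail probability):

f_stat <- var(adelie_weight) / var(chinstrap_weight)
f_stat # 1.423922, the ratio of variances
2 * min(
    pf(f_stat, df1 = 145, df2 = 67),
    pf(f_stat, df1 = 145, df2 = 67, lower.tail = FALSE)
) # 0.1047, the two-sided p-value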

You can display the Fisher density function under the null hypothesis with the following function. Play with the alpha parameter.

plot_fisher_var_test_alpha(adelie_weight, chinstrap_weight, alpha=0.05)

\(\chi^2\) (Chi-square)

\(\chi^2\) goodness of fit

With the \(\chi^2\) goodness of fit test, we can find out whether observed counts (frequencies) differ from expected counts (if the data were distributed according to a given probability distribution).

From the island variable we can wonder whether the Adelie penguins are uniformly distributed across the three islands (Biscoe, Dream and Torgersen).

table(data$species, data$island)
           
            Biscoe Dream Torgersen
  Adelie        44    55        47
  Chinstrap      0    68         0
  Gentoo       119     0         0

What is our null hypothesis?

Solution

  • \(H_0\): the Adelie penguin counts (frequencies) are equal across the islands
  • \(H_1\): the Adelie penguin counts (frequencies) differ across the islands

The statistic of this test is the following:

\[ T(N) = \sum_{j=1}^{k}\frac{\left(N_j-np_j\right)^2}{np_j} \underset{n \rightarrow \infty}{\sim} \chi^2(k-1) \]

With:

  • \(j\) the island
  • \(N_j\) the number of penguins on island \(j\)
  • \(p_j\) the expected proportion of penguins on island \(j\)
  • \(n\) the total number of penguins
  • \(k\) the number of islands (here \(k = 3\))
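A minimal sketch of this statistic computed by hand for the Adelie counts, under the \(H_0\) of a uniform distribution across the three islands:

obs <- c(44, 55, 47) # Adelie counts per island
expected <- sum(obs) * c(1/3, 1/3, 1/3) # n * p_j under H_0
sum((obs - expected)^2 / expected) # 1.3288, the X-squared statistic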

We can use the chisq.test() function to perform this test.

chisq.test(x = c(44, 55, 47), p = c(1/3, 1/3, 1/3))

    Chi-squared test for given probabilities

data:  c(44, 55, 47)
X-squared = 1.3288, df = 2, p-value = 0.5146

How do you interpret the \(p\)-value of this test?

Solution

From the data, with a threshold of \(0.05\), we cannot reject the null hypothesis \(H_0\).

You can use the following function to plot the density of the \(\chi^2\) under \(H_0\).

plot_chi2_fit_alpha(x = c(44, 55, 47), p = c(1/3, 1/3, 1/3), alpha=0.05)

Why is the rejection area only on the upper tail of the density function?

Solution

The \(\chi^2\) statistic is a scaled squared distance: deviations in either direction increase it, so we cannot encounter the "less" one-sided hypothesis.

We can compute the \(p\)-value with the following command

1-pchisq(1.3288, 2)
[1] 0.5145822

Increase the size of your sample with the function rep(). What happens to the density of the \(\chi^2\) under \(H_0\)?

Solution

chisq.test(x = rep(c(44, 55, 47), 3), p = rep(1/9, 9))

    Chi-squared test for given probabilities

data:  rep(c(44, 55, 47), 3)
X-squared = 3.9863, df = 8, p-value = 0.8584
plot_chi2_fit_alpha(x = rep(c(44, 55, 47), 3), p = rep(1/9, 9), alpha=0.05)

\(\chi^2\) independence of categorical variables

The independence test checks whether there is a relationship between categorical variables. If the variables are dependent, we can find (predict) the values of one variable knowing only the values of the other.

You can use the pairs() function to display the pairwise relation between continuous variables:

data %>% select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>% pairs()

You are going to learn how to study these relations after practical 5.

Is the pairs() plot informative for the categorical variables (species, island, sex, year)?

Solution

You can make a mosaic plot (mosaicplot()) of a contingency table (table()) to display how the counts are distributed.

data %>% select(species, island, sex, year) %>% table() %>%  mosaicplot(color=T)

We are going to ask ourselves if the sampling effort across the three species was the same from 2007 to 2009.

data %>%
    select(year, species) %>% 
    table()
      species
year   Adelie Chinstrap Gentoo
  2007     44        26     33
  2008     50        18     45
  2009     52        24     41

What is our null hypothesis?

Solution

  • \(H_0\): the categorical variables year and species are independent
  • \(H_1\): the categorical variables year and species are not independent (some species are more or less sampled depending on the year)

You can use the chisq.test() function to test this hypothesis

chisq.test(x = data$year, data$species)

    Pearson's Chi-squared test

data:  data$year and data$species
X-squared = 3.2711, df = 4, p-value = 0.5135
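You can also inspect the expected counts the test compares the observations to; under independence, each expected count is (row total × column total) / grand total. A sketch:

tab <- table(data$year, data$species)
outer(rowSums(tab), colSums(tab)) / sum(tab) # expected counts under H_0
chisq.test(tab)$expected # the same matrix, as computed by chisq.test()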

Would you change the sampling strategy?

Solution

If something doesn’t seem to be broken, don’t fix it (with an \(\alpha\) threshold of 0.05).

To study the \(\chi^2\) independence test, we are going to generate fake sampling years with the fake_year() function.

In 2010, you sent a student on the field to sample 200 penguins. However, for some reason, this student seems to prefer the Chinstrap penguins.

fake_data <- data %>%
    select(year, species) %>% 
    bind_rows(
        fake_year(2010, size = 200, p=c(1/10, 8/10, 1/10))
    )
table(fake_data)
      species
year   Adelie Chinstrap Gentoo
  2007     44        26     33
  2008     50        18     45
  2009     52        24     41
  2010     14       168     18

Can you detect this sampling bias in your data?

Solution

chisq.test(fake_data$year, fake_data$species)

    Pearson's Chi-squared test

data:  fake_data$year and fake_data$species
X-squared = 207.75, df = 6, p-value < 2.2e-16

We reject \(H_0\) with the additional sampling of 2010.

When using the \(\chi^2\) independence test, we don’t know which part of the data is going to show some dependence. Here, for most of the data (3/4 of the sampling years), the two variables don’t show signs of dependence.

Sadly for penguin fieldwork, there were some severe cuts in the sub-antarctic island trip budget. The number of penguins that you were able to sample (after a serious discussion with your student) dropped dramatically.

poor_fake_data <- data %>%
    select(year, species) %>% 
    bind_rows(
        fake_year(2010, size = 200, p=c(1/10, 8/10, 1/10)),
        fake_year(2011, size = 100, p=c(1/10, 8/10, 1/10)),
        fake_year(2012, size = 50, p=c(1/3, 1/3, 1/3)),
        fake_year(2013, size = 10, p=c(1/3, 1/3, 1/3))
    )
table(poor_fake_data)
      species
year   Adelie Chinstrap Gentoo
  2007     44        26     33
  2008     50        18     45
  2009     52        24     41
  2010     28       149     23
  2011      9        77     14
  2012     16        13     21
  2013      1         4      5

What is different in the chisq.test() output?

Solution

chisq.test(poor_fake_data$year, poor_fake_data$species)

    Pearson's Chi-squared test

data:  poor_fake_data$year and poor_fake_data$species
X-squared = 209.73, df = 12, p-value < 2.2e-16

We have a warning because one of the categories (2013) doesn’t have enough counts. Instead of relying on the asymptotic \(\chi^2\) distribution of the test statistic, you can use the simulate.p.value = T parameter.

chisq.test(poor_fake_data$year, poor_fake_data$species, simulate.p.value = T)

    Pearson's Chi-squared test with simulated p-value (based on 2000
    replicates)

data:  poor_fake_data$year and poor_fake_data$species
X-squared = 209.73, df = NA, p-value = 0.0004998
Now the precision of the \(p\)-value is limited by the number of replicates. The default B = 2000 parameter implies a minimum \(p\)-value of about \(0.0005\) (\(1/(B+1)\)).
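A sketch using the B parameter of chisq.test() to increase the number of replicates, which lowers the attainable minimum \(p\)-value to \(1/(B+1)\):

chisq.test(poor_fake_data$year, poor_fake_data$species,
    simulate.p.value = TRUE, B = 100000)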

(optional) Asymptotic vs exact independence tests

For unknown reasons, the penguins have recently fled the sampling locations and it has become difficult to sample penguins.

Here are the new data sampled from 2014 to 2018:

new_data <- tibble(
    year = c(
        2014, 2014, 2014, 2014, 2014, 2014, 2014, 
        2014, 2014, 2014, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 
        2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 
        2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 
        2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018),
    species = factor(c(
        "Adelie", "Chinstrap", "Chinstrap", "Adelie", "Adelie", "Chinstrap", 
        "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo", 
        "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo", "Gentoo", 
        "Gentoo", "Chinstrap", "Gentoo", "Adelie", "Chinstrap", "Chinstrap", 
        "Gentoo", "Adelie", "Adelie", "Gentoo", "Adelie", "Gentoo", "Gentoo", 
        "Gentoo", "Chinstrap", "Chinstrap", "Gentoo", "Chinstrap", "Gentoo", 
        "Adelie", "Chinstrap", "Gentoo", "Adelie", "Gentoo", "Adelie", 
        "Adelie", "Gentoo", "Chinstrap", "Chinstrap", "Adelie"))
)

table(new_data$year, new_data$species)
      
       Adelie Chinstrap Gentoo
  2014      3         3      4
  2015      0         0     10
  2016      3         3      4
  2017      2         3      5
  2018      4         3      3

Again, we want to know if the individuals from the different species are uniformly sampled across the years, or in other words, if the variables year and species are independent.

What test could we use, and what would be the hypotheses in this case?

Solution

We could use a \(\chi^2\) independence test with the following hypotheses:

  • \(H_0\): the categorical variable year and species are independent
  • \(H_1\): the categorical variable year and species are not independent (some species are more or less sampled depending on the year)

Run the test. What is the problem here?

Solution

chisq.test(new_data$year, new_data$species)

    Pearson's Chi-squared test

data:  new_data$year and new_data$species
X-squared = 12.756, df = 8, p-value = 0.1205

The sample size is too small, so the test statistic might not follow the asymptotic \(\chi^2\) distribution.


In this specific case, since we are working with a contingency table (i.e. cross-tabulated counts of two categorical variables), we can build an exact test (i.e. a test that does not rely on an asymptotic approximation), e.g. Fisher’s exact test.

“As pointed out by Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the cells of the table.” (wikipedia)

We can run this test with:

fisher.test(new_data$year, new_data$species)

    Fisher's Exact Test for Count Data

data:  new_data$year and new_data$species
p-value = 0.08728
alternative hypothesis: two.sided

What can you conclude (at a risk \(\alpha=10\%\))? Compare the result with the \(\chi^2\) independence test.

Solution

Here the \(p\)-value is equal to 0.0872844, which is lower than \(\alpha=10\%\), so we reject the null hypothesis of independence between the year and the species.

When using the \(\chi^2\) independence test, the \(p\)-value was equal to 0.1205101 and we would not have rejected the null hypothesis of independence between the year and the species at a level \(\alpha=10\%\).


Nonparametric tests (optional)

Nonparametric tests, like rank tests, belong to a branch of statistics that is not based solely on parametrized families of probability distributions. For example, rank tests rely only on the ranks of the observations.

Shapiro–Wilk test

The Shapiro-Wilk test is a non-parametric test of the normality of a distribution. You can check the Wikipedia page of the test for the statistic formula.

shapiro.test(gentoo_weight)

    Shapiro-Wilk normality test

data:  gentoo_weight
W = 0.98606, p-value = 0.2605

Was our choice of using the \(t\)-test the right one?

Solution

We don’t have enough evidence to reject \(H_0\) (normality) with a threshold of \(\alpha = 0.05\), so the \(t\)-test was a reasonable choice.


Kolmogorov–Smirnov test

The Kolmogorov–Smirnov test is also a non-parametric test, used to test whether two samples are drawn from the same distribution. Intuitively, the statistic takes the largest absolute difference between the two empirical cumulative distribution functions across all values.

Why does the following command return a p-value < 2.2e-16?

ks.test(gentoo_weight, rnorm(length(gentoo_weight)))

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  gentoo_weight and rnorm(length(gentoo_weight))
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
Solution

The two empirical CDFs are extremely different:

plot_2_ecdf(
    x = gentoo_weight,
    y = rnorm(length(gentoo_weight))
)

We need to center and scale our gentoo_weight.

plot_2_ecdf(
    x = scale(gentoo_weight),
    y = rnorm(length(gentoo_weight))
)

ks.test(scale(gentoo_weight), rnorm(length(gentoo_weight)))

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  scale(gentoo_weight) and rnorm(length(gentoo_weight))
D = 0.07563, p-value = 0.8855
alternative hypothesis: two-sided

Why use rnorm() when we know the CDF of the \(\mathcal{N}(0,1)\) distribution?

ks.test(scale(gentoo_weight), pnorm)

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  scale(gentoo_weight)
D = 0.069124, p-value = 0.6204
alternative hypothesis: two-sided

Wilcoxon test

The two-sample paired Wilcoxon test is a non-parametric equivalent of the two-sample paired \(t\)-test; it compares medians (instead of means) of the two groups.

wilcox.test(
    adelie_bill_depth,
    adelie_bill_length,
    paired = T
)

    Wilcoxon signed rank test with continuity correction

data:  adelie_bill_depth and adelie_bill_length
V = 0, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Compare the results of this test to the two-sample paired Student test:

Solution

t.test(
    adelie_bill_depth,
    adelie_bill_length,
    paired = T
)

    Paired t-test

data:  adelie_bill_depth and adelie_bill_length
t = -100.42, df = 145, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -20.87975 -20.07368
sample estimates:
mean difference 
      -20.47671 

Wilcoxon-Mann-Whitney test

The two-sample Wilcoxon-Mann-Whitney test (WMW test) is a non-parametric equivalent of the two-sample unpaired Student test; it compares medians (instead of means) of the two groups.

wilcox.test(adelie_weight, chinstrap_weight)

    Wilcoxon rank sum test with continuity correction

data:  adelie_weight and chinstrap_weight
W = 4710, p-value = 0.5476
alternative hypothesis: true location shift is not equal to 0

Compare the results of this test to the two-sample Student test:

Solution

t.test(adelie_weight, chinstrap_weight)

    Welch Two Sample t-test

data:  adelie_weight and chinstrap_weight
t = -0.44793, df = 154.03, p-value = 0.6548
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -145.66494   91.81724
sample estimates:
mean of x mean of y 
 3706.164  3733.088 

If the data follow the distribution assumption of the parametric test, the parametric test is always more powerful than the nonparametric test.


Kruskal–Wallis test

The Kruskal–Wallis test extends the Wilcoxon-Mann-Whitney test to more than 2 groups. If at least one group follows a different distribution than the others, the \(p\)-value will be small.

kruskal.test(list(gentoo_weight, adelie_weight, chinstrap_weight))

    Kruskal-Wallis rank sum test

data:  list(gentoo_weight, adelie_weight, chinstrap_weight)
Kruskal-Wallis chi-squared = 212.09, df = 2, p-value < 2.2e-16