Tutorial 2: Simulation, Estimation, Sampling variability and Robustness

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

Requirements

We first load the tidyverse meta-package that we will be needing.

# required packages
library(tidyverse)

Note: we will only need the ggplot2 and dplyr packages from the tidyverse meta-package. Thus, if you encounter issues when installing tidyverse, you can install only these two packages and load them with the commands library(ggplot2) and library(dplyr).

Some custom functions are available in this file. You can load them into R with the following command:

source("https://gitbio.ens-lyon.fr/LBMC/hub/formations/ens_l3_stats/-/raw/main/tutorial_2_estimation_simulation/utils.R?inline=false")

Standard probability distributions and simulations

First, we will learn how to generate random values from a given probability distribution, i.e. draw observations of a random variable following a given probability distribution.

What is a collection of observations from a random variable? How would you define it in a formal way?

Solution

Such a collection of observations from a random variable is called a sample.

Here is a formal definition: assume we have a random variable \(X\) following a given probability distribution. A \(n\)-sample associated to \(X\) is a collection of \(n\) random variables \((X_i)_{i=1,...,n}\) independent and identically distributed (i.i.d.) according to the probability distribution of \(X\). The corresponding observed sample \((x_i)_{i=1,...,n}\) contains the actual observed value \(x_i\) for each random variable \(X_i\). It corresponds to repeated measures of the variable \(X\) on different individuals selected at random.

In the following, we will simulate the observed values \((x_i)_{i=1,...,n}\) according to a generative model, i.e. when assuming that \(X\) follows a known and given probability distribution among standard probability distributions.


Discrete probability distributions

What discrete probability distributions can you think of? And what are they useful to model?

Solution

Bernoulli (binary outcome), Binomial (count of binary outcomes), Multinomial (multi-categorical outcome), Poisson (count of discrete occurrences),

and more: Geometric, Negative Binomial, Hypergeometric, etc. (that we may not see)


Use R functions to generate random values drawn from the Bernoulli, Binomial and Poisson distributions. Compute the corresponding empirical mean and variance, and compare these to their theoretical values.

Hint:

# Bernoulli B(0.3)
obs <- rbinom(n = 100, size = 1, prob = 0.3)
# Binomial B(10, 0.4)
obs <- rbinom(n = 100, size = 10, prob = 0.4)
# Poisson P(5)
obs <- rpois(n = 100, lambda = 5)
# empirical mean
mean(obs)
# empirical variance
var(obs)
Solution

  • Bernoulli \(\BB(0.3)\): mean \(= 0.3\) and variance \(= 0.3 \times (1 - 0.3) =\) 0.21
# random data
obs <- rbinom(n = 100, size = 1, prob = 0.3)
# empirical mean
mean(obs)
[1] 0.32
# empirical variance
var(obs)
[1] 0.219798
  • Binomial \(\BB(10, 0.3)\): mean \(= 10 \times 0.3 =\) 3 and variance \(= 10 \times 0.3 \times (1 - 0.3) =\) 2.1
# random data
obs <- rbinom(n = 100, size = 10, prob = 0.3)
# empirical mean
mean(obs)
[1] 3.06
# empirical variance
var(obs)
[1] 1.794343
  • Poisson \(\poi(5)\): mean \(= 5\) and variance \(= 5\)
# random data
obs <- rpois(n = 100, lambda = 5)
# empirical mean
mean(obs)
[1] 4.76
# empirical variance
var(obs)
[1] 5.234747

Continuous probability distributions

What are the continuous probability distributions that you could think of?

Solution

Uniform, Gaussian,

Student, \(\chi^2\), Fisher (c.f. later)

and more: Exponential, Gamma, Beta, Dirichlet, etc. (that we may not see)


Use R functions to generate random values drawn from the Uniform and Gaussian distributions. Compute the corresponding empirical mean and variance, and compare these to their theoretical values.

Hint:

# Uniform U[0, 10]
obs <- runif(n = 100, min = 0, max = 10)
# Gaussian N(3, 4)
obs <- rnorm(n = 100, mean = 3, sd = 4)
Solution

  • Uniform \(\UU(0, 10)\): mean \(= \frac{0 + 10}{2} =\) 5 and variance \(= \frac{(10-0)^2}{12} =\) 8.3333333
# random data
obs <- runif(n = 100, min = 0, max = 10)
# empirical mean
mean(obs)
[1] 5.568295
# empirical variance
var(obs)
[1] 8.630765
  • Gaussian \(\NN(3, 4)\): mean \(= 3\) and standard deviation \(= 4\) (hence variance \(= 16\))
# random data
obs <- rnorm(n = 100, mean = 3, sd = 4)
# empirical mean
mean(obs)
[1] 3.368577
# empirical variance
var(obs)
[1] 14.7001

Sampling variability

What is the link between the empirical mean/variance and the theoretical mean/variance respectively?

Solution

The empirical mean and variance are estimators of their theoretical counterparts. The observed value for the mean (resp. the variance) on a given observed sample is an estimation of the theoretical mean (resp. variance).


Why is the empirical mean (resp. variance) not exactly equal to the theoretical mean (resp. variance) in the simulations of the previous section?

Solution

Because of the variability due to the randomness of the sampling.


We generate multiple random samples of size 100 following the same \(\UU[0,10]\) uniform distribution and we compute the empirical mean for each sample. What do you see?

mean_estim <- replicate(1000, mean(runif(n = 100, min = 0, max = 10)))
hist(mean_estim)

Solution

We drew the histogram of all values of the empirical mean computed on the different samples.

The observed values for the empirical mean vary across samples. This phenomenon is called sampling variability. And this variability due to the randomness of the samples follows a specific pattern (c.f. later).

The sample corresponds to a subset of the entire population. It is gathered by selecting random individuals because we generally cannot measure the quantity of interest on the entire population. Multiple samples will generally give different values for the estimation of a given quantity.


Asymptotic behavior

Estimating the mean

We want to generate a random sample from a given probability distribution and compute the empirical mean for different sample sizes.

Proceed and verify the value of empirical mean when the sample size grows.

What do you see? What is the name of this result?

Hint: you can use the following code; just replace xxxx with the corresponding sample generation using the R generators for standard probability distributions (see rnorm(), rpois(), rbinom(), rexp(), etc.), and replace yyyy with the theoretical value of the mean that you used to simulate the data.

# define increasing sample sizes (from n = 10 to n = 10000)
sample_size <- seq(10, 10000, by = 10)
# generate the samples for each sample size and compute the empirical mean
mean_estim <- sapply(sample_size, function(n) mean( xxxx ))
# define the theoretical mean
theoretical_mean <- yyyy
# plot the empirical mean values depending on the sample size
plot_mean_vs_sample_size(sample_size, mean_estim, theoretical_mean)

Note: the plot_mean_vs_sample_size() custom function should be available if you run the source("https://...") command from the Requirements section.

Solution

For the Gaussian distribution, rnorm(n) generates a sample of size n from the standard Gaussian distribution \(\NN(0,1)\) with theoretical mean = 0.

sample_size <- seq(10, 10000, by = 10)
mean_estim <- sapply(sample_size, function(n) mean( rnorm(n) ))
theoretical_mean <- 0
plot_mean_vs_sample_size(sample_size, mean_estim, theoretical_mean)

For the Bernoulli distribution, rbinom(n, 1, 0.3) generates a sample of size n from the Bernoulli distribution \(\BB(0.3)\) with theoretical mean = 0.3.

sample_size <- seq(10, 10000, by = 10)
mean_estim <- sapply(sample_size, function(n) mean( rbinom(n, 1, 0.3) ))
theoretical_mean <- 0.3
plot_mean_vs_sample_size(sample_size, mean_estim, theoretical_mean)

For the Poisson distribution, rpois(n, 10) generates a sample of size n from the Poisson distribution \(\poi(10)\) with theoretical mean = 10.

sample_size <- seq(10, 10000, by = 10)
mean_estim <- sapply(sample_size, function(n) mean( rpois(n, 10) ))
theoretical_mean <- 10
plot_mean_vs_sample_size(sample_size, mean_estim, theoretical_mean)

In all cases, the empirical mean converges toward the value of the theoretical mean when the sample size grows. This result is known as the law of large numbers.


Quantifying the uncertainty regarding the empirical mean

Now we generate multiple random samples of a fixed size \(n = 100\) from the \(\UU[0,10]\) uniform distribution, and we compute the empirical mean on each sample.

# generate 1000 samples (from the uniform distribution U[0,10]) of size 100
samples <- replicate(1000, runif(n = 100, min = 0, max = 10), simplify = FALSE)
# compute the empirical mean for each sample
mean_estim <- sapply(samples, mean)

How could we represent the empirical distribution of the empirical means across all samples?

Solution

We can use a histogram or represent the empirical density.


Represent the empirical distribution of the empirical mean. What can you say about this distribution?

Solution

# plot the empirical distribution
hist(mean_estim)

This empirical distribution seems to look like a Gaussian density.


If we generate random data from a given probability distribution with expectation \(\mu\) and variance \(\sigma^2\), what would be the values of the expectation \(\EE(\bar{X}_n)\) and variance \(\VV(\bar{X}_n)\) of the empirical mean \(\bar{X}_n\)?

Solution

\(\EE(\bar{X}_n) = \mu\) and \(\VV(\bar{X}_n) = \frac{\sigma^2}{n}\)
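We can check this on the mean_estim values computed above from the \(\UU[0,10]\) samples of size \(n = 100\), for which \(\mu = 5\) and \(\sigma^2 = 100/12 \approx 8.33\) (a minimal numerical sketch):

# should be close to mu = 5
mean(mean_estim)
# should be close to sigma^2 / n = (100 / 12) / 100, i.e. about 0.083
var(mean_estim)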


We now generate random samples from the \(\poi(5)\) Poisson distribution.

# generate 1000 samples (from the Poisson distribution P(5)) of size 100
samples <- replicate(1000, rpois(n = 100, lambda = 5), simplify = FALSE)
# compute the empirical mean for each sample
mean_estim <- sapply(samples, mean)

What are the values of \(\mu\) and \(\sigma\) in this case?

Solution

\(\mu = 5\) and \(\sigma = \sqrt{5}\)


Represent the empirical distribution of the empirical mean across the different samples and compare it to the \(\NN(\mu,\sigma^2/n)\) Gaussian distribution density.

What can you say? What result are we illustrating?

Hint: you can use the custom function hist_mean_vs_gaussian_density(mean_estim, n, mu, sigma).

Solution

hist_mean_vs_gaussian_density(mean_estim, n = 100, mu = 5, sigma = sqrt(5))

The empirical distribution of the empirical mean \(\bar{X}_n\) is very close to a Gaussian distribution \(\NN(\mu,\sigma^2/n)\).

When the sample size \(n\) grows, the distribution of \(\bar{X}_n\) converges towards a \(\NN(\mu,\sigma^2/n)\) Gaussian distribution, whatever the probability distribution underlying the generation of the sample^1. This result is given by the central limit theorem. Equivalently, the distribution of the quantity \(\sqrt{n}\frac{\bar{X}_n - \mu}{\sigma}\) converges towards a \(\NN(0,1)\) Gaussian distribution.
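As a base-R sketch (reusing the mean_estim values computed on the \(\poi(5)\) samples above, with \(n = 100\), \(\mu = 5\), \(\sigma = \sqrt{5}\)), we can standardize the empirical means and compare them to the standard Gaussian density:

# standardize the empirical means: sqrt(n) * (mean - mu) / sigma
z <- sqrt(100) * (mean_estim - 5) / sqrt(5)
# histogram on the density scale, with the N(0,1) density overlaid
hist(z, freq = FALSE)
curve(dnorm(x), col = "red", add = TRUE)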


Now, generate the samples from a \(\BB(0.15)\) Bernoulli distribution but decrease the sample size \(n\).

Again, represent the empirical distribution of the empirical mean across the different samples and compare it to the \(\NN(\mu,\sigma^2/n)\) Gaussian distribution density.

What would you expect? What do you see?

Hint: what are the values of \(\mu\) and \(\sigma\) in this case?

Solution

For a Bernoulli distribution \(\BB(p)\), we have \(\mu = p\) and \(\sigma = \sqrt{p(1-p)}\)

# n = 50
samples <- replicate(
    1000, rbinom(n = 50, size = 1, prob = 0.15), simplify = FALSE)
mean_estim <- sapply(samples, mean)
hist_mean_vs_gaussian_density(
    mean_estim, n = 50, mu = 0.15, sigma = sqrt(0.15*(1-0.15)))

# n = 25
samples <- replicate(
    1000, rbinom(n = 25, size = 1, prob = 0.15), simplify = FALSE)
mean_estim <- sapply(samples, mean)
hist_mean_vs_gaussian_density(
    mean_estim, n = 25, mu = 0.15, sigma = sqrt(0.15*(1-0.15)))

# n = 10
samples <- replicate(
    1000, rbinom(n = 10, size = 1, prob = 0.15), simplify = FALSE)
mean_estim <- sapply(samples, mean)
hist_mean_vs_gaussian_density(
    mean_estim, n = 10, mu = 0.15, sigma = sqrt(0.15*(1-0.15)))

When the sample size decreases, the distribution of the empirical mean looks less and less like a Gaussian distribution.

In particular, in this case, values far away from the theoretical mean become more frequent than would be expected with a Gaussian distribution.

It means that when the sample size is large, we can use the central limit theorem to approximate the distribution of the empirical mean by a Gaussian distribution, but when the sample size is too small, we cannot.


(Optional) Quantifying the uncertainty regarding the empirical variance

We generate multiple random samples of a fixed size \(n = 10\) from a \(\NN(\mu,\sigma^2)\) Gaussian distribution with \(\mu = 3\) and \(\sigma^2 = 4\), and we compute the empirical variance on each sample.

# generate 1000 samples (from the Gaussian distribution N(3,4)) of size 10
samples <- replicate(
    1000, rnorm(n = 10, mean = 3, sd = sqrt(4)), simplify = FALSE)
# compute the empirical variance for each sample
var_estim <- sapply(samples, var)

Represent the empirical distribution of the empirical variance across the different samples and compare it to a \(\chi^2(n-1)\) distribution^2 density.

What can you say? What result are we illustrating?

Hint: you can use the custom function hist_variance_vs_chi2(var_estim, n, sigma).

Solution

hist_variance_vs_chi2(var_estim, n = 10, sigma = sqrt(4))

The probability distribution of the empirical variance is a rescaled \(\chi^2(n-1)\) distribution.

In particular, the empirical variance estimator \(S^2\) verifies \(\frac{n-1}{\sigma^2}S^2 \sim \chi^2(n-1)\).

Note: \(S^2\) is the unbiased estimator of the variance, c.f. later.
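As a quick numerical check (a sketch reusing the var_estim values computed above, with \(n = 10\) and \(\sigma^2 = 4\)), recall that a \(\chi^2(n-1)\) distribution has mean \(n-1\) and variance \(2(n-1)\):

# rescale the empirical variances: (n-1)/sigma^2 * S^2
chi2_obs <- (10 - 1) / 4 * var_estim
# should be close to n - 1 = 9
mean(chi2_obs)
# should be close to 2 * (n - 1) = 18
var(chi2_obs)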


Why is the empirical variance following a \(\chi^2\) distribution in this case?

Solution

The empirical variance is (up to a scaling factor^3) a sum of squared Gaussian random variables, because we assumed that the sample is a collection of observations of independent and identically distributed Gaussian random variables. And the sum of the squared deviations of \(n\) i.i.d. Gaussian random variables from their empirical mean follows (once rescaled) a \(\chi^2(n-1)\) distribution^4.

This result is exact and not asymptotical thanks to the Gaussian assumption on the data.


(Optional) Combining the uncertainty regarding the empirical mean and the empirical variance

Again, we generate multiple random samples of a fixed size \(n = 10\) from a \(\NN(\mu,\sigma^2)\) Gaussian distribution with \(\mu = 3\) and \(\sigma^2 = 4\), and we now compute the empirical mean \(\bar{X}_n\) and variance \(S^2\) on each sample.

# generate 1000 samples (from the Gaussian distribution N(3,4)) of size 10
samples <- replicate(
    1000, rnorm(n = 10, mean = 3, sd = sqrt(4)), simplify = FALSE)
# compute the empirical mean for each sample
mean_estim <- sapply(samples, mean)
# compute the empirical variance for each sample
var_estim <- sapply(samples, var)

Represent the empirical distribution of the quantity \(\sqrt{n}\frac{\bar{X}_n - \mu}{\sqrt{S^2}}\) across the different samples and compare it to a \(\TT(n-1)\) Student distribution density and a \(\NN(0,1)\) Gaussian distribution density.

What can you say? What result are we illustrating?

Hint: you can use the custom function hist_scaled_mean_vs_student(mean_estim, var_estim, n, mu, sigma).

Solution

hist_scaled_mean_vs_student(
    mean_estim, var_estim, n = 10, 
    mu = 3, sigma = sqrt(4))

The probability distribution of the quantity \(\sqrt{n}\frac{\bar{X}_n - \mu}{\sqrt{S^2}}\) is a \(\TT(n-1)\) Student distribution.


Why is the quantity \(\sqrt{n}\frac{\bar{X}_n - \mu}{\sqrt{S^2}}\) following a \(\TT(n-1)\) Student distribution in this case?

Solution

Thanks to the Gaussian assumption on the data, the quantity \(\sqrt{n}\frac{\bar{X}_n - \mu}{\sqrt{S^2}}\) is the ratio of a Gaussian random variable and the square root of a (rescaled) \(\chi^2\) random variable, which is precisely the definition of the Student distribution.

This result is exact and not asymptotical thanks to the Gaussian assumption on the data.


(Optional) Unbiased empirical variance

We generate a sample of a fixed size \(n = 50\) from the \(\NN(0,1)\) standard Gaussian distribution, and we compute the empirical variance on this sample with the var() function and with \(\hat{\sigma}^2_n = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2\).

# generate a sample
obs <- rnorm(50)
# compute the empirical variance
var(obs)
[1] 1.173189
# compute hat_sigma2_n
sum((obs - mean(obs))^2) / length(obs)
[1] 1.149725

Can you explain the difference between the value computed by the var() function and the value of \(\hat{\sigma}^2_n\)?

Solution

The var() function computes the unbiased estimation of the variance: \(S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2\)

# compute the empirical variance
var(obs)
[1] 1.173189
# compute S^2
sum((obs - mean(obs))^2) / (length(obs) - 1)
[1] 1.173189

We can show that \(\EE[\hat{\sigma}^2_n] = \frac{n-1}{n}\sigma^2 \ne \sigma^2\), where \(\sigma^2\) is the true value of the variance, whereas \(\EE[S^2] = \sigma^2\).
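We can illustrate this bias by simulation. Here is a minimal sketch, assuming the standard Gaussian \(\NN(0,1)\), for which \(\sigma^2 = 1\):

# biased variance estimator computed on 10000 samples of size n = 10
n <- 10
hat_sigma2 <- replicate(10000, {
    obs <- rnorm(n)
    sum((obs - mean(obs))^2) / n
})
# close to (n-1)/n * sigma^2 = 0.9, not to sigma^2 = 1
mean(hat_sigma2)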


Estimation

In practice (when doing scientific research, e.g. analyzing some data), we generally cannot collect multiple samples. Indeed, it is usually not possible to generate/collect as much data as we want, potentially in multiple samples, like we did previously with simulations. We have a single sample of given and fixed size, and we have to work with it.

In that case, how can we quantify the variability and uncertainty regarding a quantity or character of interest with a single sample (e.g. using a single value of the empirical mean and of the empirical variance)?

Solution

To quantify the variability and uncertainty regarding a quantity or character, we have to rely on statistical models that are either pertinent given the data characteristics and/or rely on statistical approximation results.


In this section, we will do a real data analysis. We will work with the penguins dataset from the palmerpenguins package [1].

[1] Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0 (https://allisonhorst.github.io/palmerpenguins/) doi: 10.5281/zenodo.3960218

We first load the dataset. It contains various physical characteristics of penguins sampled from different populations (different species in this case).

library(palmerpenguins)
data(penguins)
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

We will focus on the “Adelie” penguin species. We select the corresponding individuals and remove any individual with missing data.

adelie_penguins <- penguins %>% filter(species == "Adelie") %>% drop_na()

From now on, we will work with the adelie_penguins data table.

adelie_penguins
# A tibble: 146 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 136 more rows
# ℹ 2 more variables: sex <fct>, year <int>

How many observations are there in the adelie_penguins dataset?

Solution

nrow(adelie_penguins)
[1] 146

Formalization of the statistical problem

Say we want to study a certain attribute/feature of the Adelie penguins, e.g. their body mass.

How would we proceed? Formalize the problem from the statistical point of view.

Solution

We would collect a sample and measure the attribute/feature of interest on each individual in the collected sample. Then, we can compute the empirical mean for this attribute/feature, compute a confidence interval for the population mean, and much more.

From a formal statistical point of view, we say that the value of the attribute/feature of interest measured on a given individual penguin is given by a random variable \(X\). The sample will be a collection of independent and identically distributed (i.i.d.) random variables \((X_i)_{i=1,...,n}\) following the same probability distribution as \(X\). The observed values for the \(X_i\)'s in the sample are noted \((x_i)_{i=1,...,n}\).


Sampling uncertainty quantification

How can we quantify the uncertainty and variability regarding the attribute/feature of interest in the entire population using a single sample?

In particular, which tools can we use to construct a probability distribution regarding the empirical mean for this attribute/feature of interest?

Solution

We will need to construct a statistical model. Depending on the situation, we can use two different approaches (in general).

  1. Depending on the characteristics of the attribute/feature of interest, we can use a statistical model, i.e. we can assume that the corresponding random variable follows a given probability distribution: e.g. Bernoulli if we are studying a binary feature, Poisson if we are counting occurrences of a phenomenon, Gaussian if we are studying a numerical feature.

Based on this assumption, we can sometimes determine the explicit probability distribution for the empirical mean (e.g. normalized binomial if the \(X_i\) are Bernoulli, Gaussian if the \(X_i\) are Gaussian) and use this distribution to quantify uncertainty regarding the attribute/feature of interest.

  2. If the sample size is large enough, we can use an asymptotic result like the central limit theorem to approximate the probability distribution of the empirical mean.

However, in all cases, the true values of the parameters of the underlying distributions^5 are unknown (contrary to simulation studies where we know them), and we can only work with their estimations.


What about the empirical variance? How can we derive its probability distribution?

Solution

Regarding the empirical variance, we generally need a statistical model. For instance, we saw that assuming that the data are Gaussian, i.e. assuming that the random variable measuring the attribute/feature of interest follows a Gaussian distribution, we can derive that the empirical variance follows a (rescaled) \(\chi^2\) distribution.

However, even with a statistical model, it is sometimes very complicated (or even impossible) to derive the probability distribution for the empirical variance.

It is sometimes possible to derive an asymptotic convergence result for the probability distribution of the empirical variance, but it can be difficult outside the Gaussian framework.


Descriptive statistics

We will focus on the body mass of the Adelie penguins (column body_mass_g in the adelie_penguin table).

body_mass <- adelie_penguins$body_mass_g

Compute the empirical mean and empirical variance for the Adelie penguin body mass, manually and with the corresponding R functions.

You can also compute the empirical standard deviation.

For the variance, should we divide by \(n\) or \(n-1\)?

Solution

# number of observations
n_obs <- length(body_mass)
# empirical mean
sum(body_mass) / n_obs
[1] 3706.164
mean(body_mass)
[1] 3706.164
# empirical variance
sum((body_mass - mean(body_mass))^2) / n_obs
[1] 208891.8
sum((body_mass - mean(body_mass))^2) / (n_obs - 1)
[1] 210332.4
var(body_mass)
[1] 210332.4
# standard deviation
sqrt(var(body_mass))
[1] 458.6201
sd(body_mass)
[1] 458.6201

Reminder: the var() function computes the unbiased estimator for the variance, i.e. sum((body_mass - mean(body_mass))^2) / (n_obs - 1), whereas sum((body_mass - mean(body_mass))^2) / n_obs is the biased estimator for the variance (c.f. previous section).

We should use the unbiased estimator (i.e. divide the sum of squared deviations from the mean by \(n-1\)).


What descriptive statistics and corresponding graphical representations can we use to evaluate the distribution of the observations in the sample?

Solution

We can compute the usual empirical quantiles, and use the boxplot visualization:

boxplot(body_mass, horizontal=TRUE)

We can also represent the empirical distribution of the observed values with a histogram:

hist(body_mass)


Relevance of the model

To go further, we will need to use a statistical model regarding our data. Here, the body mass is a numerical and continuous feature.

In this case, if the empirical distribution is symmetric, the default model would be to assume a Gaussian model for the feature of interest, i.e. the body mass here, because it gives access to various statistical tools (c.f. later).

To use a Gaussian model, we have to verify that the empirical distribution of the observations is indeed symmetric and looks like a Gaussian distribution.

We can use various tools to do so. How could we proceed?

Solution

We do not have access to the true probability distribution of the feature of interest because we cannot measure it on the entire population. Thus, we need to verify that the empirical distribution of the observations in the sample is close to a Gaussian distribution.

  1. We can compare the empirical cumulative distribution function of the standardized observations to the standard Gaussian cumulative distribution function (you can use the custom function plot_ecdf()):
plot_ecdf(body_mass)

  2. We can compare the empirical distribution of the standardized observations to the density of the standard Gaussian distribution (you can use the custom function hist_obs_vs_gaussian_density()):
hist_obs_vs_gaussian_density(body_mass)

  3. We can compare the empirical quantiles of the standardized observations to the theoretical quantiles of the standard Gaussian distribution using a “quantile-quantile plot” (a.k.a. qqplot). You can do this with the qqnorm() function^6. The points should approximately align on the line \(y = x\), meaning that the empirical quantiles approximately correspond to the standard Gaussian quantiles.
qqnorm(scale(body_mass))
abline(a=0, b=1, col = "red") # y=x line

Note: the standardized observations can be computed with the scale() function (e.g. scale(body_mass)).

Here, we can say that the observations are approximately Gaussian.

These results are visual and not quantitative (some tests exist to verify whether a set of observations significantly departs from a Gaussian distribution, but that goes beyond the scope of this tutorial).
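For reference, one such test available in base R is the Shapiro-Wilk normality test (shown here as an aside, without going into details): its null hypothesis is that the data are Gaussian, so a small p-value would indicate a significant departure from normality.

# Shapiro-Wilk normality test (null hypothesis: the data are Gaussian)
shapiro.test(body_mass)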


Important: a model is a simplified and quantified version of reality (which is driven by complex and random phenomena and processes). It will generally never be true, but one can hope that it is close enough to reality to be used in a statistical analysis and to derive pertinent results and conclusions about the studied feature.

Confidence interval

Since we have a Gaussian model, we will use it to build a confidence interval for the population average body mass of the Adelie penguin (which is unknown but that we are studying).

What are the limitations with the empirical mean estimation?

Solution

The empirical mean is a single value that is expected to be not too far (depending on the sample size) from the population average, but it does not give any insight about how far the estimation actually is from the true value of the population average.


How can we proceed to quantify the uncertainty about how far the observed empirical mean is from the true population average value?

Solution

We can construct an interval estimator that would contain the true value of the population average with high probability, called a confidence interval.


What are the boundaries of a confidence interval?

Solution

The boundaries of a confidence interval are random variables.


From the Gaussian model for the body mass, build the confidence interval for the population average that we note \(\mu\).

Solution

We assumed that our sample \((X_i)_{i=1,...,n}\) is i.i.d. following a Gaussian distribution with unknown population mean \(\mu\) and variance \(\sigma^2\).

From this, we can derive the following:

  • \(\bar{X}_n \sim \NN(\mu, \sigma^2/n)\)

  • \(\frac{n-1}{\sigma^2} S^2 \sim \chi^2(n-1)\)

Consequence: we have that \(\sqrt{n} \frac{\bar{X}_n - \mu}{\sqrt{S^2}} \sim \TT(n-1)\), i.e. a Student distribution with \(n-1\) degrees of freedom.

Eventually, we can use the Student probability function to write: \[\PP\Big( -t_{n-1,1-\alpha/2} \leq \sqrt{n} \frac{\bar{X}_n - \mu}{\sqrt{S^2}} \leq t_{n-1,1-\alpha/2} \Big) = 1 - \alpha\] for a given \(0 < \alpha < 1\) where \(t_{n-1,1-\alpha/2}\) is the quantile of level \(1-\alpha/2\) of the \(\TT(n-1)\) Student distribution.

As a reminder about the quantile definition for a symmetric confidence interval and a symmetric probability distribution: the quantiles \(-t_{n-1,1-\alpha/2}\) and \(t_{n-1,1-\alpha/2}\) delimit a central region of probability \(1 - \alpha\), each tail having probability \(\alpha/2\).

Now we can invert the inequalities to write an interval with random boundaries (because they depend on the estimators \(\bar{X}_n\) and \(S^2\)):

\[\PP\Big(\bar{X}_n - \frac{\sqrt{S^2}}{\sqrt{n}}t_{n-1,1-\alpha/2} \leq \mu \leq \bar{X}_n + \frac{\sqrt{S^2}}{\sqrt{n}}t_{n-1,1-\alpha/2} \Big) = 1 - \alpha\] The confidence interval for \(\mu\) is then \(\Big[ \bar{X}_n - \frac{\sqrt{S^2}}{\sqrt{n}}t_{n-1,1-\alpha/2}, \bar{X}_n + \frac{\sqrt{S^2}}{\sqrt{n}}t_{n-1,1-\alpha/2}\Big]\).

The observed confidence interval is \(\Big[ \bar{x}_n - \frac{\sqrt{s^2}}{\sqrt{n}}t_{n-1,1-\alpha/2}, \bar{x}_n + \frac{\sqrt{s^2}}{\sqrt{n}}t_{n-1,1-\alpha/2}\Big]\) where \(\bar{x}_n\) is the observed estimation of the empirical mean on the sample and \(s^2\) the observed estimation of the empirical variance on the sample.


Compute the observed confidence interval for the body mass.

Hint: you can use the qt() function to compute quantiles of the Student distribution.

Solution

# sample size
n_obs <- length(body_mass)

# empirical mean
x_bar <- mean(body_mass)

# empirical variance
s2 <- var(body_mass)

# confidence interval risk 5%
alpha <- 0.05

# quantile of level 1-alpha/2 for the T(n-1) distribution
t_quant <- qt(1-alpha/2, df = n_obs-1)

# lower bound
lower_bound <- x_bar - t_quant * sqrt(s2) / sqrt(n_obs)

# upper bound
upper_bound <- x_bar + t_quant * sqrt(s2) / sqrt(n_obs)

The observed confidence interval is \([\) 3631.1465319, 3781.1822353 \(]\).
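As a cross-check, this is exactly the Student-based confidence interval computed by the base R t.test() function:

# the 95% confidence interval reported by t.test() matches our computation
t.test(body_mass, conf.level = 0.95)$conf.int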


Is it true that the population average of the body mass is in the interval \([\) 3631.1465319, 3781.1822353 \(]\) with a 95% probability?

Solution

No, it is not true.

The confidence interval defined with random variable boundaries contains the population average with a 95% probability.

But the observation of this interval is a deterministic value given the observations in the sample. There is no random variable in the interval 3631.1465319 \(\leq \mu \leq\) 3781.1822353 since \(\mu\) is also a fixed (though unknown) value. Thus, it makes no sense to speak about a probability for the true parameter \(\mu\) to be in the observed interval.

However, what is true is that if we repeat the sampling numerous times, \(\mu\) will be in the observed confidence interval 95% of the time. But we generally cannot do that (unless we are working with simulations) because in practice we have access to a limited number of samples.
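Since we are working with simulations here, we can verify this coverage property. Here is a minimal sketch, assuming \(\NN(3, 4)\) samples of size 100:

# fraction of observed 95% confidence intervals containing mu = 3
mu <- 3
covered <- replicate(10000, {
    obs <- rnorm(100, mean = mu, sd = 2)
    half_width <- qt(0.975, df = 99) * sd(obs) / sqrt(100)
    (mean(obs) - half_width <= mu) & (mu <= mean(obs) + half_width)
})
# should be close to 0.95
mean(covered)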


(Optional) Robustness

The robustness of a statistical method (e.g. an estimation method) is its ability to not be affected by outliers, i.e. data points that significantly differ from the other observations.

Here is some information about the distribution of the observations in the sample regarding the body mass:

summary(body_mass)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2850    3362    3700    3706    4000    4775 
hist(body_mass)

We have two new observations for penguins with weights of 6000 g and 6500 g.

new_body_mass <- c(body_mass, 6000, 6500)

Why can these two new observations be considered as outliers in the sample?

Solution

These two new observations are significantly different from all the others.

hist(new_body_mass)


What could be the effect of outliers on the mean and variance estimators? Compare the mean and variance estimation in the original sample and in the new sample.

Solution

Outliers can modify the value of the estimations regarding the mean and the variance.

# mean
mean(body_mass)
[1] 3706.164
mean(new_body_mass)
[1] 3740.541
# mean ratio
(mean(new_body_mass) - mean(body_mass))/mean(body_mass)
[1] 0.0092754
# variance
var(body_mass)
[1] 210332.4
var(new_body_mass)
[1] 295173.5
# variance ratio
(var(new_body_mass) - var(body_mass))/var(body_mass)
[1] 0.4033667
# standard deviation
sd(body_mass)
[1] 458.6201
sd(new_body_mass)
[1] 543.2987
# standard deviation ratio
(sd(new_body_mass) - sd(body_mass))/sd(body_mass)
[1] 0.1846378

Both estimations are affected.

The effect on the empirical variance is bigger since it becomes 40.3366653% bigger, whereas the empirical mean only becomes 0.92754% bigger when accounting for the outliers.

Even when considering the standard deviation (which is the square root of the variance and thus homogeneous to the empirical mean), it becomes 18.4637773% bigger when accounting for the outliers.

Thus, the effect of the outliers in our sample is bigger on dispersion statistics, like the empirical variance and standard deviation, than on the empirical mean, which is a location statistic.


When computing a confidence interval, we can examine its width: the narrower the interval, the better the precision.

The width is the length of the interval, computed as the difference between the upper and the lower boundaries.

What is the width of the confidence interval computed in the previous section?

Solution

The confidence interval width is two times \(\frac{\sqrt{s^2}}{\sqrt{n}}t_{n-1,1-\alpha/2}\).
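Numerically (reusing the quantities t_quant, s2 and n_obs computed for body_mass in the previous section):

# observed confidence interval width: 2 * t_{n-1,1-alpha/2} * sqrt(s2) / sqrt(n)
2 * t_quant * sqrt(s2) / sqrt(n_obs)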


What could affect the width of this confidence interval?

Solution

If the empirical variance is larger, the width will be larger since it depends on \(\sqrt{s^2}\).

If the sample size is larger, the width will be smaller since it depends on \(\frac{1}{\sqrt{n}}\).


What could affect the robustness of a confidence interval estimator?

Solution

Since outliers tend to increase the empirical variance value, they would increase the observed confidence interval width, and thus decrease the confidence interval quality.

Thus, we would need to collect more data, i.e. get more observations in the sample (increase the sample size), to counter-balance the empirical variance increase.

Compare the confidence interval when using the first sample body_mass and in the second sample new_body_mass.

Solution

First sample:

# sample size
n_obs <- length(body_mass)

# empirical mean
x_bar <- mean(body_mass)

# empirical variance
s2 <- var(body_mass)

# confidence interval risk 5%
alpha <- 0.05

# quantile of level 1-alpha/2 for the T(n-1) distribution
t_quant <- qt(1-alpha/2, df = n_obs-1)

# lower bound
lower_bound <- x_bar - t_quant * sqrt(s2) / sqrt(n_obs)

# upper bound
upper_bound <- x_bar + t_quant * sqrt(s2) / sqrt(n_obs)

# observed confidence interval width
upper_bound - lower_bound
[1] 150.0357

Second sample:

# sample size
n_obs <- length(new_body_mass)

# empirical mean
x_bar <- mean(new_body_mass)

# empirical variance
s2 <- var(new_body_mass)

# confidence interval risk 5%
alpha <- 0.05

# quantile of level 1-alpha/2 for the T(n-1) distribution
t_quant <- qt(1-alpha/2, df = n_obs-1)

# lower bound
lower_bound <- x_bar - t_quant * sqrt(s2) / sqrt(n_obs)

# upper bound
upper_bound <- x_bar + t_quant * sqrt(s2) / sqrt(n_obs)

# observed confidence interval width
upper_bound - lower_bound
[1] 176.5127

(Optional) Sampling bias

We compute the average body mass for the male and female in the sample.

adelie_penguins %>% group_by(sex) %>% summarise(mean_body_mass = mean(body_mass_g))
# A tibble: 2 × 2
  sex    mean_body_mass
  <fct>           <dbl>
1 female          3369.
2 male            4043.

What could be the problem here when studying the body mass in this species, especially regarding the sample composition?

Solution

The average body mass of male individuals in the sample is higher than the average body mass of female individuals in the sample.

Then, if we sample only males, we risk over-estimating the Adelie penguin average body mass. If we sample only females, we risk under-estimating the Adelie penguin average body mass.

This phenomenon is called sampling bias. One must be careful that the sample is representative of the studied population.

Here is the composition of the sample:

table(adelie_penguins$sex)

female   male 
    73     73 

We should study the sex ratio for this species and verify that the sample is representative of the population (c.f. next section).


Confidence interval for the proportion

We now focus on the sex ratio in the Adelie penguin species, i.e. the proportion of females/males in the species (which varies across the animal kingdom and is not always around 50%-50%).

We encode the sex as a \(\{0, 1\}\) binary variable, indicating if the considered individual is a female (i.e. 1 = “female” and 0 = “male”). Thus, we will work on the estimation of the female proportion within this species^7.

We extract the corresponding data:

female <- as.integer(adelie_penguins$sex == "female")
female
  [1] 0 1 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 0 1 1 0 1 0
 [38] 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
 [75] 1 0 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
[112] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 0 1 0

What is the estimated proportion of females in the sample?

Solution

The estimated proportion of females in the sample, a.k.a. the empirical frequency, corresponds to the empirical mean of the \(\{0, 1\}\) binary variable indicating if each individual is a female over the sample.

It can also be computed by summing up the number of females in the sample and dividing by the sample size.

mean(female)
[1] 0.5
sum(female) / length(female)
[1] 0.5

We have 50% of females in this sample.


Formalize the statistical model that we will use.

Solution

We consider a Bernoulli model for the \(\{0,1\}\) female indicator since it is the only possible probability distribution for a binary value.

More formally, we denote by \(Y\sim\BB(p)\) the random variable indicating if a given Adelie penguin is a female, where \(p\) is the probability that the individual in question is a female.

Then \(p\) will also correspond to the proportion of females in the entire population.

And we have a sample \((Y_i)_{i=1,...,n}\) independent and identically distributed following a \(\BB(p)\) Bernoulli distribution.

From that sample, we want to get an estimation of \(p\).

The empirical mean or empirical frequency \(\bar{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i\) can also be noted \(\hat{p}\).


What is the probability distribution of the random variable \(B_n = \sum_{i=1}^n Y_i\) where the \(Y_i\) are independent and identically distributed variables following a \(\BB(p)\) Bernoulli distribution?

Solution

In this case, the probability distribution of \(B_n\) is known. It follows a \(\BB(n,p)\) Binomial distribution.
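We can check this empirically with a small sketch (assuming, for illustration, \(n = 20\) and \(p = 0.3\)): summing \(n\) Bernoulli draws or drawing directly from the Binomial distribution should give the same empirical distribution.

# sums of n = 20 Bernoulli(0.3) draws...
sums <- replicate(10000, sum(rbinom(n = 20, size = 1, prob = 0.3)))
# ...versus direct Binomial(20, 0.3) draws
direct <- rbinom(10000, size = 20, prob = 0.3)
# the empirical frequencies of each value should be very close
table(sums) / 10000
table(direct) / 10000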


We question the true sex ratio in the Adelie penguin population. How could we get more information about the proportion of females in the species, based on its estimation in the sample, than just a single estimated value?

Solution

We could compute a confidence interval for the female proportion in the population using its observed estimation \(\hat{p}\) in the sample.


What tool could we use to build a confidence interval for the proportion of females in the population?

Hint: we have two possibilities.

Solution

  1. Using the exact distribution (more complicated):

Since all \(Y_i\)s follow a \(\BB(p)\) Bernoulli distribution, we know that \(B_n = \sum_{i=1}^n Y_i\) follows a \(\BB(n,p)\) Binomial distribution. Thus the probability distribution of the quantity \(n \times \bar{Y}_n = n \times \hat{p} = B_n\) is known, and we can try to compute the \(\BB(n,p)\) Binomial distribution quantiles \(u_1\) and \(u_2\) such that \[\PP\Big(u_1 \leq B_n \leq u_2\Big) = 1 - \alpha\]

But remember that \[\PP\Big(u_1 \leq B_n \leq u_2\Big) = \sum_{k = u_1}^{u_2} \PP(B_n = k)\] and \[\PP(B_n = k) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}\]

Thus, the computation of boundaries over \(p\) will be very complicated (see the binom.test() sketch after this list).

  2. Asymptotic approximation:

If the sample size is large enough, we could use the central limit theorem, which says that \[\sqrt{n}\frac{\bar{Y}_n - \mu}{\sigma} \sim_n \NN(0,1)\] where \(\mu\) is the expectation and \(\sigma\) the standard deviation of a variable following a \(\BB(p)\) Bernoulli distribution, which are?

If the sample size is too small, we cannot use this method and have no choice but to use the complicated exact one (c.f. previous point).
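Regarding the exact approach of point 1, note that base R implements it in the binom.test() function (the so-called Clopper-Pearson interval), so we do not have to carry out the computation ourselves:

# exact Binomial confidence interval for the female proportion
binom.test(x = sum(female), n = length(female))$conf.int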


Build the confidence interval for the population average proportion of females \(p\).

Solution

We assume that our sample is large enough and we use the Gaussian approximation.

Here, \(\mu = p\) and \(\sigma^2 = p(1-p)\), thus \[\sqrt{n}\frac{\bar{Y}_n - p}{\sqrt{p(1-p)}} \sim \NN(0,1)\]

Then we can write \[\PP\Big( -z_{1-\alpha/2} \leq \sqrt{n} \frac{\bar{Y}_n - p}{\sqrt{p(1-p)}} \leq z_{1-\alpha/2} \Big) = 1 - \alpha\] for a given \(0 < \alpha < 1\) where \(z_{1-\alpha/2}\) is the quantile of level \(1-\alpha/2\) of the \(\NN(0,1)\) standard Gaussian distribution.

By inverting the following inequality \[-z_{1-\alpha/2} \leq \sqrt{n} \frac{\bar{Y}_n - p}{\sqrt{p(1-p)}} \leq z_{1-\alpha/2}\] we can get a confidence interval for \(p\) whose boundaries depend on the random variable \(\bar{Y}_n\).

But we have a problem since it is very complicated to extract \(p\) from the expression \(\frac{\bar{Y}_n - p}{\sqrt{p(1-p)}}\).


How can we proceed to simplify the confidence interval computation for \(p\) when using the Gaussian approximation?

Solution

In the formulation \(\sqrt{n}\frac{\bar{Y}_n - \mu}{\sigma} \sim_n \NN(0,1)\), we can replace the standard deviation \(\sigma\) by its estimation \(\sqrt{\hat{p}(1-\hat{p})}\) (remember we have a Bernoulli model) without modifying the Gaussian approximation^8 and then we get \[\sqrt{n}\frac{\bar{Y}_n - p}{\sqrt{\hat{p}(1-\hat{p})}} \sim_n \NN(0,1)\]

Thus, we can write \[\PP\Big( -z_{1-\alpha/2} \leq \sqrt{n}\frac{\bar{Y}_n - p}{\sqrt{\hat{p}(1-\hat{p})}} \leq z_{1-\alpha/2} \Big) = 1 - \alpha\] for a given \(0 < \alpha < 1\) where \(z_{1-\alpha/2}\) is the quantile of level \(1-\alpha/2\) of the \(\NN(0,1)\) standard Gaussian distribution.

And by inverting the inequality \(-z_{1-\alpha/2} \leq \sqrt{n}\frac{\bar{Y}_n - p}{\sqrt{\hat{p}(1-\hat{p})}} \leq z_{1-\alpha/2}\), we find the confidence interval \[\bar{Y}_n - \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}} z_{1-\alpha/2} \leq p \leq \bar{Y}_n + \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}} z_{1-\alpha/2}\]

And the observed confidence interval is \[\Big[ \hat{p} - \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}} z_{1-\alpha/2}, \hat{p} + \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}} z_{1-\alpha/2} \Big]\]


Compute the observed confidence interval for the female proportion.

Hint: you can use the qnorm() function to compute quantiles of the Gaussian distribution.

Solution

# sample size
n_obs <- length(female)

# empirical mean
hat_p <- mean(female)

# confidence interval risk 5%
alpha <- 0.05

# quantile of level 1-alpha/2 for the N(0,1) Gaussian distribution
z_quant <- qnorm(1-alpha/2)

# lower bound
lower_bound <- hat_p - z_quant * sqrt(hat_p * (1 - hat_p)) / sqrt(n_obs)

# upper bound
upper_bound <- hat_p + z_quant * sqrt(hat_p * (1 - hat_p)) / sqrt(n_obs)

The observed confidence interval is \([\) 0.4188961, 0.5811039 \(]\).

Reminder: we cannot say that \(p\) is in the observed confidence interval \([\) 0.4188961, 0.5811039 \(]\) with probability 95%; however, we can say that if we repeat the sampling procedure numerous times, the observed confidence interval will contain the population proportion \(p\) 95% of the time.



Footnotes

  1. as long as the variance of this distribution is finite, i.e. \(\sigma^2<+\infty\)

  2. a.k.a. the “chi square” distribution

  3. i.e. the sum of squared deviations from the mean divided by \(n-1\)

  4. per the Cochran theorem

  5. e.g. \(p\) for the Bernoulli distribution, or \(\mu\) and \(\sigma\) for the Gaussian distribution

  6. The qqplot() function also exists but requires providing the theoretical quantiles as input.

  7. we could invert the encoding (1 = “male” and 0 = “female”) and study the proportion of males, which is equivalent since we have only two classes of individuals

  8. It can be proven (this is a consequence of Slutsky's theorem).