# required packages
library(tidyverse)
Tutorial 2: Simulation, Estimation, Sampling variability and Robustness
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
\[ \def\BB{{\mathcal{B}}} \def\EE{{\mathbb{E}}} \def\NN{{\mathcal{N}}} \def\poi{{\mathcal{P}}} \def\PP{{\mathbb{P}}} \def\TT{{\mathcal{T}}} \def\UU{{\mathcal{U}}} \def\VV{{\mathbb{V}}} \]
Requirements
We first load the `tidyverse` meta-package that we will be needing.
Note: we will only need the `ggplot2` and `dplyr` packages from the `tidyverse` meta-package. Thus, if you encounter issues when installing `tidyverse`, you can install only these two packages, and load them with the commands `library(ggplot2)` and `library(dplyr)`.
Some custom functions are available in this file. You can load them in R with the following command:
source("https://gitbio.ens-lyon.fr/LBMC/hub/formations/ens_l3_stats/-/raw/main/tutorial_2_estimation_simulation/utils.R?inline=false")
Standard probability distributions and simulations
First, we will learn how to generate random values from a given probability distribution, i.e. draw observations of a random variable following a given probability distribution.
What is a collection of observations from a random variable? How would you define it in a formal way?
Solution
Such a collection of observations from a random variable is called a sample.
Here is a formal definition: assume we have a random variable \(X\) following a given probability distribution. A \(n\)-sample associated to \(X\) is a collection of \(n\) random variables \((X_i)_{i=1,...,n}\) independent and identically distributed (i.i.d.) according to the probability distribution of \(X\). The corresponding observed sample \((x_i)_{i=1,...,n}\) contains the actual observed value \(x_i\) for each random variable \(X_i\). It corresponds to repeated measures of the variable \(X\) on different individuals selected at random.
In the following, we will simulate the observed values \((x_i)_{i=1,...,n}\) according to a generative model, i.e. when assuming that \(X\) follows a known and given probability distribution among standard probability distributions.
Discrete probability distributions
What discrete probability distributions can you think of? What are they useful to model?
Solution
Bernoulli (binary outcome), Binomial (counting binary outcomes), Multinomial (multi-categorical outcome), Poisson (counting number of discrete occurrences),
and more: Geometric, Negative Binomial, Hypergeometric, etc. (that we may not see)
Use R functions to generate random values drawn from the Bernoulli, Binomial and Poisson distributions. Compute the corresponding empirical mean and variance, and compare these to their theoretical values.
Hint:
# Bernoulli B(0.3)
obs <- rbinom(n = 100, size = 1, prob = 0.3)
# Binomial B(10, 0.3)
obs <- rbinom(n = 100, size = 10, prob = 0.3)
# Poisson P(5)
obs <- rpois(n = 100, lambda = 5)
# empirical mean
mean(obs)
# empirical variance
var(obs)
Solution
- Bernoulli \(\BB(0.3)\): mean \(= 0.3\) and variance \(= 0.3 \times (1 - 0.3) =\) 0.21
# random data
obs <- rbinom(n = 100, size = 1, prob = 0.3)
# empirical mean
mean(obs)
[1] 0.32
# empirical variance
var(obs)
[1] 0.219798
- Binomial \(\BB(10, 0.3)\): mean \(= 10 \times 0.3 =\) 3 and variance \(= 10 \times 0.3 \times (1 - 0.3) =\) 2.1
# random data
obs <- rbinom(n = 100, size = 10, prob = 0.3)
# empirical mean
mean(obs)
[1] 3.06
# empirical variance
var(obs)
[1] 1.794343
- Poisson \(\poi(5)\): mean \(= 5\) and variance \(= 5\)
# random data
obs <- rpois(n = 100, lambda = 5)
# empirical mean
mean(obs)
[1] 4.76
# empirical variance
var(obs)
[1] 5.234747
Continuous probability distributions
What are the continuous probability distributions that you could think of?
Solution
Uniform, Gaussian,
Student, \(\chi^2\), Fisher (c.f. later)
and more: Exponential, Gamma, Beta, Dirichlet, etc. (that we may not see)
Use R functions to generate random values drawn from the Uniform and Gaussian distributions. Compute the corresponding empirical mean and variance, and compare these to their theoretical values.
Hint:
# Uniform U[0, 10]
obs <- runif(n = 100, min = 0, max = 10)
# Gaussian N(3, 4)
obs <- rnorm(n = 100, mean = 3, sd = 4)
Solution
- Uniform \(\UU(0, 10)\): mean \(= \frac{0 + 10}{2} =\) 5 and variance \(= \frac{(10-0)^2}{12} \approx\) 8.33
# random data
obs <- runif(n = 100, min = 0, max = 10)
# empirical mean
mean(obs)
[1] 5.568295
# empirical variance
var(obs)
[1] 8.630765
- Gaussian \(\NN(3, 4)\): mean \(= 3\) and standard deviation \(= 4\) (i.e. variance \(= 16\))
# random data
obs <- rnorm(n = 100, mean = 3, sd = 4)
# empirical mean
mean(obs)
[1] 3.368577
# empirical variance
var(obs)
[1] 14.7001
Sampling variability
What is the link between the empirical mean/variance and the theoretical mean/variance respectively?
Solution
The empirical mean and variance are estimators of their theoretical counterparts. The observed value for the mean (resp. the variance) on a given observed sample is an estimation of the theoretical mean (resp. variance).
Why is the empirical mean (resp. variance) not exactly equal to the theoretical mean (resp. variance) in the simulations in the previous section?
Solution
Because of the variability due to the randomness of the sampling.
We generate multiple random samples of size 100 following the same \(\UU[0,10]\) uniform distribution and we compute the empirical mean for each sample. What do you see?
mean_estim <- replicate(1000, mean(runif(n = 100, min = 0, max = 10)))
hist(mean_estim)
Solution
We drew the histogram of all values of the empirical mean computed on the different samples.
The observed values for the empirical mean vary across samples. This phenomenon is called sampling variability. And this variability due to the randomness of the samples follows a specific pattern (c.f. later).
The sample corresponds to a subset of the entire population. It is gathered by selecting random individuals because we generally cannot measure the quantity of interest on the entire population. Multiple samples will generally give different values for the estimation of a given quantity.
Asymptotic behavior
Estimating the mean
We want to generate a random sample from a given probability distribution and compute the empirical mean for different sample sizes.
Proceed and observe the value of the empirical mean as the sample size grows.
What do you see? What is the name of this result?
Hint: You can use the following code, just replace `xxxx` by the corresponding sample generation using the R generator for standard probability distributions (see `rnorm()`, `rpois()`, `rbinom()`, `rexp()`, etc.), and replace `yyyy` by the theoretical value of the mean that you used to simulate the data.
# define increasing sample sizes (from n = 10 to n = 10000)
sample_size <- seq(10, 10000, by = 10)
# generate the samples for each sample size and compute the empirical mean
mean_estim <- sapply(sample_size, function(n) mean( xxxx ))
# define the theoretical mean
theoretical_mean <- yyyy
# plot the empirical mean values depending on the sample size
plot_mean_vs_sample_size(sample_size, mean_estim, theoretical_mean)
Note: the `plot_mean_vs_sample_size()` custom function should be available if you run the `source("https://...")` command from the Requirements section.
Solution
For the Gaussian distribution, `rnorm(n)` generates a sample of size `n` from the standard Gaussian distribution \(\NN(0,1)\) with theoretical mean = 0.
sample_size <- seq(10, 10000, by = 10)
mean_estim <- sapply(sample_size, function(n) mean( rnorm(n) ))
theoretical_mean <- 0
plot_mean_vs_sample_size(sample_size, mean_estim, theoretical_mean)
For the Bernoulli distribution, `rbinom(n, 1, 0.3)` generates a sample of size `n` from the Bernoulli distribution \(\BB(0.3)\) with theoretical mean = 0.3.
sample_size <- seq(10, 10000, by = 10)
mean_estim <- sapply(sample_size, function(n) mean( rbinom(n, 1, 0.3) ))
theoretical_mean <- 0.3
plot_mean_vs_sample_size(sample_size, mean_estim, theoretical_mean)
For the Poisson distribution, `rpois(n, 10)` generates a sample of size `n` from the Poisson distribution \(\poi(10)\) with theoretical mean = 10.
sample_size <- seq(10, 10000, by = 10)
mean_estim <- sapply(sample_size, function(n) mean( rpois(n, 10) ))
theoretical_mean <- 10
plot_mean_vs_sample_size(sample_size, mean_estim, theoretical_mean)
In all cases, the empirical mean converges toward the value of the theoretical mean when the sample size grows. This result is known as the law of large numbers.
Quantifying the uncertainty regarding the empirical mean
Now we generate multiple random samples of a fixed size \(n = 100\) from the \(\UU[0,10]\) uniform distribution, and we compute the empirical mean on each sample.
# generate 1000 samples (from the uniform distribution U[0,10]) of size 100
samples <- replicate(1000, runif(n = 100, min = 0, max = 10), simplify = FALSE)
# compute the empirical mean for each sample
mean_estim <- sapply(samples, mean)
How could we represent the empirical distribution of the empirical means across all samples?
Solution
We can use a histogram or represent the empirical density.
Represent the empirical distribution of the empirical mean. What can you say about this distribution?
Solution
# plot the empirical distribution
hist(mean_estim)
This empirical distribution seems to look like a Gaussian density.
Assume we generate random data from a given probability distribution with expectation \(\mu\) and variance \(\sigma^2\). What would be the values of the expectation \(\EE(\bar{X}_n)\) and variance \(\VV(\bar{X}_n)\) of the empirical mean \(\bar{X}_n\)?
Solution
\(\EE(\bar{X}_n) = \mu\) and \(\VV(\bar{X}_n) = \frac{\sigma^2}{n}\)
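We can check these two identities by simulation (a quick sketch for the \(\UU[0,10]\) uniform case, where \(\mu = 5\) and \(\sigma^2 = \frac{(10-0)^2}{12} = 100/12\); the number of replications and the seed are arbitrary choices for the illustration):

```r
# sketch: many samples of size n from U[0, 10], where mu = 5
# and sigma^2 = (10 - 0)^2 / 12 = 100/12
set.seed(42) # for reproducibility
n <- 100
mean_estim <- replicate(10000, mean(runif(n, min = 0, max = 10)))
# should be close to mu = 5
mean(mean_estim)
# should be close to sigma^2 / n = (100/12) / 100
var(mean_estim)
```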
We now generate random samples from the \(\poi(5)\) Poisson distribution.
# generate 1000 samples (from the Poisson distribution P(5)) of size 100
samples <- replicate(1000, rpois(n = 100, lambda = 5), simplify = FALSE)
# compute the empirical mean for each sample
mean_estim <- sapply(samples, mean)
What are the values of \(\mu\) and \(\sigma\) in this case?
Solution
\(\mu = 5\) and \(\sigma = \sqrt{5}\)
Represent the empirical distribution of the empirical mean across the different samples and compare it to the \(\NN(\mu,\sigma^2/n)\) Gaussian distribution density.
What can you say? What result are we illustrating?
Hint: you can use the custom function `hist_mean_vs_gaussian_density(mean_estim, n, mu, sigma)`.
Solution
hist_mean_vs_gaussian_density(mean_estim, n = 100, mu = 5, sigma = sqrt(5))
The empirical distribution of the empirical mean \(\bar{X}_n\) is very close to a Gaussian distribution \(\NN(\mu,\sigma^2/n)\).
When the sample size \(n\) grows, the distribution of \(\bar{X}_n\) converges towards a \(\NN(\mu,\sigma^2/n)\) Gaussian distribution, whatever the probability distribution underlying the generation of the sample1. This result is given by the central limit theorem. Equivalently, the distribution of the quantity \(\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma}\) converges towards a \(\NN(0,1)\) Gaussian distribution.
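This convergence of the standardized quantity can also be illustrated by simulation (a sketch: the exponential distribution is an assumption chosen here only as a non-Gaussian example, with \(\mu = 1\) and \(\sigma = 1\)):

```r
# sketch: standardized empirical means sqrt(n) * (X_bar - mu) / sigma
# for an Exp(1) distribution (mu = 1, sigma = 1)
set.seed(42) # for reproducibility
n <- 1000
z <- replicate(5000, sqrt(n) * (mean(rexp(n, rate = 1)) - 1) / 1)
# close to a standard Gaussian: mean ~ 0 and variance ~ 1
mean(z)
var(z)
# compare the empirical distribution to the N(0,1) density
hist(z, freq = FALSE)
curve(dnorm(x), col = "red", add = TRUE)
```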
Now, generate the samples from a \(\BB(0.15)\) Bernoulli distribution but decrease the sample size \(n\).
Again, represent the empirical distribution of the empirical mean across the different samples and compare it to the \(\NN(\mu,\sigma^2/n)\) Gaussian distribution density.
What would you expect? What do you see?
Hint: what are the values of \(\mu\) and \(\sigma\) in this case?
Solution
For a Bernoulli distribution \(\BB(p)\), we have \(\mu = p\) and \(\sigma = \sqrt{p(1-p)}\)
# n = 50
samples <- replicate(
  1000, rbinom(n = 50, size = 1, prob = 0.15), simplify = FALSE)
mean_estim <- sapply(samples, mean)
hist_mean_vs_gaussian_density(
  mean_estim, n = 50, mu = 0.15, sigma = sqrt(0.15*(1-0.15)))
# n = 25
samples <- replicate(
  1000, rbinom(n = 25, size = 1, prob = 0.15), simplify = FALSE)
mean_estim <- sapply(samples, mean)
hist_mean_vs_gaussian_density(
  mean_estim, n = 25, mu = 0.15, sigma = sqrt(0.15*(1-0.15)))
# n = 10
samples <- replicate(
  1000, rbinom(n = 10, size = 1, prob = 0.15), simplify = FALSE)
mean_estim <- sapply(samples, mean)
hist_mean_vs_gaussian_density(
  mean_estim, n = 10, mu = 0.15, sigma = sqrt(0.15*(1-0.15)))
When the sample size decreases, the distribution of the empirical mean looks less and less like a Gaussian distribution.
In particular, in this case, values far away from the theoretical mean become more frequent than what would be expected with a Gaussian distribution.
It means that when the sample size is large, we can use the central limit theorem to approximate the distribution of the empirical mean by a Gaussian distribution, but when the sample size is too small, we cannot.
(Optional) Quantifying the uncertainty regarding the empirical variance
We generate multiple random samples of a fixed size \(n = 10\) from a \(\NN(\mu,\sigma^2)\) Gaussian distribution with \(\mu = 3\) and \(\sigma^2 = 4\), and we compute the empirical variance on each sample.
# generate 1000 samples (from the Gaussian distribution N(3,4)) of size 10
samples <- replicate(
  1000, rnorm(n = 10, mean = 3, sd = sqrt(4)), simplify = FALSE)
# compute the empirical variance for each sample
var_estim <- sapply(samples, var)
Represent the empirical distribution of the empirical variance across the different samples and compare it to a \(\chi^2(n-1)\) distribution2 density.
What can you say? What result are we illustrating?
Hint: you can use the custom function `hist_variance_vs_chi2(var_estim, n, sigma)`.
Solution
hist_variance_vs_chi2(var_estim, n = 10, sigma = sqrt(4))
The probability distribution of the empirical variance is a rescaled \(\chi^2(n-1)\) distribution.
In particular, the empirical variance estimator \(S^2\) verifies \(\frac{n-1}{\sigma^2}S^2 \sim \chi^2(n-1)\).
Note: \(S^2\) is the unbiased estimator of the variance, c.f. later.
Why is the empirical variance following a \(\chi^2\) distribution in this case?
Solution
The empirical variance is (with a scaling factor3) a sum of squared Gaussian random variables, because we assumed that the sample is a collection of observations of independent and identically distributed Gaussian random variables. And a sum of squared standardized Gaussian random variables follows a \(\chi^2\) distribution; centering with the empirical mean \(\bar{X}_n\) costs one degree of freedom, hence the \(\chi^2(n-1)\) distribution4.
This result is exact and not asymptotic thanks to the Gaussian assumption on the data.
(Optional) Combining the uncertainty regarding the empirical mean and the empirical variance
Again, we generate multiple random samples of a fixed size \(n = 10\) from a \(\NN(\mu,\sigma^2)\) Gaussian distribution with \(\mu = 3\) and \(\sigma^2 = 4\), and we now compute the empirical mean \(\bar{X}_n\) and variance \(S^2\) on each sample.
# generate 1000 samples (from the Gaussian distribution N(3,4)) of size 10
samples <- replicate(
  1000, rnorm(n = 10, mean = 3, sd = sqrt(4)), simplify = FALSE)
# compute the empirical mean for each sample
mean_estim <- sapply(samples, mean)
# compute the empirical variance for each sample
var_estim <- sapply(samples, var)
Represent the empirical distribution of the quantity \(\sqrt{n}\frac{\bar{X}_n - \mu}{\sqrt{S^2}}\) across the different samples and compare it to a \(\TT(n-1)\) Student distribution density and a \(\NN(0,1)\) Gaussian distribution density.
What can you say? What result are we illustrating?
Hint: you can use the custom function `hist_scaled_mean_vs_student(mean_estim, var_estim, n, mu, sigma)`.
Solution
hist_scaled_mean_vs_student(
  mean_estim, var_estim, n = 10, mu = 3, sigma = sqrt(4))
The probability distribution of the quantity \(\sqrt{n}\frac{\bar{X}_n - \mu}{\sqrt{S^2}}\) is a \(\TT(n-1)\) Student distribution.
Why is the quantity \(\sqrt{n}\frac{\bar{X}_n - \mu}{\sqrt{S^2}}\) following a \(\TT(n-1)\) Student distribution in this case?
Solution
Thanks to the Gaussian assumption on the data, the quantity \(\sqrt{n}\frac{\bar{X}_n - \mu}{\sqrt{S^2}}\) is the ratio of a standard Gaussian random variable and the square root of an independent \(\chi^2\) random variable scaled by its degrees of freedom, which is exactly the definition of the Student distribution.
This result is exact and not asymptotic thanks to the Gaussian assumption on the data.
(optional) Unbiased empirical variance
We generate a sample of a fixed size \(n = 50\) from the \(\NN(0,1)\) standard Gaussian distribution, and we compute the empirical variance on this sample with the `var()` function and with \(\hat{\sigma}^2_n = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2\).
# generate a sample
obs <- rnorm(50)
# compute the empirical variance
var(obs)
[1] 1.173189
# compute hat_sigma2_n
sum((obs - mean(obs))^2) / length(obs)
[1] 1.149725
Can you explain the difference between the value computed by the `var()` function and the value of \(\hat{\sigma}^2_n\)?
Solution
The `var()` function computes the unbiased estimation of the variance: \(S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2\)
# compute the empirical variance
var(obs)
[1] 1.173189
# compute S^2
sum((obs - mean(obs))^2) / (length(obs) - 1)
[1] 1.173189
We can show that \(\EE[\hat{\sigma}^2_n] = \frac{n-1}{n}\sigma^2 \ne \sigma^2\) where \(\sigma^2\) is the true value of the variance, whereas \(\EE[S^2] = \sigma^2\).
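A quick simulation sketch illustrating this bias: we average both estimators over many small samples from \(\NN(0,1)\) (the sample size \(n = 5\) and the number of replications are arbitrary choices for the illustration):

```r
# sketch: averaging both variance estimators over many small samples
set.seed(42) # for reproducibility
n <- 5
hat_sigma2 <- replicate(20000, {
  obs <- rnorm(n)
  sum((obs - mean(obs))^2) / n # biased estimator hat_sigma2_n
})
s2 <- replicate(20000, var(rnorm(n))) # unbiased estimator S^2
# E[hat_sigma2_n] = (n-1)/n * sigma^2 = 0.8 here, whereas E[S^2] = sigma^2 = 1
mean(hat_sigma2)
mean(s2)
```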
Estimation
In practice (when doing scientific research, e.g. analyzing some data), we generally cannot collect multiple samples. Indeed, it is generally not possible to generate/collect as much data as we want, potentially in multiple samples, like we did previously with simulations. We have a sample of given and fixed size, and we must work with it.
In that case, how can we quantify the variability and uncertainty regarding a quantity or character of interest with a single sample (e.g. using a single value of the empirical mean and of the empirical variance)?
Solution
To quantify the variability and uncertainty regarding a quantity or character, we have to rely on statistical models, that are either pertinent regarding the data characteristics and/or that rely on statistical approximation results.
In this section, we will do a real data analysis. We will work with the `penguins` dataset from the `palmerpenguins` package [1].
[1] Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0 (https://allisonhorst.github.io/palmerpenguins/) doi: 10.5281/zenodo.3960218
We first load the dataset. It contains various physical characteristics of penguins sampled from different populations (different species in this case).
library(palmerpenguins)
data(penguins)
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
We will focus on the “Adelie” penguin species. We select the corresponding individuals and remove any individual with missing data.
adelie_penguins <- penguins %>% filter(species == "Adelie") %>% drop_na()
From now on, we will work with the `adelie_penguins` data table.
adelie_penguins
# A tibble: 146 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
7 Adelie Torgersen 39.2 19.6 195 4675
8 Adelie Torgersen 41.1 17.6 182 3200
9 Adelie Torgersen 38.6 21.2 191 3800
10 Adelie Torgersen 34.6 21.1 198 4400
# ℹ 136 more rows
# ℹ 2 more variables: sex <fct>, year <int>
How many observations are there in the `adelie_penguins` dataset?
Solution
nrow(adelie_penguins)
[1] 146
Formalization of the statistical problem
Say we want to study a certain attribute/feature of the Adelie penguins, e.g. their body mass.
How would we proceed? Formalize the problem from the statistical point of view.
Solution
We would collect a sample and measure the attribute/feature of interest on each individual in the collected sample. Then, we can compute the empirical mean for this attribute/feature, compute a confidence interval for the population mean, and much more.
From a formal statistical point of view, we say that the value of the attribute/feature of interest measured on a given individual penguin is given by a random variable \(X\). The sample will be a collection of independent and identically distributed (i.i.d.) random variables \((X_i)_{i=1,...,n}\) following the same probability distribution as \(X\). The observed values for the \(X_i\)'s in the sample are noted \((x_i)_{i=1,...,n}\).
Sampling uncertainty quantification
How can we quantify the uncertainty and variability regarding the attribute/feature of interest in the entire population using a single sample?
In particular, which tools can we use to construct a probability distribution regarding the empirical mean for this attribute/feature of interest?
Solution
We will need to construct a statistical model. Depending on the situation, we can use two different approaches (in general).
- Depending on the characteristics of the attribute/feature of interest, we can use a statistical model, i.e. we can assume that the corresponding random variable follows a given probability distribution: e.g. Bernoulli if we are studying a binary feature, Poisson if we are counting occurrences of a phenomenon, Gaussian if we are studying a numerical feature.
Based on this assumption, we can sometimes determine the explicit probability distribution for the empirical mean (e.g. normalized binomial if the \(X_i\) are Bernoulli, Gaussian if the \(X_i\) are Gaussian) and use this distribution to quantify uncertainty regarding the attribute/feature of interest.
- If the sample size is large enough, we can use an asymptotic result like the central limit theorem to approximate the probability distribution of the empirical mean.
However, in all cases, the true values of the parameters of the underlying distributions5 are unknown (contrary to simulation studies where we know them), and we can only work with their estimations.
What about the empirical variance? How can we derive its probability distribution?
Solution
Regarding the empirical variance, we generally need a statistical model. For instance, we saw that assuming that the data are Gaussian, i.e. assuming that the random variable measuring the attribute/feature of interest follows a Gaussian distribution, we can derive that the empirical variance follows a \(\chi^2\) distribution.
However, even with a statistical model, it is sometimes very complicated (or even impossible) to derive the probability distribution for the empirical variance.
It can be sometimes possible to derive asymptotic convergence for the probability distribution of the empirical variance but it can be difficult outside the Gaussian framework.
Descriptive statistics
We will focus on the body mass of the Adelie penguins (column `body_mass_g` in the `adelie_penguins` table).
body_mass <- adelie_penguins$body_mass_g
Compute the empirical mean and empirical variance for the Adelie penguin body mass, manually and with the corresponding R functions.
You can also compute the empirical standard deviation.
For the variance, should we divide by \(n\) or \(n-1\)?
Solution
# number of observations
n_obs <- length(body_mass)
# empirical mean
sum(body_mass) / n_obs
[1] 3706.164
mean(body_mass)
[1] 3706.164
# empirical variance
sum((body_mass - mean(body_mass))^2) / n_obs
[1] 208891.8
sum((body_mass - mean(body_mass))^2) / (n_obs - 1)
[1] 210332.4
var(body_mass)
[1] 210332.4
# standard deviation
sqrt(var(body_mass))
[1] 458.6201
sd(body_mass)
[1] 458.6201
Reminder: the `var()` function computes the unbiased estimator for the variance, i.e. `sum((body_mass - mean(body_mass))^2) / (n_obs - 1)`, whereas `sum((body_mass - mean(body_mass))^2) / n_obs` is the biased estimator for the variance (c.f. previous section).
We should use the unbiased estimator (i.e. dividing the sum of squared deviations from the mean by \(n-1\)).
What descriptive statistics and corresponding graphical representations can we use to evaluate the distribution of the observations in the sample?
Solution
We can compute the usual empirical quantiles, and use the boxplot visualization:
boxplot(body_mass, horizontal=TRUE)
We can also represent the empirical distribution of the observed values with a histogram:
hist(body_mass)
Relevance of the model
To go further, we will need to use a statistical model regarding our data. Here, the body mass is a numerical and continuous feature.
In this case, if the empirical distribution is symmetric, the default model would be to assume a Gaussian model for the feature of interest, i.e. the body mass here, because it gives access to various statistical tools (c.f later).
To use a Gaussian model, we have to verify that the empirical distribution of the observations is indeed symmetrical and looks like a Gaussian distribution.
We can use various tools to do so. How could we proceed?
Solution
We do not have access to the true probability distribution of the feature of interest because we cannot measure it on the entire population. Thus, we need to verify that the empirical distribution of the observations in the sample is close to a Gaussian distribution.
- We can compare the empirical cumulative distribution function of the standardized observations to the standard Gaussian cumulative distribution function (you can use the custom function `plot_ecdf()`):
plot_ecdf(body_mass)
- We can compare the empirical distribution of the standardized observations to the density of the standard Gaussian distribution (you can use the custom function `hist_obs_vs_gaussian_density()`):
hist_obs_vs_gaussian_density(body_mass)
- We can compare the empirical quantiles of the standardized observations to the theoretical quantiles of the standard Gaussian distribution using a "quantile-quantile plot" (a.k.a. qqplot). You can do this with the `qqnorm()` function6. The points should approximately align on the line \(y = x\), meaning that the empirical quantiles approximately correspond to the standard Gaussian quantiles.
qqnorm(scale(body_mass))
abline(a=0, b=1, col = "red") # y=x line
Note: the standardized observations can be computed with the `scale()` function (e.g. `scale(body_mass)`).
Here, we can say that the observations are approximately Gaussian.
These results are visual and not quantitative (some tests exist to verify whether a set of observations is significantly not Gaussian, but that goes beyond the scope of this tutorial).
Important: a model is a simplified and quantified version of the reality (driven by complex and random phenomena and processes). It will generally never be true, but one can hope that it is close enough to the reality to be used in a statistical analysis to derive pertinent results and conclusions about the studied feature.
Confidence interval
Since we have a Gaussian model, we will use it to build a confidence interval for the population average body mass of the Adelie penguin (which is unknown but that we are studying).
What are the limitations with the empirical mean estimation?
Solution
The empirical mean is a single value that is expected not to be very far (depending on the sample size) from the population average, but it does not give any insight about how far the estimation actually is from the true value of the population average.
How can we proceed to quantify the uncertainty about how far the observed empirical mean is from the true population average value?
Solution
We can construct an interval estimator that would contain the true value of the population average with high probability, called a confidence interval.
What are the boundaries of a confidence interval?
Solution
The boundaries of a confidence interval are random variables.
From the Gaussian model for the body mass, build the confidence interval for the population average that we note \(\mu\).
Solution
We assumed that our sample \((X_i)_{i=1,...,n}\) is i.i.d. following a Gaussian distribution of unknown population mean \(\mu\) and variance \(\sigma^2\).
From this, we can derive the following:
\(\bar{X}_n \sim \NN(\mu, \sigma^2/n)\)
\(\frac{n-1}{\sigma^2} S^2 \sim \chi^2(n-1)\)
Consequence: we have that \(\sqrt{n} \frac{\bar{X}_n - \mu}{\sqrt{S^2}} \sim \TT(n-1)\), i.e. a Student distribution with \(n-1\) degrees of freedom
Eventually, we can use the Student probability function to write: \[\PP\Big( -t_{n-1,1-\alpha/2} \leq \sqrt{n} \frac{\bar{X}_n - \mu}{\sqrt{S^2}} \leq t_{n-1,1-\alpha/2} \Big) = 1 - \alpha\] for a given \(0 < \alpha < 1\) where \(t_{n-1,1-\alpha/2}\) is the quantile of level \(1-\alpha/2\) of the \(\TT(n-1)\) Student distribution.
Here is a reminder about the quantile definition for a symmetric confidence interval and a symmetric probability distribution.
Now we can invert the inequalities to write an interval with random boundaries (because they depend on the estimators \(\bar{X}_n\) and \(S^2\)):
\[\PP\Big(\bar{X}_n - \frac{\sqrt{S^2}}{\sqrt{n}}t_{n-1,1-\alpha/2} \leq \mu \leq \bar{X}_n + \frac{\sqrt{S^2}}{\sqrt{n}}t_{n-1,1-\alpha/2} \Big) = 1 - \alpha\] The confidence interval for \(\mu\) is then \(\Big[ \bar{X}_n - \frac{\sqrt{S^2}}{\sqrt{n}}t_{n-1,1-\alpha/2}, \bar{X}_n + \frac{\sqrt{S^2}}{\sqrt{n}}t_{n-1,1-\alpha/2}\Big]\).
The observed confidence interval is \(\Big[ \bar{x}_n - \frac{\sqrt{s^2}}{\sqrt{n}}t_{n-1,1-\alpha/2}, \bar{x}_n + \frac{\sqrt{s^2}}{\sqrt{n}}t_{n-1,1-\alpha/2}\Big]\) where \(\bar{x}_n\) is the observed estimation of the empirical mean on the sample and \(s^2\) the observed estimation of the empirical variance on the sample.
Compute the observed confidence interval for the body mass.
Hint: you can use the `qt()` function to compute quantiles of the Student distribution.
Solution
# sample size
n_obs <- length(body_mass)

# empirical mean
x_bar <- mean(body_mass)

# empirical variance
s2 <- var(body_mass)

# confidence interval risk 5%
alpha <- 0.05

# quantile of level 1-alpha/2 for the T(n-1) distribution
t_quant <- qt(1 - alpha/2, df = n_obs - 1)

# lower bound
lower_bound <- x_bar - t_quant * sqrt(s2) / sqrt(n_obs)

# upper bound
upper_bound <- x_bar + t_quant * sqrt(s2) / sqrt(n_obs)
The observed confidence interval is \([3631.15, 3781.18]\).
Is it true that the population average of the body mass is in the interval \([3631.15, 3781.18]\) with a 95% probability?
Solution
No, it is not true.
The confidence interval defined with random variable boundaries contains the population average with a 95% probability.
But the observation of this interval is a deterministic value given the observations in the sample. There is no random variable in the interval \(3631.15 \leq \mu \leq 3781.18\) since \(\mu\) is also a fixed (though unknown) value. Thus, it makes no sense to speak about a probability for the true parameter \(\mu\) to be in the observed interval.
However, what is true is that if we repeat the sampling numerous times, \(\mu\) will be in the observed confidence interval 95% of the time, but we generally cannot do that (unless we are working with simulations) because we generally have access to a limited number of samples.
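We can check this frequentist interpretation by simulation (a sketch under a Gaussian model where, contrary to a real data analysis, we know \(\mu\); the values of `mu`, `sigma` and `n` are arbitrary choices for the illustration): generate many samples, compute the observed 95% confidence interval on each, and count how often \(\mu\) falls inside.

```r
# sketch: coverage of the 95% confidence interval under a Gaussian model
set.seed(42) # for reproducibility
mu <- 3; sigma <- 2; n <- 50; alpha <- 0.05
covered <- replicate(10000, {
  obs <- rnorm(n, mean = mu, sd = sigma)
  half_width <- qt(1 - alpha/2, df = n - 1) * sd(obs) / sqrt(n)
  (mean(obs) - half_width <= mu) && (mu <= mean(obs) + half_width)
})
# proportion of observed intervals containing mu, close to 1 - alpha = 0.95
mean(covered)
```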
(optional) Robustness
The robustness of a statistical method (e.g. an estimation method) is its ability to not be affected by outliers, i.e. data points that significantly differ from the other observations.
Here is some information about the distribution of the observations in the sample regarding the body mass:
summary(body_mass)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2850 3362 3700 3706 4000 4775
hist(body_mass)
We have two new observations for penguins with weights of 6000g and 6500g.
new_body_mass <- c(body_mass, 6000, 6500)
Why can these two new observations be considered as outliers in the sample?
Solution
These two new observations are significantly different from all the others.
hist(new_body_mass)
What could be the effect of outliers on the mean and variance estimators? Compare the mean and variance estimation in the original sample and in the new sample.
Solution
Outliers can modify the value of the estimations regarding the mean and the variance.
# mean
mean(body_mass)
[1] 3706.164
mean(new_body_mass)
[1] 3740.541
# mean ratio
(mean(new_body_mass) - mean(body_mass)) / mean(body_mass)
[1] 0.0092754
# variance
var(body_mass)
[1] 210332.4
var(new_body_mass)
[1] 295173.5
# variance ratio
(var(new_body_mass) - var(body_mass)) / var(body_mass)
[1] 0.4033667
# standard deviation
sd(body_mass)
[1] 458.6201
sd(new_body_mass)
[1] 543.2987
# standard deviation ratio
(sd(new_body_mass) - sd(body_mass)) / sd(body_mass)
[1] 0.1846378
Both estimations are affected.
The effect on the empirical variance is bigger, since it becomes about 40.3% larger, whereas the empirical mean only becomes about 0.93% larger when accounting for the outliers.
Even the standard deviation (which is the square root of the variance and thus homogeneous to the empirical mean) becomes about 18.5% larger when accounting for the outliers.
Thus, the effect of the outliers in our sample is bigger on dispersion statistics, like the empirical variance and standard deviation, than on the empirical mean, which is a position statistic.
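As a complementary check on this contrast, we can look at a position statistic known to be robust, the median, on the two samples defined above (a sketch; the exact values depend on the data, so no output is shown here):

```r
# the median, a robust position statistic, is barely affected by the two outliers
median(body_mass)
median(new_body_mass)
# relative difference of the medians
(median(new_body_mass) - median(body_mass)) / median(body_mass)
```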
When computing a confidence interval, we can check its width: the narrower the interval, the more precise the estimation.
The width is the length of the interval, computed as the difference between the upper and the lower boundaries.
What is the width of the confidence interval computed in the previous section?
Solution
The confidence interval width is two times \(\frac{\sqrt{s^2}}{\sqrt{n}}t_{n-1,1-\alpha/2}\).
What could affect the width of this confidence interval?
Solution
If the empirical variance is larger, the width will be larger since it depends on \(\sqrt{s^2}\).
If the sample size is larger, the width will be smaller since it depends on \(\frac{1}{\sqrt{n}}\).
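We can illustrate the \(\frac{1}{\sqrt{n}}\) dependency numerically: quadrupling the sample size halves the \(\sqrt{s^2/n}\) factor. The sketch below uses a hypothetical empirical variance value (not computed from the data):

```r
# width of the t-based interval as a function of the sample size,
# for a fixed, hypothetical empirical variance (illustration only)
s2_hyp <- 210000
ci_width <- function(n, alpha = 0.05) {
  2 * qt(1 - alpha/2, df = n - 1) * sqrt(s2_hyp / n)
}
ci_width(50)
ci_width(200)  # slightly less than half of ci_width(50)
```

The width for \(n = 200\) is slightly less than half the width for \(n = 50\) because the t quantile also shrinks a little as the degrees of freedom grow.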
What could affect the robustness of a confidence interval estimator?
Solution
Since outliers tend to increase the empirical variance, they increase the observed confidence interval width, and thus decrease the confidence interval quality.
We would then need to collect more data, i.e. get more observations in the sample (increase the sample size), to counterbalance the increase in empirical variance.
Compare the confidence interval when using the first sample body_mass and when using the second sample new_body_mass.
Solution
First sample:
# sample size
n_obs <- length(body_mass)
# empirical mean
x_bar <- mean(body_mass)
# empirical variance
s2 <- var(body_mass)
# confidence interval risk 5%
alpha <- 0.05
# quantile of level 1-alpha/2 for the T(n-1) distribution
t_quant <- qt(1 - alpha/2, df = n_obs - 1)
# lower bound
lower_bound <- x_bar - t_quant * sqrt(s2) / sqrt(n_obs)
# upper bound
upper_bound <- x_bar + t_quant * sqrt(s2) / sqrt(n_obs)
# observed confidence interval width
upper_bound - lower_bound
[1] 150.0357
Second sample:
# sample size
n_obs <- length(new_body_mass)
# empirical mean
x_bar <- mean(new_body_mass)
# empirical variance
s2 <- var(new_body_mass)
# confidence interval risk 5%
alpha <- 0.05
# quantile of level 1-alpha/2 for the T(n-1) distribution
t_quant <- qt(1 - alpha/2, df = n_obs - 1)
# lower bound
lower_bound <- x_bar - t_quant * sqrt(s2) / sqrt(n_obs)
# upper bound
upper_bound <- x_bar + t_quant * sqrt(s2) / sqrt(n_obs)
# observed confidence interval width
upper_bound - lower_bound
[1] 176.5127
(Optional) Sampling bias
We compute the average body mass for the male and female in the sample.
adelie_penguins %>% group_by(sex) %>% summarise(mean_body_mass = mean(body_mass_g))
# A tibble: 2 × 2
sex mean_body_mass
<fct> <dbl>
1 female 3369.
2 male 4043.
What could be the problem here when studying the body mass in this species, especially regarding the sample composition?
Solution
The average body mass of male individuals in the sample is higher than the average body mass of female individuals in the sample.
Then, if we sample only males, we risk over-estimating the Adelie penguin average body mass. If we sample only females, we risk under-estimating the Adelie penguin average body mass.
This phenomenon is called sampling bias. One must be careful that the sample is representative of the studied population.
Here is the composition of the sample:
table(adelie_penguins$sex)
female male
73 73
We should study the sex ratio for this species and verify that the sample is representative of the population (c.f. next section).
Confidence interval for the proportion
We now focus on the sex ratio in the Adelie penguin species, i.e. the proportion of females/males in the species (which varies across the animal kingdom and is not always around 50%-50%).
We encode the sex as a \(\{0, 1\}\) binary variable, indicating if the considered individual is a female (i.e. 1 = “female” and 0 = “male”). Thus, we will work on the estimation of the female proportion within this species 7.
We extract the corresponding data:
female <- as.integer(adelie_penguins$sex == "female")
female
[1] 0 1 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 0 1 1 0 1 0
[38] 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
[75] 1 0 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
[112] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 0 1 0
What is the estimated proportion of females in the sample?
Solution
The estimated proportion of females in the sample, a.k.a. the empirical frequency, corresponds to the empirical mean of the \(\{0, 1\}\) binary variable indicating if each individual is a female over the sample.
It can also be computed by summing the number of females in the sample and dividing by the sample size.
mean(female)
[1] 0.5
sum(female) / length(female)
[1] 0.5
We have 50% of females in this sample.
Formalize the statistical model that we will use.
Solution
We consider a Bernoulli model for the \(\{0,1\}\) female indicator since it is the only possible probability distribution for a binary value.
More formally, we denote by \(Y\sim\BB(p)\) the random variable indicating if a given Adelie penguin is a female, where \(p\) is the probability that the individual in question is a female.
Then \(p\) will also correspond to the proportion of females in the entire population.
And we have a sample \((Y_i)_{i=1,...,n}\) independent and identically distributed following a \(\BB(p)\) Bernoulli distribution.
From that sample, we want to get an estimation of \(p\).
The empirical mean or empirical frequency \(\bar{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i\) can also be noted \(\hat{p}\).
What is the probability distribution of the random variable \(B_n = \sum_{i=1}^n Y_i\) where the \(Y_i\) are independent and identically distributed variables following a \(\BB(p)\) Bernoulli distribution?
Solution
In this case, the probability distribution of \(B_n\) is known. It follows a \(\BB(n,p)\) Binomial distribution.
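We can convince ourselves of this result with a quick simulation, comparing sums of i.i.d. Bernoulli draws to direct draws from the Binomial distribution (arbitrary values of \(n\) and \(p\), chosen for illustration):

```r
# sum of n i.i.d. Bernoulli(p) variables vs direct Binomial(n, p) draws
set.seed(1)
n <- 20
p <- 0.3
b_sum <- replicate(10000, sum(rbinom(n, size = 1, prob = p)))
b_dir <- rbinom(10000, size = n, prob = p)
# both empirical means should be close to n * p = 6
c(mean(b_sum), mean(b_dir))
```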
We question the true sex ratio in the Adelie penguin population. How could we get more information about the proportion of females in the species based on its estimation in the sample than just a single estimated value?
Solution
We could compute a confidence interval for the female proportion in the population using its observed estimation \(\hat{p}\) in the sample.
What tool could we use to build a confidence interval for the proportion of females in the population?
\(Hint:\) we have two possibilities.
Solution
- Using the exact distribution (more complicated):
Since all \(Y_i\)s follow a \(\BB(p)\) Bernoulli distribution, we know that \(B_n = \sum_{i=1}^n Y_i\) follows a \(\BB(n,p)\) Binomial distribution. Thus the probability distribution of the quantity \(n \times \bar{Y}_n = n \times \hat{p} = B_n\) is known and we can try to compute the \(\BB(n,p)\) Binomial distribution quantiles \(u_1\) and \(u_2\) such that \[\PP\Big(u_1 \leq B_n \leq u_2\Big) = 1 - \alpha\]
But remember that \[\PP\Big(u_1 \leq B_n \leq u_2\Big) = \sum_{k = u_1}^{u_2} \PP(B_n = k)\] and \[\PP(B_n = k) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}\]
Thus, the computation of boundaries over \(p\) will be very complicated.
- Asymptotic approximation:
If the sample size is large enough, we could use the central limit theorem, which says that \[\sqrt{n}\frac{\bar{Y}_n - \mu}{\sigma} \sim_n \NN(0,1)\] where \(\mu\) is the expectation and \(\sigma\) the standard deviation of a variable following a \(\BB(p)\) Bernoulli distribution, which are?
If the sample size is too small, we cannot use this method and have no choice but to use the complicated exact one (c.f. previous point).
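Note that base R implements the exact approach for us: the binom.test() function inverts the Binomial distribution and returns the so-called Clopper-Pearson confidence interval. A sketch with the counts from our sample (73 females out of 146 penguins):

```r
# exact (Clopper-Pearson) confidence interval for the female proportion,
# using the observed counts: 73 females out of 146 penguins
binom.test(x = 73, n = 146, conf.level = 0.95)$conf.int
```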
Build the confidence interval for the population average proportion of females \(p\).
Solution
We assume that our sample is large enough and we use the Gaussian approximation.
Here, \(\mu = p\) and \(\sigma^2 = p(1-p)\), thus \[\sqrt{n}\frac{\bar{Y}_n - p}{\sqrt{p(1-p)}} \sim \NN(0,1)\]
Then we can write \[\PP\Big( -z_{1-\alpha/2} \leq \sqrt{n} \frac{\bar{Y}_n - p}{\sqrt{p(1-p)}} \leq z_{1-\alpha/2} \Big) = 1 - \alpha\] for a given \(0 < \alpha < 1\) where \(z_{1-\alpha/2}\) is the quantile of level \(1-\alpha/2\) of the \(\NN(0,1)\) standard Gaussian distribution.
By inverting the following inequality \[-z_{1-\alpha/2} \leq \sqrt{n} \frac{\bar{Y}_n - p}{\sqrt{p(1-p)}} \leq z_{1-\alpha/2}\] we can get a confidence interval for \(p\) whose boundaries depend on the random variable \(\bar{Y}_n\).
But we have a problem since it is very complicated to extract \(p\) from the expression \(\frac{\bar{Y}_n - p}{\sqrt{p(1-p)}}\).
How can we proceed to simplify the confidence interval computation for \(p\) when using the Gaussian approximation?
Solution
In the formulation \(\sqrt{n}\frac{\bar{Y}_n - \mu}{\sigma} \sim_n \NN(0,1)\), we can replace the standard deviation \(\sigma\) by its estimation \(\sqrt{\hat{p}(1-\hat{p})}\) (remember we have a Bernoulli model) without modifying the Gaussian approximation8 and then we get \[\sqrt{n}\frac{\bar{Y}_n - p}{\sqrt{\hat{p}(1-\hat{p})}} \sim_n \NN(0,1)\]
Thus, we can write \[\PP\Big( -z_{1-\alpha/2} \leq \sqrt{n}\frac{\bar{Y}_n - p}{\sqrt{\hat{p}(1-\hat{p})}} \leq z_{1-\alpha/2} \Big) = 1 - \alpha\] for a given \(0 < \alpha < 1\) where \(z_{1-\alpha/2}\) is the quantile of level \(1-\alpha/2\) of the \(\NN(0,1)\) standard Gaussian distribution.
And by inverting the inequality \(-z_{1-\alpha/2} \leq \sqrt{n}\frac{\bar{Y}_n - p}{\sqrt{\hat{p}(1-\hat{p})}} \leq z_{1-\alpha/2}\), we find the confidence interval \[\bar{Y}_n - \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}} z_{1-\alpha/2} \leq p \leq \bar{Y}_n + \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}} z_{1-\alpha/2}\]
And the observed confidence interval is \[\Big[ \hat{p} - \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}} z_{1-\alpha/2}, \hat{p} + \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}} z_{1-\alpha/2} \Big]\]
Compute the observed confidence interval for the female proportion.
Hint: you can use the qnorm() function to compute quantiles of the Gaussian distribution.
Solution
# sample size
n_obs <- length(female)
# empirical mean
hat_p <- mean(female)
# confidence interval risk 5%
alpha <- 0.05
# quantile of level 1-alpha/2 for the N(0,1) Gaussian distribution
z_quant <- qnorm(1 - alpha/2)
# lower bound
lower_bound <- hat_p - z_quant * sqrt(hat_p * (1 - hat_p)) / sqrt(n_obs)
# upper bound
upper_bound <- hat_p + z_quant * sqrt(hat_p * (1 - hat_p)) / sqrt(n_obs)
The observed confidence interval is \([\) 0.4188961, 0.5811039 \(]\).
Reminder: we cannot say that \(p\) is in the observed confidence interval \([\) 0.4188961, 0.5811039 \(]\) with probability 95%, but we can say that if we repeat the sampling procedure numerous times, the confidence interval will contain the population proportion \(p\) 95% of the time.
Footnotes
as long as the variance of this distribution is finite, i.e. \(\sigma^2<+\infty\)↩︎
a.k.a. “chi square” distribution↩︎
i.e. the sum of squared deviation with the mean value divided by \(n-1\)↩︎
per the Cochran theorem↩︎
e.g. \(p\) for Bernoulli distribution, or \(\mu\) and \(\sigma\) for the Gaussian distribution↩︎
The qqplot() function also exists but requires providing the theoretical quantiles as input.↩︎
We could invert the encoding (1 = “male” and 0 = “female”) and study the proportion of males, which is equivalent since we have only two classes of individuals.↩︎
It can be proven↩︎