install.packages("tidyverse") # to manipulate data and make plot
install.packages("performance") # to evaluate model
install.packages("FactoMineR") # for dimension reduction
install.packages("factoextra") # to plot dimension reduction output
6 Practice: Introduction to linear models and multiple testing
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
\[ \def\H{{\mathcal{H}}} \def\PP{{\mathbb{P}}} \def\EE{{\mathbb{E}}} \]
6.1 Introduction
Linear models are widely used in data analysis to relate variables and discover relationships between quantitative and/or qualitative measurements. Linear models can be used for modeling, prediction, data normalization, factor/variable effect on an outcome detection or quantification, etc, sometimes in combination with statistical tests. More generally, hypothesis testing is at the core of the scientific method, formulating hypotheses and then verifying them with experiments. In this context, understanding p-values and multiple testing is crucial to avoid the pitfall of false discoveries.
To manipulate these notions, this practical will focus on possible associations between genotype and morphological data measured in yeast, that were originally studied in the following publication:
[1] Nogami S, Ohya Y, Yvert G (2007) Genetic Complexity and Quantitative Trait Loci Mapping of Yeast Morphological Traits . PLOS Genetics 3(2): e31. https://doi.org/10.1371/journal.pgen.0030031
Yeast are “eukaryotic, single-celled microorganisms classified as members of the fungus kingdom” (per Wikipedia).
We will study the possible associations between the genotype of the considered yeast colonies and some morpholocigal traits directly measured on the yeast through image analysis, such as (from [1]) :
Throughout this practical, we will explore the following subjects:
- Descriptive statistics
- Statistical hypothesis testing and p-values
- Linear model and analysis of variance (ANOVA)
- Dimension reduction
- Multiple testing
6.2 Requirements
We will need the following packages. If not already done, you need to install the following packages:
Now, we can load them:
library(tidyverse) # to manipulate data and make plot
library(performance) # to evaluate model
library(FactoMineR) # for dimension reduction
library(factoextra) # to plot dimension reduction output
6.3 Reminder about statistical hypothesis testing and p-values
6.3.1 Statistical testing
6.3.1.1 Building the test
Confront two hypotheses, the null hypothesis \(\H_0\) vs the alternative hypothesis \(\H_1\).
Detail the general framework to build a statistical test?
Solution
- Compute the observed value \(t\) of a random variable, called a statistic, \(T\) using measured quantities on a sample of observations.
- Requirements: know the probability distribution of \(T\) assuming \(\H_0\) is true
- Verify if the value \(t\) is “probable” according to the distibution of \(T\) under \(\H_0\).
What are the possible answers of a statistical test?
Solution
- possible answers: “\(\H_0\) is false with a given risk” (reject \(\H_0\), the test result is significant) vs “the result is not significant” (given this sample, we don’t know if \(\H_0\) is true or false).
- the answer is never:
\(\H_0\) is true.
6.3.1.2 Getting the answer of the test
What is the p-value ?
Solution
p-value = conditional probability assuming \(\H_0\) that the statistic \(T\) is at least as extreme as the observed value \(t\)
For a bilateral test:
\[ p\text{-value} = \PP(T < -t \ \text{or}\ T > t) = \PP(\vert T\vert > \vert t\vert) = 1 - \PP(-t \leq T \leq t) \]
For a (right) uniteral test \[ p\text{-value} = \PP(T > t) = 1 - \PP(T \leq t) \]
Are the following statements true of false ?
the p-value is the estimated probability that \(\H_0\) is true
the p-value is the risk (estimated probability) to be wrong when discovering a significant effect (=rejecting \(\H_0\))
Solution
the p-value is the estimated probability that \(\H_0\) is true(FALSE)the p-value is the risk (estimated proabiility) to be wrong when discovering a significant effect (=rejecting \(\H_0\))(FALSE)
The p-value is a conditional probability assuming that \(\H_0\) is true.
Reasoning:
- assuming \(\H_0\) \(\to\) the statistics \(T\) follows given probability distribution
- “Inductive contraposition”: p-value is small \(\to\) observed value \(t\) for the statistics \(T\) is unlikely considering its probability distribution under \(\H_0\) \(\to\) it is unlikely that \(T\) follows this distribution \(\to\) it is unlikely that \(\H_0\) is true.
What are type I and type II errors?
Solution
Type I error: reject \(\H_0\) conditionally to the fact that \(\H_0\) is true.
Type II error: not reject \(\H_0\) conditionally to the fact that \(\H_0\) is false.
What are type I and type II risks?
Solution
Type I risk: \(\alpha = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is true})\)
Type II risk: \(\beta = \PP(\text{not reject } \H_0\ \vert\ \H_0 \text{ is false})\)
Power \(= 1 - \beta = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is false})\)
Are the following statements true of false ?
the type 1 risk \(\alpha\) is the probability that \(\H_0\) is false
the type 1 risk \(\alpha\) is the probability to be wrong when discovering a significant effect (=rejecting \(\H_0\))
Solution
the type 1 risk \(\alpha\) is the probability that \(\H_0\) is false(FALSE)the type 1 risk \(\alpha\) is the probability to be wrong when discovering a significant effect (=rejecting \(\H_0\))(FALSE)
The type 1 risk is a conditional probability assuming that \(\H_0\) is true.
The type 1 risk \(\alpha\) is generally chosen (e.g. \(\alpha = 5\%\)). It is important to evaluate the power of the test which can not be done in general (it requires either to design a test where the distribution of statistic \(T\) is known assuming that tjhe alternative hypostheis \(\H_1\) is true, or to evaluate the power using simulations where the truth about \(\H_0\) and \(\H_1\) is konwn.
How to get the answer of a statistical test?
Solution
Choose a given risk \(\alpha\) (generally \(\alpha = 5\%\))
Compare the p-value to \(\alpha\)
if p-value \(\leq \alpha\) then reject \(\H_0\) and the result is significant at level \(\alpha\) (reminder it is not the probability to be wrong)
if p-value \(> \alpha\) then not reject \(\H_0\) and the result is not significant
Or similarly, compare the observed value \(t\) for the statistic \(T\) to the quantile of the distribution under \(\H_0\).
For a bilateral test:
- reject \(\H_0\):
- not reject \(\H_0\):
For a (right) uniteral test:
- reject \(\H_0\)
- not reject \(\H_0\)
6.3.2 p-values, significancy and power
We run a Student T-test on a sample of observations regarding a single variable. We want to know if the average of this variable is null.
The data are generated from the simulation of a random Gaussian variable of mean \(\mu=0\) and variance \(\sigma^2=1\). Here, we know the true mean \(\mu\) and variance \(\sigma^2\), however, in general data analysis, they are usually unknown, hence the motivation to run a test on a possible value for \(\mu\).
Since, we are working with simulations, we can repeat the experiment as much as we want, i.e. generate multiple samples of observed values for the considered variables.
Note: it is only possible because we are working with simulated data. When analysing real experimental data, it is generally very complicated or costly (or even impossible) to repeat the experiments numerous times in order to generate multiple independent samples of observed values.
In the T-test, we test the hypotheses “\(\H_0\): \(\mu = \mu_0\)” versus “\(\H_1\): \(\mu \ne \mu_0\)” where \(\mu\) is the population mean of the variable (that is unknown), for a given value \(\mu_0\). In the following, we will choose \(\mu_0=0\) and test “\(\H_0\): \(\mu = 0\)” versus “\(\H_1\): \(\mu \ne 0\)”.
Here is a function to generate a sample and compute the p-values associated to the T-test:
## function to generate a sample and compute the p-value for the T-test
#' @param sample_size the size of the simulated sample
#' @param mu the mean of the Gaussian distribution used to generate the data (default is 0)
#' @param mu0 the tested value for the mean in the T-test (default is 0)
<- function(sample_size, mu = 0, mu0 = 0) {
compute_p_value # simulation of a Gaussian variable
<- rnorm(sample_size, mean = mu)
obs_values # Computation of the observed value for the statistic associated to the T-test
<- sqrt(sample_size) * (mean(obs_values) - mu0) / sd(obs_values)
stat_value # Computation of the p-value
# (under H0, the statistic follows a Student T-distribution with n-1 degrees of freedom)
<- 2*(1-pt(abs(stat_value), df = sample_size - 1))
p_value # or use t.test()
return(p_value)
}
We repeat the experiment 10000 times: we repeat the data generation with \(\mu=0\) (hence \(\H_0\) is known to be true) and we compute the p-value for the T-test on each repetition.
<- 10000
n_repetition <- replicate(n_repetition, compute_p_value(sample_size = 100, mu = 0)) p_values
Below is the empirical distribution of the collected p-values (10000 computations of the same p-value regarding the same test on the same variable but with different samples each time).
What can you say about this figure? What could be the problem? especially regarding the result of the test?
Solution
In a non negligible number of samples, the null hypothesis was rejected (p-value \(\leq\alpha\)) whereas it is known to be true. In this case, we find a significant result \(\mu\ne 0\) despite being wrong.
However, in the majority of the studies, the null hypothesis is correctly not rejected.
We follow the same process but this time, we generate data where the population mean is \(\mu=0.25\) (hence not 0 and \(\H_0\) is known to be false) and we run the T-test on 10000 repetitions of the experiment.
<- 10000
n_repetition <- replicate(n_repetition, compute_p_value(sample_size = 100, mu = 0.25)) p_values
Below is the empirical distribution of the collected p-values
What can you say about this figure? What could be the problem? especially regarding the result of the test?
Solution
In a non negligible number of samples, the null hypothesis was not rejected (p-value \(\leq\alpha\)) whereas it is known to be false. In this case, we cannot find a significant result \(\mu\ne0\) despite being right.
However, in the majority of the studies, the null hypothesis is correctly rejected.
It is important to evaluate the power of a test (despite being generally impossible with real experimental data).
Imagine a procedure based on simulation to estimate the power of the previous T-test depending on the type I risk \(\alpha\)?
Solution
The power of a test is \(1 - \beta = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is false})\) where \(\beta = \PP(\text{not reject } \H_0\ \vert\ \H_0 \text{ is false})\) is the type II risk.
To estimate \(\beta\), we need to repeat the same experiment multiple time and estimate the corresponding probability.
A simple way is to generate data where \(\H_0\) is known to be false, i.e. \(\PP(\H_0 \text{ is false})=1\).
We have (c.f. here): \[ \begin{aligned} & \text{power}\\ & = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is false})\\ & = \frac{\PP(\text{reject } \H_0\ \ \text{and}\ \H_0 \text{ is false})}{\PP(\H_0 \text{ is false})} \end{aligned} \]
In this case, we have then: \(\text{power} = \PP(\text{reject } \H_0\ \ \text{and}\ \H_0 \text{ is false})\) and we can estimate this probability by counting the number of times reject \(\H_0\) among the repetition of the experiment depending on the type I risk \(\alpha\) that we choose.
We repeat the experiment, generating data where \(\mu\ne0\) (here \(\mu = 0.2\)), hence “\(\H_0: \mu=0\)” is known to be false.
<- 10000
n_repetition <- replicate(n_repetition, compute_p_value(sample_size = 100, mu = 0.25)) p_values
We compute the estimation of the power, being the number of times \(\H_0\) was rejected over the experiment repetitions over the total number of repetitions, depending on the value that we choose for \(\alpha\)
<- seq(0, 1, 0.001) # values from 0 to 1
alpha_values <- sapply(
power_values
alpha_values, function(alpha) {
<- sum(p_values<=alpha)/length(p_values)
estimated_power return(estimated_power)
} )
Here is the representation of the estimated power depending on alpha:
What can you say about this representation?
Solution
Decreasing \(\alpha\) to reduce the type I error decreases the power of the test.
Important:
- confirm a detected effect with additional experiments/studies
- the more (independent) studies, the lower risk of incorrect conclusions
6.4 Loading the data
In this data, 64 different yeast strains are considered.
Note: all strains were generated from an admixture between two original strains but we will not investigate this point here.
For each strain, we have genotype data, containing the SNPs along the yeast genome. Each SNP is encoded with a 0
or 1
value corresponding to the number of derivative allele present at the corresponding locus.
Note: Here, the yeast were sequenced during their haplotype phase, therefore the possible values for each genotype are only
0
and1
.
For each strain, we also have measures of different morphological traits for hundred of cells during 3 different stages of the cell cycle (called "A"
, "A2B"
and "C"
).
Since the mutation is rate is very low in yeast, we can consider that all cells from the same strain share the same genotype across their entire genome.
We will use data stored in the practical_c.RData
file.
<- load(url("https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/practical_c.Rdata"), verbose = T) data_list
It contains the following objects:
print(data_list) # list the loaded objects
[1] "gt_data" "morpho_data" "strain_id" "yeast_data"
[5] "yeast_av_data"
gt_data
: a data frame with 2820 SNPs in rows and the corresponding genotypes for the 64 strains in columns, along with other informations regarding the SNPs such as their position on the genome, their identification, etc. More details with:
str(gt_data)
morpho_data
: a data frame with 45215 cells in rows and the corresponding morphological trait measures in columns, along with the corresponding cell identification, strain identification and cell cycle, etc. More details with:
str(morpho_data)
yeast_data
: a data frame containing the morphological data (seemorpho_data
) and the corresponding genotypes of all SNPs (in additional columns) for each cell (in rows). More details with:
str(yeast_data)
yeast_av_data
: a data frame containing the average (over all cells) of the morphological measures for each strain and each cell cycle phase with the corresponding genotype (same columns asyeast_data
). More details with:
str(yeast_av_data)
strain_id
: a data frame containing the identification of the different strains (including other not present inmorpho_data
andgt_data
)
6.5 Data description
We will first conduct a descriptive analysis of the data. During the course of this practical, we will first consider the morphological measure called A101
and the genotype associated to two SNPs, namely YAL069W_1
and YDL200C_427
.
Generally, when the dimension of the data is not too large (otherwise see below and afterward), a first exploratory step in data analysis (for quantitative variables) is to consider descriptive statistics, such as position parameters (mean/average, empirical quantiles including the median), dispersion parameters (empirical variance and empirical standard deviation), possibly depending on groups or classes defined by a qualitative variables.
Which graphical representation of the data does give an insight about the distribution (including the position and dispersion) of a variable ?
Solution
For instance, a box-plot (and derivative)
IMPORTANT: first we will focus on the morphological data averaged by strain and cell cycle, i.e. yeast_av_data
. The idea here is to avoid dealing with the measure heterogeneity across all cells within each strain (c.f. later).
Construct a boxplot of the value of the variable A101
depending on the genotype associated to the SNPs YAL069W_1
or YDL200C_427
(separately). What can you say about this representation?
Solution
ggplot(yeast_av_data, aes(factor(YAL069W_1), A101)) + geom_boxplot() +
theme_bw()
ggplot(yeast_av_data, aes(factor(YDL200C_427), A101)) + geom_boxplot() +
theme_bw()
How to account for the cell cycle in the previous representations? Is it important?
Solution
ggplot(yeast_av_data, aes(factor(YAL069W_1), A101)) + geom_boxplot() +
facet_wrap(~cell_cycle, labeller = "label_both") +
theme_bw()
ggplot(yeast_av_data, aes(factor(YDL200C_427), A101)) + geom_boxplot() +
facet_wrap(~cell_cycle, labeller = "label_both") +
theme_bw()
What can you infer about the possible relationship between the morphological trait quantified by the variable A101
and the genotype associated to the SNPs YAL069W_1
or YDL200C_427
from the boxplot?
Solution
For all cell cycle phases, the value of the variable A101
seems to depend on the genotype associated to the SNP YAL069W_1
. Indeed, the value of the variable A101
is globally higher in presence of the derivative allele.
On contrary, for cell cycle phases "A1B"
and "C"
, the value of the variable A101
does not seem to depend on the genotype associated to the SNP YAL069W_1
.
Note: depending on the context and the data, other representations like empirical histogram or empirical density representation of quantitative variables of interest depending on factors (qualitative variables) of interest can be used.
6.6 Linear model and analysis of variance
We will now use a linear model, and especially an analysis of variance (ANOVA), to test the association between the morphological trait A101
and the genotype associated to the SNPs YAL069W_1
or YDL200C_427
(separately).
6.6.1 One-factor ANOVA
Write the linear model for the ANOVA regarding the morphological trait A101
depending on genotype associated to the SNP YAL069W_1
?
Solution
Notations:
- \(j=1,2\) is the genotype associated to the SNP
YAL069W_1
(i.e.0
for \(j=1\) and1
for \(j=2\)) - \(y_{jr}\) is the value of the morphological trait
A101
in the \(r^\text{th}\) individual with genotype \(j\). The associated random variable is \(Y_{jr}\)
Model:
\[ Y_{jr} = \mu_j + \varepsilon_{jr} \]
Assumptions:
- we assume that \(\varepsilon_{jr} \sim \mathcal{N}(0, \sigma^2)\)
- we assume that \(Y_{jr}\) are independent (hence that \(\varepsilon_{jr}\) are independent)
Equivalent formulation:
\[ Y_{jr} \sim \mathcal{N}(\mu_j, \sigma^2) \]
IMPORTANT: We will first consider only a single cell cycle phase (e.g. the "C"
phase) to avoid dealing with the measure heterogeneity across cell cycles (c.f. later)
<- yeast_av_data %>% filter(cell_cycle == "C") yeast_av_subdata
Conduct a one-factor ANOVA between the considered output variable (A101
) and each of the considered SNPs (YAL069W_1
or YDL200C_427
). Hint: use the functions lm()
and anova()
Solution
Between A101
and the SNP YAL069W_1
:
anova(lm(A101 ~ factor(YAL069W_1), data = yeast_av_subdata))
Analysis of Variance Table
Response: A101
Df Sum Sq Mean Sq F value Pr(>F)
factor(YAL069W_1) 1 0.004629 0.0046293 8.5298 0.004867 **
Residuals 62 0.033648 0.0005427
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Between A101
and the SNP YAL069W_1
:
anova(lm(A101 ~ factor(YDL200C_427), data = yeast_av_subdata))
Analysis of Variance Table
Response: A101
Df Sum Sq Mean Sq F value Pr(>F)
factor(YDL200C_427) 1 0.000034 0.00003449 0.0559 0.8139
Residuals 62 0.038243 0.00061683
What should we check before interpreting the results of the ANOVA?
Solution
Check the normality of the residuals and the homoskedasticity (constant variance).
Note: you can get the residuals of a model with the
residuals()
function.
Between A101
and the SNP YAL069W_1
:
# linear model
<- lm(A101 ~ factor(YAL069W_1), data = yeast_av_subdata)
mod1
# scatter plot of the residuals
plot(residuals(mod1))
abline(h=0, col = "red")
# empirical distribution of the residuals
hist(residuals(mod1), prob = TRUE) # histogram of empirical frequency (instead of counts)
lines(density(residuals(mod1)), col = "red") # empirical density
# qq-plot
plot(mod1, which = 2)
# check normality
::check_normality(mod1) performance
OK: residuals appear as normally distributed (p = 0.128).
# check homoskedasticity
::check_heteroskedasticity(mod1) performance
OK: Error variance appears to be homoscedastic (p = 0.302).
Between A101
and the SNP YAL069W_1
:
# linear model
<- lm(A101 ~ factor(YDL200C_427), data = yeast_av_subdata)
mod2
# scatter plot of the residuals
plot(residuals(mod2))
abline(h=0, col = "red")
# empirical distribution of the residuals
hist(residuals(mod2), prob = TRUE) # histogram of empirical frequency (instead of counts)
lines(density(residuals(mod2)), col = "red") # empirical density
# qq-plot
plot(mod2, which = 2)
# check normality
::check_normality(mod2) performance
OK: residuals appear as normally distributed (p = 0.084).
# check homooskedasticity
::check_heteroskedasticity(mod2) performance
OK: Error variance appears to be homoscedastic (p = 0.766).
Compute the value of the coefficients in the ANOVA model regarding the morphological trait A101
depending on genotype associated to the SNP YAL069W_1
that was defined in the first question? This model is different from the results given by the lm
function? How so?
Solution
Between A101
and the SNP YAL069W_1
:
# manual computation
%>% group_by(YAL069W_1) %>% summarise(mu=mean(A101)) yeast_av_subdata
# A tibble: 2 × 2
YAL069W_1 mu
<int> <dbl>
1 0 0.215
2 1 0.232
# automatic computation
lm(A101 ~ factor(YAL069W_1), data = yeast_av_subdata)
Call:
lm(formula = A101 ~ factor(YAL069W_1), data = yeast_av_subdata)
Coefficients:
(Intercept) factor(YAL069W_1)1
0.21462 0.01704
Model considered in R (with the previous notations):
\[ Y_{jr} = \mu + \beta_j + \varepsilon_{jr} \]
with:
- \(\mu\) is the intercept
- \(\beta_j\) is the coefficient associated to genotype \(j\), with the constraint: \(\sum_j \beta_j = 0\) (think \(\beta_j = \mu_j - \mu\)), hence you get only \(\beta_j\) for \(j=2\) (i.e. genotype = 1)
Corresponding matrix notation (which requires to reindex \(Y\) and \(E\)):
\[ Y = \mu + X \times B + E \]
where:
- \(Y = [y_i]\) is the vector containing the values for the output variable for all \(n\) individuals indexed by \(i=1:n\)
- \(B = [\beta_j]\) is the vector of coefficients for the \(p\) different genotypes indexed by \(j = 1:p\) (here \(p=2\) in our example)
- \(X = [x_{ij}]\) is the \(\{0,1\}\) indicator that individual \(i\) has genotype \(j\)
- \(E = [\varepsilon_i]\) is the vector of residuals for all individuals indexed by \(i\)
Conditional expectation:
\[ \EE(Y_i\ \vert\ X_{ij} = x_{ij}) = \mu + \sum_{i=1}^n x_{ij}\ \beta_j \]
Then \(\mu\) and \(B\) are estimated by a least square linear regression (see here here and here for more details and references).
Interpretation of the ANOVA results: do you find a significant effect of the respective SNPs on the considered morphological trait? Comment the results in relation with your intuition from the analysis of the descriptive statistics above?
Solution
After verifying the normality and homoskedasticity of the residuals (if it is not verified, we cannot use the results from the ANOVA significance test because it assumes a Gaussian model), we find a significant effect of SNP YAL069W_1
and a non-significant effect of SNP YDL200C_427
onto the morphological trait A101
(when focusing on the C
cell cycle phase), which confirms our intuition from the graphical representation.
6.6.2 Two-factor ANOVA
We would like to avoid restricting the analysis to a single cell cycle phase, and account for the potential variability of the morphological trait A101
across the different cell cycle phases, which is not explained by the genotype.
Write the linear model for the two-factor ANOVA regarding the morphological trait A101
depending on genotype associated to the SNP YAL069W_1
and the cell cycle phase (stored in the column `cell_cycle)?
Solution
Notation:
- \(j=1,2\) indexes the genotype associated to the SNP
YAL069W_1
(i.e.0
for \(j=1\) and1
for \(j=2\)) - \(k=1,2,3\) indexes the cell cycle phase (among
"A"
,"A1B"
,"C"
) - \(y_{jkr}\) is the value of the morphological trait
A101
in the \(r^\text{th}\) individual with genotype \(j\) in cell cycle \(k\). The associated random variable is \(Y_{jkr}\)
Two-factor ANOVA without interaction:
\[ Y_{jkr} = \mu + \beta_j + \alpha_k + \varepsilon_{jkr} \]
where:
- \(\mu\) is the intercept
- \(\beta_j\) is the coefficient associated to genotype \(j\), with the constraint: \(\sum_j \beta_j = 0\)
- \(\alpha_k\) is the coefficient associated to genotype \(k\), with the constraint: \(\sum_k \alpha_k = 0\)
Two-factor ANOVA with interaction:
\[ Y_{jkr} = \mu + \beta_j + \alpha_k + \gamma_{jk} + \varepsilon_{jkr} \]
where:
- \(\beta_j\) is the coefficient associated to the interaction between genotype \(j\) and cell cycle phase \(k\), with the constraints: \(\sum_j \gamma_{jk} = 0\) and \(\sum_k \gamma_{jk} = 0\).
Conduct a one-factor ANOVA between the considered output variable (A101
) and a combination of one of the considered SNPs (YAL069W_1
or YDL200C_427
) and the cell cycle? (for the moment, we do not consider both SNPs in the same model)
Solution
A101
depending on SNPYAL069W_1
andcell_cycle
(without interaction):
anova(lm(A101 ~ factor(YAL069W_1) + cell_cycle, data = yeast_av_data))
Analysis of Variance Table
Response: A101
Df Sum Sq Mean Sq F value Pr(>F)
factor(YAL069W_1) 1 0.012442 0.0124421 11.561 0.0008225 ***
cell_cycle 2 0.051605 0.0258026 23.975 5.317e-10 ***
Residuals 188 0.202330 0.0010762
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: when the variable is numerical but with integer (discrete) values (e.g.
YAL069W_1
encoded with 0 and 1), we specifyfactor()
in the model so that thelm()
function understands this function as a qualitative and not quantitative variable. When the variable type is not ambiguous (e.g.cell_cycle
which is a character variable), it is not necessary to specify that it is afactor()
in the linear model, thelm()
function assumes it automatically.
A101
depending on SNPYDL200C_427
andcell_cycle
(without interaction):
anova(lm(A101 ~ factor(YDL200C_427) + cell_cycle, data = yeast_av_data))
Analysis of Variance Table
Response: A101
Df Sum Sq Mean Sq F value Pr(>F)
factor(YDL200C_427) 1 0.000014 0.0000141 0.0123 0.9118
cell_cycle 2 0.051605 0.0258026 22.5877 1.617e-09 ***
Residuals 188 0.214758 0.0011423
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interaction?
A101
depending on SNPYAL069W_1
andcell_cycle
(with interaction):
anova(lm(A101 ~ factor(YAL069W_1) * cell_cycle, data = yeast_av_data))
Analysis of Variance Table
Response: A101
Df Sum Sq Mean Sq F value Pr(>F)
factor(YAL069W_1) 1 0.012442 0.0124421 11.4894 0.0008546 ***
cell_cycle 2 0.051605 0.0258026 23.8268 6.13e-10 ***
factor(YAL069W_1):cell_cycle 2 0.000905 0.0004527 0.4181 0.6589336
Residuals 186 0.201424 0.0010829
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A101
depending on SNPYDL200C_427
andcell_cycle
(with interaction):
anova(lm(A101 ~ factor(YDL200C_427) * cell_cycle, data = yeast_av_data))
Analysis of Variance Table
Response: A101
Df Sum Sq Mean Sq F value Pr(>F)
factor(YDL200C_427) 1 0.000014 0.0000141 0.0122 0.9122
cell_cycle 2 0.051605 0.0258026 22.3711 1.967e-09 ***
factor(YDL200C_427):cell_cycle 2 0.000227 0.0001135 0.0984 0.9063
Residuals 186 0.214531 0.0011534
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: in a linear formula in R,
factor1 * factor2
is equivalent tofactor1 + factor2 + factor1:factor2
. The:
is used to explicitely specify the interaction between factors.
After doing regular verification for an ANOVA model, interpret the previous results?
Solution
Regular verification (normality and homoskedasticity for the residuals):
A101
depending on SNPYAL069W_1
andcell_cycle
(without interaction):
# linear model
<- lm(A101 ~ factor(YAL069W_1) + cell_cycle, data = yeast_av_data)
mod1
# check normality
::check_normality(mod1) performance
Warning: Non-normality of residuals detected (p < .001).
# check homoskedasticity
::check_heteroskedasticity(mod1) performance
Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
A101
depending on SNPYAL069W_1
andcell_cycle
(with interaction):
# linear model
<- lm(A101 ~ factor(YAL069W_1) * cell_cycle, data = yeast_av_data)
mod2
# check normality
::check_normality(mod2) performance
Warning: Non-normality of residuals detected (p < .001).
# check homoskedasticity
::check_heteroskedasticity(mod2) performance
Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
A101
depending on SNPYDL200C_427
andcell_cycle
(without interaction):
# linear model
<- lm(A101 ~ factor(YDL200C_427) + cell_cycle, data = yeast_av_data)
mod3
# check normality
::check_normality(mod3) performance
Warning: Non-normality of residuals detected (p < .001).
# check homoskedasticity
::check_heteroskedasticity(mod3) performance
Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
A101
depending on SNPYDL200C_427
andcell_cycle
(with interaction):
# linear model
<- lm(A101 ~ factor(YDL200C_427) * cell_cycle, data = yeast_av_data)
mod4
# check normality
::check_normality(mod4) performance
Warning: Non-normality of residuals detected (p < .001).
# check homoskedasticity
::check_heteroskedasticity(mod4) performance
Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
Interpretation and comments:
- The prerequisite Gaussian and homoskedasticity assumptions are not verified therefore, in theory, we cannot use the p-values computed by R to assess which factor (or interaction) has an effect or not on the output variable
A101
.
What can we do in that case?
- use variance stabilizing function to transform the output response, like
log1p()
, i.e. \(x \mapsto \log(x+1)\), or the Anscombe transform - switch to a generalized linear model if relevant, e.g. in case of specific non-Gaussian output variables, like categorical or count variables
- We can at least verify the values of the computed coefficients returned by the
lm()
functions (using the `coefficient) for the different model:
coefficients(mod1)
(Intercept) factor(YAL069W_1)1 cell_cycleA1B cell_cycleC
0.205166427 0.016131578 -0.028726263 0.009939116
coefficients(mod2)
(Intercept) factor(YAL069W_1)1
0.202608272 0.020946929
cell_cycleA1B cell_cycleC
-0.023125759 0.012013077
factor(YAL069W_1)1:cell_cycleA1B factor(YAL069W_1)1:cell_cycleC
-0.010542125 -0.003903927
coefficients(mod3)
(Intercept) factor(YDL200C_427)1 cell_cycleA1B
0.2140432349 -0.0005456127 -0.0287262627
cell_cycleC
0.0099391158
coefficients(mod4)
(Intercept) factor(YDL200C_427)1
0.215756410 -0.003591258
cell_cycleA1B cell_cycleC
-0.031013344 0.007086671
factor(YDL200C_427)1:cell_cycleA1B factor(YDL200C_427)1:cell_cycleC
0.004065923 0.005071013
The closer to zero coefficient, the lower effect. However, we don’t have any idea of the statistical significance of theses results.
In practice, you need to choose if you are going to illicitly use these p-values or not (knowing that the conditions for the results to be interpretable are not fully verified).
Note on model selection:
- When comparing multiple model, e.g. with or without interaction, a simple rule is to start by training with the “richest” model (with all factors and interactions) and then train a more simple model where we remove factors and/or interactions that are not significant (if we can use the p-values).
- There exist automatic methods, called model selection approaches to allow to automatically compare and choose the “best” model among a set of candidate models (these types of methods and the comparison criteria they are based on is beyond the scope of the course).
6.7 Dimension reduction
In the following, we want to consider all the SNPs in the dataset and not just one or two.
6.7.1 Genotype data
Here is a heatmap representation of the genotype table regarding all SNPs (in columns) for each strain (in rows): each cell of the image represent the genotype for the corresponding SNP in the corresponding strain.
What can you say about the heatmap ?
Solution
We observe a lot of variability in the genotypes along all SNPs between the different strains. Some SNPs seems to be less variable than other. It appears quite difficult to extract information from this representation. We would need to summarize the information contained in the genotype table to get an overview of these information.
How could we visualize both the information contained in the SNP data and the morphological trait data (we will still focus on a single morphological variable, e.g. A101
)?
Solution
We can use a dimension reduction approach, like PCA.
6.7.2 PCA vs Correspondence analysis
Use a PCA dimension reduction on the SNP data and use a visualization of the results colored by the corresponding value of the morphological trait A101
in each strain?
Here is the code to extract the data. Again, we focus on a specific cell cycle.
# extract genotype data regarding all SNP
<- yeast_av_data %>%
data_for_pca # consider a specific cell cycle
filter(cell_cycle == "A") %>%
# select genotype data
select(c(26:ncol(.))) %>%
# remove SNPs with too many NAs
::select(which(colSums(is.na(.)) == 0))
dplyr
# extract corresponding value for morphological trait
<- yeast_av_data %>%
A101_value # consider a specific cell cycle
filter(cell_cycle == "A") %>%
# extract result
select(A101) %>% unlist()
Solution
# run PCA
<- prcomp(data_for_pca)
pca_res
# PCA indiv plot
fviz_pca_ind(
pca_res,geom = "point",
col.ind = A101_value,
gradient.cols = c("#2E9FDF", "#FC4E07")
)
We can use a discretized scale to get a better visualization effect:
# discretize the continuous morphological variable
<- yeast_av_data %>%
discrete_A101 # consider a specific cell cycle
filter(cell_cycle == "A") %>%
# discretize
mutate(A101 = cut(A101, breaks = seq(0,40,by=5)/100)) %>%
# extract result
select(A101) %>% unlist() %>% as.character()
# PCA indiv plot
fviz_pca_ind(
pca_res,geom = "point",
col.ind = discrete_A101
)
What can you say about this representation?
Solution
The structure of the SNP data extracted by the first two components of PCA does not allow to discriminate between the different level of the morphological trait A101
. The variability in the data explained by the first two components is not very high though.
Maybe we can use a different dimension reduction approach that is more suitable for the SNP data. If we consider SNP genotype as categorical variables, we can use the multiple correspondence analysis (MCA) method which is another linear dimension reduction approach but adatped to categorical variables.
Here are the results of such an approach:
# extract genotype data regarding all SNP and convert them to factors
<- yeast_av_data %>%
data_for_mca # consider a specific cell cycle
filter(cell_cycle == "A") %>%
# select genotype data
select(c(26:ncol(.))) %>%
# remove SNPs with too many NAs
::select(which(colSums(is.na(.)) == 0)) %>%
dplyr# transform all columns in factors
mutate(across(everything(), ~as.factor(.x)))
# run MCA
<- MCA(data_for_mca, graph=FALSE)
mca_res
# MCA indiv plot
fviz_mca_ind(
mca_res,geom = "point",
col.ind = discrete_A101
)
Is the representation obtained by MCA better than the one obtained with PCA?
Solution
Not better, maybe we could try some supervised dimension reduction approaches (c.f. next note) or some non-linear dimension reduction approaches (c.f. previous practical subject).
Note on supervised dimension reduction and mulitple output variables: here we consider a single morphological trait as output variable, however this output variable is not used in the dimension reduction process. One could aim at finding the reduced-dimension representation of the data that best relate to this output variable. To solve this problem, one can use the partial least squares regression (PLS) method which is specifically able to find a latent dimension that relate to an output variable. In addition, there exist specific approaches, such as PLS and also canonical correlation analysis (CCA), that allows to perform this supervised dimension reduction by accouting for multiple output variables (e.g. multiple morphological traits in our example).
6.8 Multiple testing
The exploration of the SNP data by linear dimension reduction approaches did not allow to detect a specific global structure linking the SNP genotypes and the morphological trait measure A101
. However, we still want to investigate the possible effect of the genotypes of all the recorded SNP on the morphological trait A101
.
To do so, we are going to run an ANOVA for each SNP.
Given the number of SNPs in the data (i.e. 2820), what could be the risk when running such an analysis?
Solution
We are going to do thousands of tests, computing and using thousands of p-values to assess the potential significant effect of each SNP on the considered morphological trait.
We have a non-negligible risk to wrongly reject the null hypothesis for many of the SNPs, and conclude to a non-existing significant effect.
Thus, we have to use p-values correction (or adjustment) procedure adapted to the case of multiple testing.
We run the two-factor ANOVA model for all SNP (accounting for the considered SNP effect and the cell cycle effect).
<- yeast_av_data %>%
test_result # run all ANOVA tests
summarize(
across(
any_of(gt_data$RQTL_name),
~anova(lm(A101~factor(.x) + cell_cycle))$`Pr(>F)`[1]
)%>%
) # reformat the output (to get a data.frame [SNP, p_values])
t() %>% as.data.frame() %>% dplyr::rename("p_values" = "V1") %>%
rownames_to_column("SNP") %>%
# extract SNP index (to facilitate graphical representation)
mutate(SNP_index = as.integer(as.factor(SNP)))
What should we verify?
Solution
In theory, we should verify the Gaussian and homoskedasticity assumptions for the residuals in each model (i.e. 2820 models). In practice, it can be cumbersome… (but it would be a flaw in the analysis).
Here is the result of the analysis. We plot the p_values for all SNPs.
ggplot(test_result) + geom_point(aes(x=SNP_index, y=p_values)) +
geom_hline(yintercept = 0.05, col = "red") +
annotate("text", label = "alpha = 5%", x=100, y=0.1, col = "red") +
theme_bw()
What can you say about these results?
Solution
All SNPs for which the p-value is below the horizontal line are found to have a significant effect on the morphological trait A101
.
It corresponding to hundred of genes:
sum(test_result$p_values <= 0.05)
[1] 612
But what about the p-value correction?
The p-values should be adjusted in the context of multiple testing. Try different method to adjust the p-values (see the function p.adjust()
).
Solution
<- test_result %>%
test_result mutate(
bonf_adj_p_values = p.adjust(p_values, method="bonferroni"),
fdr_adj_p_values = p.adjust(p_values, method="fdr"),
)
Here is the distribution of the p-values after and before adjustement (we used the Bonferroni and the FDR adjustment approaches):
ggplot(
%>% pivot_longer(!c(SNP, SNP_index), names_to = "type", values_to = "p_values")
test_result + geom_histogram(aes(x=p_values)) +
) facet_wrap(~type) +
geom_vline(xintercept = 0.05, col = "red") +
annotate("text", label = "alpha = 5%", x=0.1, y=1000, angle=90, col = "red") +
theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What can you say about the different corrections?
Solution
After the Bonferronni correction, no significant effect is found anymore. It tend to be very conservative.
The FDR correction reduces the number of found significant effects:
sum(test_result$fdr_adj_p_values <= 0.05)
[1] 198
Here is another way to compare the FDR-adjusted and original p-values:
ggplot(test_result) + geom_line(aes(x=p_values, y=fdr_adj_p_values)) +
geom_abline(slope=1, intercept=0, col = "red") +
annotate("text", label = "y=x", x=0.5, y=0.45, angle=45, col = "red") +
theme_bw()
Eventually, we can also compare the number of significant SNPs found before and after p-value correction depending on the chosen type I risk \(\alpha\).
# compute the number of significant SNPs
<- data.frame(alpha = seq(0, 1, 0.001)) %>%
signif_result rowwise() %>%
mutate(
signif = sum(test_result$p_values <= alpha),
signif_adj = sum(test_result$fdr_adj_p_values <= alpha)
)
# graphical representation of the number of significant SNPs depending on alpha
ggplot(
%>% pivot_longer(!alpha, names_to = "type", values_to = "number_snp")
signif_result +
) geom_line(aes(x = alpha, y=number_snp, col=type)) +
geom_vline(xintercept = 0.05, col = "red") +
annotate("text", label = "alpha = 5%", x=0.1, y=2000, angle=90, col = "red") +
theme_bw()
Extract the top50 SNPs with the most significant effects.
Solution
%>%
test_result arrange(fdr_adj_p_values) %>%
select(!bonf_adj_p_values) %>%
head(50)
SNP p_values SNP_index fdr_adj_p_values
1 YCL032W_344 1.143022e-04 1051 0.01349748
2 YCL026C_345 1.175937e-04 1050 0.01349748
3 YCL023C_347 1.340176e-04 1048 0.01349748
4 YCL022C_348 9.571504e-05 1047 0.01349748
5 gCL01_350 8.850660e-05 25 0.01349748
6 YCL018W_359 8.683893e-05 1036 0.01349748
7 YCL018W_360 7.954324e-05 1037 0.01349748
8 YCL018W_361 8.683893e-05 1038 0.01349748
9 YCL018W_362 8.683893e-05 1039 0.01349748
10 YCL018W_363 8.683893e-05 1040 0.01349748
11 YCL018W_364 8.683893e-05 1041 0.01349748
12 YCL018W_365 8.683893e-05 1042 0.01349748
13 YCL018W_366 3.156126e-05 1043 0.01349748
14 YCL018W_367 8.683893e-05 1044 0.01349748
15 YCL018W_368 8.683893e-05 1045 0.01349748
16 YCL018W_369 8.683893e-05 1046 0.01349748
17 NCR015C_372 1.338310e-04 404 0.01349748
18 YFR045W_862 6.205391e-05 1430 0.01349748
19 YHR095W_1279 1.338548e-04 1694 0.01349748
20 YHR104W_1282 7.856486e-05 1695 0.01349748
21 YPL242C_2803 3.349920e-05 2795 0.01349748
22 YPL238C_2804 1.035004e-04 2794 0.01349748
23 YPL235W_2805 1.101901e-04 2793 0.01349748
24 YPL234C_2806 5.169574e-05 2791 0.01349748
25 YPL234C_2807 1.101901e-04 2792 0.01349748
26 YPL233W_2808 1.101901e-04 2790 0.01349748
27 gPL03_2809 1.101901e-04 294 0.01349748
28 YPL231W_2810 1.291664e-04 2789 0.01349748
29 YCL025C_346 1.440175e-04 1049 0.01390326
30 YPL246C_2801 1.479070e-04 2798 0.01390326
31 gCL01_349 2.371728e-04 24 0.01529711
32 gCL01_351 2.371728e-04 26 0.01529711
33 gCL01_352 2.371728e-04 27 0.01529711
34 gCL01_353 2.371728e-04 28 0.01529711
35 gCL01_354 2.371728e-04 29 0.01529711
36 gCL01_355 2.371728e-04 30 0.01529711
37 gCL01_357 2.371728e-04 32 0.01529711
38 gCL01_358 2.371728e-04 33 0.01529711
39 NCR015C_373 2.441028e-04 405 0.01529711
40 NCR015C_374 2.441028e-04 406 0.01529711
41 YEL068C_617 1.921348e-04 1288 0.01529711
42 YEL068C_618 1.921348e-04 1289 0.01529711
43 gHR04_1280 2.208442e-04 99 0.01529711
44 gHR04_1281 2.208442e-04 100 0.01529711
45 YLR074C_1972 2.321278e-04 2177 0.01529711
46 YDR251W_550 2.846138e-04 1206 0.01672106
47 YDR252W_551 2.846138e-04 1207 0.01672106
48 YNL220W_2414 2.810908e-04 2504 0.01672106
49 YLR074C_1971 3.013515e-04 2176 0.01734309
50 YAL069W_1 8.225020e-04 880 0.01743952
What can we do with these results?
Solution
investigate the corresponding genomic region containing these SNPs (is it a gene, etc.)
run a confirmation study (for instance by inducing mutation in these SNPs)
etc.
6.9 Full data analysis
Open (and optional) question: run the previous analysis to find SNPs with significant effect on the morphological trait A101
with the full dataset yeast_data
, i.e. without the average by strain for the morphological traits.
In this case, you will have to account for the strain_id
factor in the ANOVA.
You could also try to consider linear model integrating multiple SNPs (instead of testing each SNP independently). However be aware of potential dimension issues. You will also encounter the question of model selection in this particular setting (which SNP do I integrate in my model?).
6.10 Appendix (optional): Simpon’s paradox and confounding factors
In the palmerpenguins
dataset, we have different morphological measurements for numerous individuals of penguins of different species. We want to investigate a possible link between the bill depth vs the bill length for these penguins.
Reminder about the anatomy of a penguin:
Here is a scatter plot representing the bill depth vs the bill length for all individuals in the dataset.
What can you say about this representation? Can you infer any relationship between bill depth and length in penguins?
Solution
There seems to be a negative relationship between bill depth and length in penguins. As the length increases, the depth decreases.
Could we question this result?
Solution
Here is the same representation discriminated by species:
We obtained opposite conclusions : in all species, there seems to be a positive relationship between bill depth and length. As the length increases, the depth also increases.
The species factor is called a confounding factors: a variable (potentially unknown) that influences both variables resulting in a spurious association.
In this particular case, the change of trend when accounting for the groups of individuals is called the Simpon’s paradox.
Here is another example from the following publication:
[2] Appleton, David & French, Joyce & Vanderpump, Mark. (1996). Ignoring a Covariate: An Example of Simpson’s Paradox. American Statistician - AMER STATIST. 50. 340-341. 10.1080/00031305.1996.10473563.
After a first survey in the 1970s, a follow-up study in 1992-94 (in Whickham a mixed urban and rural district near Newcastle upon Tyne, United Kingdom) investigated (among other things) the 20-year after survival of subjects included in the original study.
Here are the results (dataset from [2]) regarding the survival of the women in the study depending on their smoking status (“smoker” or “non smoker”):
smoker_status | dead | alive | survival_rate |
---|---|---|---|
non_smoker | 230 | 502 | 68.579% |
smoker | 139 | 443 | 76.117% |
What can you say about these results ?
Solution
The survival rate is larger in the “smoker” sub-group than in the “not smoker” sub-group. Surprising?
What could explain this surprising result?
Solution
A confounding variable: the age of the person in the original study.
There is a sampling bias in this dataset. The number of older non-smoking women (thus with a higher mortality risk) is very high compared to the other groups, hence decreasing the global survival rate in the non-smoking group.
age | smoker_status | dead | alive | survival_rate |
---|---|---|---|---|
18-24 | non_smoker | 1 | 61 | 98.387% |
25-34 | non_smoker | 5 | 152 | 96.815% |
35-44 | non_smoker | 7 | 114 | 94.215% |
45-54 | non_smoker | 12 | 66 | 84.615% |
55-64 | non_smoker | 40 | 81 | 66.942% |
65-74 | non_smoker | 101 | 28 | 21.705% |
75+ | non_smoker | 64 | 0 | 0% |
18-24 | smoker | 2 | 53 | 96.364% |
25-34 | smoker | 3 | 121 | 97.581% |
35-44 | smoker | 14 | 95 | 87.156% |
45-54 | smoker | 27 | 103 | 79.231% |
55-64 | smoker | 51 | 64 | 55.652% |
65-74 | smoker | 29 | 7 | 19.444% |
75+ | smoker | 13 | 0 | 0% |
Note: in the two previous examples, we did not focus on quantifying any statistical significance of the different effects that we discussed (that would require using statistical models). The point was to give you the intuition about potential issues and misleading results caused by any forgotten confounding factors and the Simpson’s paradox in an analysis.
How to avoid drawing false conclusions because of confounding factors?
Solution
Use randomization to build your experiment, like randomized controlled trial, possibly with double-blinding, to remove the potential effect of confounding variables (should be used for any serious drug trial), or at least control the potential sampling bias caused by confounding factors (like case-control studies, cohort studies, stratification).
If not possible (it is not always possible depending on the design of the experiment and/or the object of the study), measure and log various metadata regarding your subjects/individuals so that you will be able to account for the potential effect of confounding variables in your analysis (c.f. later).
It generally requires a certain level of technical expertise/knowledge in the considered subject to be able to identify potential confounding factors before the experiments (so that you can monitor and log the corresponding quantities during your experiment).
More details here.
The .Rmd file corresponding to this page is available here under the AGPL3 Licence