6 Practice: Introduction to linear models and multiple testing

Author

Ghislain Durif, Laurent Modolo, Franck Picard

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

6.1 Introduction

Linear models are widely used in data analysis to relate variables and discover relationships between quantitative and/or qualitative measurements. Linear models can be used for modeling, prediction, data normalization, factor/variable effect on an outcome detection or quantification, etc, sometimes in combination with statistical tests. More generally, hypothesis testing is at the core of the scientific method, formulating hypotheses and then verifying them with experiments. In this context, understanding p-values and multiple testing is crucial to avoid the pitfall of false discoveries.

To manipulate these notions, this practical will focus on possible associations between genotype and morphological data measured in yeast, that were originally studied in the following publication:

[1] Nogami S, Ohya Y, Yvert G (2007) Genetic Complexity and Quantitative Trait Loci Mapping of Yeast Morphological Traits . PLOS Genetics 3(2): e31. https://doi.org/10.1371/journal.pgen.0030031

Yeast are “eukaryotic, single-celled microorganisms classified as members of the fungus kingdom” (per Wikipedia).

Trulli — Microscopy picture of yeast cells (credit)

We will study the possible associations between the genotype of the considered yeast colonies and some morpholocigal traits directly measured on the yeast through image analysis, such as (from [1]) :

Throughout this practical, we will explore the following subjects:

Descriptive statistics
Statistical hypothesis testing and p-values
Linear model and analysis of variance (ANOVA)
Dimension reduction
Multiple testing

6.2 Requirements

We will need the following packages. If not already done, you need to install the following packages:

install.packages("tidyverse")   # to manipulate data and make plot
install.packages("performance") # to evaluate model
install.packages("FactoMineR")  # for dimension reduction
install.packages("factoextra")  # to plot dimension reduction output

Now, we can load them:

library(tidyverse)   # to manipulate data and make plot
library(performance) # to evaluate model
library(FactoMineR)  # for dimension reduction
library(factoextra)  # to plot dimension reduction output

6.3 Reminder about statistical hypothesis testing and p-values

6.3.1 Statistical testing

6.3.1.1 Building the test

Confront two hypotheses, the null hypothesis \(\H_0\) vs the alternative hypothesis \(\H_1\).

Detail the general framework to build a statistical test?

Solution

Compute the observed value \(t\) of a random variable, called a statistic, \(T\) using measured quantities on a sample of observations.
Requirements: know the probability distribution of \(T\) assuming \(\H_0\) is true
Verify if the value \(t\) is “probable” according to the distibution of \(T\) under \(\H_0\).

What are the possible answers of a statistical test?

Solution

possible answers: “\(\H_0\) is false with a given risk” (reject \(\H_0\), the test result is significant) vs “the result is not significant” (given this sample, we don’t know if \(\H_0\) is true or false).
the answer is never: ~~\(\H_0\) is true~~.

6.3.1.2 Getting the answer of the test

What is the p-value ?

Solution

p-value = conditional probability assuming \(\H_0\) that the statistic \(T\) is at least as extreme as the observed value \(t\)

For a bilateral test:

\[ p\text{-value} = \PP(T < -t \ \text{or}\ T > t) = \PP(\vert T\vert > \vert t\vert) = 1 - \PP(-t \leq T \leq t) \]

For a (right) uniteral test \[ p\text{-value} = \PP(T > t) = 1 - \PP(T \leq t) \]

Are the following statements true of false ?

the p-value is the estimated probability that \(\H_0\) is true
the p-value is the risk (estimated probability) to be wrong when discovering a significant effect (=rejecting \(\H_0\))

Solution

~~the p-value is the estimated probability that \(\H_0\) is true~~ (FALSE)
~~the p-value is the risk (estimated proabiility) to be wrong when discovering a significant effect (=rejecting \(\H_0\))~~ (FALSE)

The p-value is a conditional probability assuming that \(\H_0\) is true.

Reasoning:

assuming \(\H_0\) \(\to\) the statistics \(T\) follows given probability distribution
“Inductive contraposition”: p-value is small \(\to\) observed value \(t\) for the statistics \(T\) is unlikely considering its probability distribution under \(\H_0\) \(\to\) it is unlikely that \(T\) follows this distribution \(\to\) it is unlikely that \(\H_0\) is true.

What are type I and type II errors?

Solution

Type I error: reject \(\H_0\) conditionally to the fact that \(\H_0\) is true.
Type II error: not reject \(\H_0\) conditionally to the fact that \(\H_0\) is false.

What are type I and type II risks?

Solution

Type I risk: \(\alpha = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is true})\)
Type II risk: \(\beta = \PP(\text{not reject } \H_0\ \vert\ \H_0 \text{ is false})\)
Power \(= 1 - \beta = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is false})\)

Are the following statements true of false ?

the type 1 risk \(\alpha\) is the probability that \(\H_0\) is false
the type 1 risk \(\alpha\) is the probability to be wrong when discovering a significant effect (=rejecting \(\H_0\))

Solution

~~the type 1 risk \(\alpha\) is the probability that \(\H_0\) is false~~ (FALSE)
~~the type 1 risk \(\alpha\) is the probability to be wrong when discovering a significant effect (=rejecting \(\H_0\))~~ (FALSE)

The type 1 risk is a conditional probability assuming that \(\H_0\) is true.

The type 1 risk \(\alpha\) is generally chosen (e.g. \(\alpha = 5\%\)). It is important to evaluate the power of the test which can not be done in general (it requires either to design a test where the distribution of statistic \(T\) is known assuming that tjhe alternative hypostheis \(\H_1\) is true, or to evaluate the power using simulations where the truth about \(\H_0\) and \(\H_1\) is konwn.

How to get the answer of a statistical test?

Solution

Choose a given risk \(\alpha\) (generally \(\alpha = 5\%\))
Compare the p-value to \(\alpha\)

if p-value \(\leq \alpha\) then reject \(\H_0\) and the result is significant at level \(\alpha\) (reminder it is not the probability to be wrong)
if p-value \(> \alpha\) then not reject \(\H_0\) and the result is not significant

Or similarly, compare the observed value \(t\) for the statistic \(T\) to the quantile of the distribution under \(\H_0\).

For a bilateral test:

reject \(\H_0\):

not reject \(\H_0\):

For a (right) uniteral test:

reject \(\H_0\)

not reject \(\H_0\)

6.3.2 p-values, significancy and power

We run a Student T-test on a sample of observations regarding a single variable. We want to know if the average of this variable is null.

The data are generated from the simulation of a random Gaussian variable of mean \(\mu=0\) and variance \(\sigma^2=1\). Here, we know the true mean \(\mu\) and variance \(\sigma^2\), however, in general data analysis, they are usually unknown, hence the motivation to run a test on a possible value for \(\mu\).

Since, we are working with simulations, we can repeat the experiment as much as we want, i.e. generate multiple samples of observed values for the considered variables.

Note: it is only possible because we are working with simulated data. When analysing real experimental data, it is generally very complicated or costly (or even impossible) to repeat the experiments numerous times in order to generate multiple independent samples of observed values.

In the T-test, we test the hypotheses “\(\H_0\): \(\mu = \mu_0\)” versus “\(\H_1\): \(\mu \ne \mu_0\)” where \(\mu\) is the population mean of the variable (that is unknown), for a given value \(\mu_0\). In the following, we will choose \(\mu_0=0\) and test “\(\H_0\): \(\mu = 0\)” versus “\(\H_1\): \(\mu \ne 0\)”.

Here is a function to generate a sample and compute the p-values associated to the T-test:

## function to generate a sample and compute the p-value for the T-test
#' @param sample_size the size of the simulated sample
#' @param mu the mean of the Gaussian distribution used to generate the data (default is 0)
#' @param mu0 the tested value for the mean in the T-test (default is 0)
compute_p_value <- function(sample_size, mu = 0, mu0 = 0) {
    # simulation of a Gaussian variable
    obs_values <- rnorm(sample_size, mean = mu)
    # Computation of the observed value for the statistic associated to the T-test
    stat_value <- sqrt(sample_size) * (mean(obs_values) - mu0) / sd(obs_values)
    # Computation of the p-value
    # (under H0, the statistic follows a Student T-distribution with n-1 degrees of freedom)
    p_value <- 2*(1-pt(abs(stat_value), df = sample_size - 1))
    # or use t.test()
    return(p_value)
}

We repeat the experiment 10000 times: we repeat the data generation with \(\mu=0\) (hence \(\H_0\) is known to be true) and we compute the p-value for the T-test on each repetition.

n_repetition <- 10000
p_values <- replicate(n_repetition, compute_p_value(sample_size = 100, mu = 0))

Below is the empirical distribution of the collected p-values (10000 computations of the same p-value regarding the same test on the same variable but with different samples each time).

What can you say about this figure? What could be the problem? especially regarding the result of the test?

Solution

In a non negligible number of samples, the null hypothesis was rejected (p-value \(\leq\alpha\)) whereas it is known to be true. In this case, we find a significant result \(\mu\ne 0\) despite being wrong.

However, in the majority of the studies, the null hypothesis is correctly not rejected.

We follow the same process but this time, we generate data where the population mean is \(\mu=0.25\) (hence not 0 and \(\H_0\) is known to be false) and we run the T-test on 10000 repetitions of the experiment.

n_repetition <- 10000
p_values <- replicate(n_repetition, compute_p_value(sample_size = 100, mu = 0.25))

Below is the empirical distribution of the collected p-values

What can you say about this figure? What could be the problem? especially regarding the result of the test?

Solution

In a non negligible number of samples, the null hypothesis was not rejected (p-value \(\leq\alpha\)) whereas it is known to be false. In this case, we cannot find a significant result \(\mu\ne0\) despite being right.

However, in the majority of the studies, the null hypothesis is correctly rejected.

It is important to evaluate the power of a test (despite being generally impossible with real experimental data).

Imagine a procedure based on simulation to estimate the power of the previous T-test depending on the type I risk \(\alpha\)?

Solution

The power of a test is \(1 - \beta = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is false})\) where \(\beta = \PP(\text{not reject } \H_0\ \vert\ \H_0 \text{ is false})\) is the type II risk.

To estimate \(\beta\), we need to repeat the same experiment multiple time and estimate the corresponding probability.

A simple way is to generate data where \(\H_0\) is known to be false, i.e. \(\PP(\H_0 \text{ is false})=1\).

We have (c.f. here): \[ \begin{aligned} & \text{power}\\ & = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is false})\\ & = \frac{\PP(\text{reject } \H_0\ \ \text{and}\ \H_0 \text{ is false})}{\PP(\H_0 \text{ is false})} \end{aligned} \]

In this case, we have then: \(\text{power} = \PP(\text{reject } \H_0\ \ \text{and}\ \H_0 \text{ is false})\) and we can estimate this probability by counting the number of times reject \(\H_0\) among the repetition of the experiment depending on the type I risk \(\alpha\) that we choose.

We repeat the experiment, generating data where \(\mu\ne0\) (here \(\mu = 0.2\)), hence “\(\H_0: \mu=0\)” is known to be false.

n_repetition <- 10000
p_values <- replicate(n_repetition, compute_p_value(sample_size = 100, mu = 0.25))

We compute the estimation of the power, being the number of times \(\H_0\) was rejected over the experiment repetitions over the total number of repetitions, depending on the value that we choose for \(\alpha\)

alpha_values <- seq(0, 1, 0.001) # values from 0 to 1
power_values <- sapply(
    alpha_values, 
    function(alpha) {
        estimated_power <- sum(p_values<=alpha)/length(p_values)
        return(estimated_power)
    }
)

Here is the representation of the estimated power depending on alpha:

What can you say about this representation?

Solution

Decreasing \(\alpha\) to reduce the type I error decreases the power of the test.

Important:

confirm a detected effect with additional experiments/studies
the more (independent) studies, the lower risk of incorrect conclusions

6.4 Loading the data

In this data, 64 different yeast strains are considered.

Note: all strains were generated from an admixture between two original strains but we will not investigate this point here.

For each strain, we have genotype data, containing the SNPs along the yeast genome. Each SNP is encoded with a 0 or 1 value corresponding to the number of derivative allele present at the corresponding locus.

Note: Here, the yeast were sequenced during their haplotype phase, therefore the possible values for each genotype are only 0 and 1.

For each strain, we also have measures of different morphological traits for hundred of cells during 3 different stages of the cell cycle (called "A", "A2B" and "C").

Since the mutation is rate is very low in yeast, we can consider that all cells from the same strain share the same genotype across their entire genome.

We will use data stored in the practical_c.RData file.

data_list <- load(url("https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/practical_c.Rdata"), verbose = T)

It contains the following objects:

print(data_list) # list the loaded objects

[1] "gt_data"       "morpho_data"   "strain_id"     "yeast_data"   
[5] "yeast_av_data"

gt_data: a data frame with 2820 SNPs in rows and the corresponding genotypes for the 64 strains in columns, along with other informations regarding the SNPs such as their position on the genome, their identification, etc. More details with:

str(gt_data)

morpho_data: a data frame with 45215 cells in rows and the corresponding morphological trait measures in columns, along with the corresponding cell identification, strain identification and cell cycle, etc. More details with:

str(morpho_data)

yeast_data: a data frame containing the morphological data (see morpho_data) and the corresponding genotypes of all SNPs (in additional columns) for each cell (in rows). More details with:

str(yeast_data)

yeast_av_data: a data frame containing the average (over all cells) of the morphological measures for each strain and each cell cycle phase with the corresponding genotype (same columns as yeast_data). More details with:

str(yeast_av_data)

strain_id: a data frame containing the identification of the different strains (including other not present in morpho_data and gt_data)

6.5 Data description

We will first conduct a descriptive analysis of the data. During the course of this practical, we will first consider the morphological measure called A101 and the genotype associated to two SNPs, namely YAL069W_1 and YDL200C_427.

Generally, when the dimension of the data is not too large (otherwise see below and afterward), a first exploratory step in data analysis (for quantitative variables) is to consider descriptive statistics, such as position parameters (mean/average, empirical quantiles including the median), dispersion parameters (empirical variance and empirical standard deviation), possibly depending on groups or classes defined by a qualitative variables.

Which graphical representation of the data does give an insight about the distribution (including the position and dispersion) of a variable ?

Solution

For instance, a box-plot (and derivative)

IMPORTANT: first we will focus on the morphological data averaged by strain and cell cycle, i.e. yeast_av_data. The idea here is to avoid dealing with the measure heterogeneity across all cells within each strain (c.f. later).

Construct a boxplot of the value of the variable A101 depending on the genotype associated to the SNPs YAL069W_1 or YDL200C_427 (separately). What can you say about this representation?

Solution

ggplot(yeast_av_data, aes(factor(YAL069W_1), A101)) + geom_boxplot() +
    theme_bw()

ggplot(yeast_av_data, aes(factor(YDL200C_427), A101)) + geom_boxplot() +
    theme_bw()

How to account for the cell cycle in the previous representations? Is it important?

Solution

ggplot(yeast_av_data, aes(factor(YAL069W_1), A101)) + geom_boxplot() +
    facet_wrap(~cell_cycle, labeller = "label_both") +
    theme_bw()

ggplot(yeast_av_data, aes(factor(YDL200C_427), A101)) + geom_boxplot() +
    facet_wrap(~cell_cycle, labeller = "label_both") +
    theme_bw()

What can you infer about the possible relationship between the morphological trait quantified by the variable A101 and the genotype associated to the SNPs YAL069W_1 or YDL200C_427 from the boxplot?

Solution

For all cell cycle phases, the value of the variable A101 seems to depend on the genotype associated to the SNP YAL069W_1. Indeed, the value of the variable A101 is globally higher in presence of the derivative allele.

On contrary, for cell cycle phases "A1B" and "C", the value of the variable A101 does not seem to depend on the genotype associated to the SNP YAL069W_1.

Note: depending on the context and the data, other representations like empirical histogram or empirical density representation of quantitative variables of interest depending on factors (qualitative variables) of interest can be used.

6.6 Linear model and analysis of variance

We will now use a linear model, and especially an analysis of variance (ANOVA), to test the association between the morphological trait A101 and the genotype associated to the SNPs YAL069W_1 or YDL200C_427 (separately).

6.6.1 One-factor ANOVA

Write the linear model for the ANOVA regarding the morphological trait A101 depending on genotype associated to the SNP YAL069W_1?

Solution

Notations:

\(j=1,2\) is the genotype associated to the SNP YAL069W_1 (i.e. 0 for \(j=1\) and 1 for \(j=2\))
\(y_{jr}\) is the value of the morphological trait A101 in the \(r^\text{th}\) individual with genotype \(j\). The associated random variable is \(Y_{jr}\)

Model:

\[ Y_{jr} = \mu_j + \varepsilon_{jr} \]

Assumptions:

we assume that \(\varepsilon_{jr} \sim \mathcal{N}(0, \sigma^2)\)
we assume that \(Y_{jr}\) are independent (hence that \(\varepsilon_{jr}\) are independent)

Equivalent formulation:

\[ Y_{jr} \sim \mathcal{N}(\mu_j, \sigma^2) \]

IMPORTANT: We will first consider only a single cell cycle phase (e.g. the "C" phase) to avoid dealing with the measure heterogeneity across cell cycles (c.f. later)

yeast_av_subdata <- yeast_av_data %>% filter(cell_cycle == "C")

Conduct a one-factor ANOVA between the considered output variable (A101) and each of the considered SNPs (YAL069W_1 or YDL200C_427). Hint: use the functions lm() and anova()

Solution

Between A101 and the SNP YAL069W_1:

anova(lm(A101 ~ factor(YAL069W_1), data = yeast_av_subdata))

Analysis of Variance Table

Response: A101
                  Df   Sum Sq   Mean Sq F value   Pr(>F)   
factor(YAL069W_1)  1 0.004629 0.0046293  8.5298 0.004867 **
Residuals         62 0.033648 0.0005427                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Between A101 and the SNP YAL069W_1:

anova(lm(A101 ~ factor(YDL200C_427), data = yeast_av_subdata))

Analysis of Variance Table

Response: A101
                    Df   Sum Sq    Mean Sq F value Pr(>F)
factor(YDL200C_427)  1 0.000034 0.00003449  0.0559 0.8139
Residuals           62 0.038243 0.00061683

What should we check before interpreting the results of the ANOVA?

Solution

Check the normality of the residuals and the homoskedasticity (constant variance).

Note: you can get the residuals of a model with the residuals() function.

Between A101 and the SNP YAL069W_1:

# linear model
mod1 <- lm(A101 ~ factor(YAL069W_1), data = yeast_av_subdata)

# scatter plot of the residuals
plot(residuals(mod1))
abline(h=0, col = "red")

# empirical distribution of the residuals
hist(residuals(mod1), prob = TRUE) # histogram of empirical frequency (instead of counts)
lines(density(residuals(mod1)), col = "red") # empirical density

# qq-plot
plot(mod1, which = 2)

# check normality
performance::check_normality(mod1)

OK: residuals appear as normally distributed (p = 0.128).

# check homoskedasticity
performance::check_heteroskedasticity(mod1)

OK: Error variance appears to be homoscedastic (p = 0.302).

Between A101 and the SNP YAL069W_1:

# linear model
mod2 <- lm(A101 ~ factor(YDL200C_427), data = yeast_av_subdata)

# scatter plot of the residuals
plot(residuals(mod2))
abline(h=0, col = "red")

# empirical distribution of the residuals
hist(residuals(mod2), prob = TRUE) # histogram of empirical frequency (instead of counts)
lines(density(residuals(mod2)), col = "red") # empirical density

# qq-plot
plot(mod2, which = 2)

# check normality
performance::check_normality(mod2)

OK: residuals appear as normally distributed (p = 0.084).

# check homooskedasticity
performance::check_heteroskedasticity(mod2)

OK: Error variance appears to be homoscedastic (p = 0.766).

Compute the value of the coefficients in the ANOVA model regarding the morphological trait A101 depending on genotype associated to the SNP YAL069W_1 that was defined in the first question? This model is different from the results given by the lm function? How so?

Solution

Between A101 and the SNP YAL069W_1:

# manual computation
yeast_av_subdata %>% group_by(YAL069W_1) %>% summarise(mu=mean(A101))

# A tibble: 2 × 2
  YAL069W_1    mu
      <int> <dbl>
1         0 0.215
2         1 0.232

# automatic computation
lm(A101 ~ factor(YAL069W_1), data = yeast_av_subdata)


Call:
lm(formula = A101 ~ factor(YAL069W_1), data = yeast_av_subdata)

Coefficients:
       (Intercept)  factor(YAL069W_1)1  
           0.21462             0.01704

Model considered in R (with the previous notations):

\[ Y_{jr} = \mu + \beta_j + \varepsilon_{jr} \]

with:

\(\mu\) is the intercept
\(\beta_j\) is the coefficient associated to genotype \(j\), with the constraint: \(\sum_j \beta_j = 0\) (think \(\beta_j = \mu_j - \mu\)), hence you get only \(\beta_j\) for \(j=2\) (i.e. genotype = 1)

Corresponding matrix notation (which requires to reindex \(Y\) and \(E\)):

\[ Y = \mu + X \times B + E \]

where:

\(Y = [y_i]\) is the vector containing the values for the output variable for all \(n\) individuals indexed by \(i=1:n\)
\(B = [\beta_j]\) is the vector of coefficients for the \(p\) different genotypes indexed by \(j = 1:p\) (here \(p=2\) in our example)
\(X = [x_{ij}]\) is the \(\{0,1\}\) indicator that individual \(i\) has genotype \(j\)
\(E = [\varepsilon_i]\) is the vector of residuals for all individuals indexed by \(i\)

Conditional expectation:

\[ \EE(Y_i\ \vert\ X_{ij} = x_{ij}) = \mu + \sum_{i=1}^n x_{ij}\ \beta_j \]

Then \(\mu\) and \(B\) are estimated by a least square linear regression (see here here and here for more details and references).

Interpretation of the ANOVA results: do you find a significant effect of the respective SNPs on the considered morphological trait? Comment the results in relation with your intuition from the analysis of the descriptive statistics above?

Solution

After verifying the normality and homoskedasticity of the residuals (if it is not verified, we cannot use the results from the ANOVA significance test because it assumes a Gaussian model), we find a significant effect of SNP YAL069W_1 and a non-significant effect of SNP YDL200C_427 onto the morphological trait A101 (when focusing on the C cell cycle phase), which confirms our intuition from the graphical representation.

6.6.2 Two-factor ANOVA

We would like to avoid restricting the analysis to a single cell cycle phase, and account for the potential variability of the morphological trait A101 across the different cell cycle phases, which is not explained by the genotype.

Write the linear model for the two-factor ANOVA regarding the morphological trait A101 depending on genotype associated to the SNP YAL069W_1 and the cell cycle phase (stored in the column `cell_cycle)?

Solution

Notation:

\(j=1,2\) indexes the genotype associated to the SNP YAL069W_1 (i.e. 0 for \(j=1\) and 1 for \(j=2\))
\(k=1,2,3\) indexes the cell cycle phase (among "A", "A1B", "C")
\(y_{jkr}\) is the value of the morphological trait A101 in the \(r^\text{th}\) individual with genotype \(j\) in cell cycle \(k\). The associated random variable is \(Y_{jkr}\)

Two-factor ANOVA without interaction:

\[ Y_{jkr} = \mu + \beta_j + \alpha_k + \varepsilon_{jkr} \]

where:

\(\mu\) is the intercept
\(\beta_j\) is the coefficient associated to genotype \(j\), with the constraint: \(\sum_j \beta_j = 0\)
\(\alpha_k\) is the coefficient associated to genotype \(k\), with the constraint: \(\sum_k \alpha_k = 0\)

Two-factor ANOVA with interaction:

\[ Y_{jkr} = \mu + \beta_j + \alpha_k + \gamma_{jk} + \varepsilon_{jkr} \]

where:

\(\beta_j\) is the coefficient associated to the interaction between genotype \(j\) and cell cycle phase \(k\), with the constraints: \(\sum_j \gamma_{jk} = 0\) and \(\sum_k \gamma_{jk} = 0\).

Conduct a one-factor ANOVA between the considered output variable (A101) and a combination of one of the considered SNPs (YAL069W_1 or YDL200C_427) and the cell cycle? (for the moment, we do not consider both SNPs in the same model)

Solution

A101 depending on SNP YAL069W_1 and cell_cycle (without interaction):

anova(lm(A101 ~ factor(YAL069W_1) + cell_cycle, data = yeast_av_data))

Analysis of Variance Table

Response: A101
                   Df   Sum Sq   Mean Sq F value    Pr(>F)    
factor(YAL069W_1)   1 0.012442 0.0124421  11.561 0.0008225 ***
cell_cycle          2 0.051605 0.0258026  23.975 5.317e-10 ***
Residuals         188 0.202330 0.0010762                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: when the variable is numerical but with integer (discrete) values (e.g. YAL069W_1 encoded with 0 and 1), we specify factor() in the model so that the lm() function understands this function as a qualitative and not quantitative variable. When the variable type is not ambiguous (e.g. cell_cycle which is a character variable), it is not necessary to specify that it is a factor() in the linear model, the lm() function assumes it automatically.

A101 depending on SNP YDL200C_427 and cell_cycle (without interaction):

anova(lm(A101 ~ factor(YDL200C_427) + cell_cycle, data = yeast_av_data))

Analysis of Variance Table

Response: A101
                     Df   Sum Sq   Mean Sq F value    Pr(>F)    
factor(YDL200C_427)   1 0.000014 0.0000141  0.0123    0.9118    
cell_cycle            2 0.051605 0.0258026 22.5877 1.617e-09 ***
Residuals           188 0.214758 0.0011423                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction?

A101 depending on SNP YAL069W_1 and cell_cycle (with interaction):

anova(lm(A101 ~ factor(YAL069W_1) * cell_cycle, data = yeast_av_data))

Analysis of Variance Table

Response: A101
                              Df   Sum Sq   Mean Sq F value    Pr(>F)    
factor(YAL069W_1)              1 0.012442 0.0124421 11.4894 0.0008546 ***
cell_cycle                     2 0.051605 0.0258026 23.8268  6.13e-10 ***
factor(YAL069W_1):cell_cycle   2 0.000905 0.0004527  0.4181 0.6589336    
Residuals                    186 0.201424 0.0010829                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A101 depending on SNP YDL200C_427 and cell_cycle (with interaction):

anova(lm(A101 ~ factor(YDL200C_427) * cell_cycle, data = yeast_av_data))

Analysis of Variance Table

Response: A101
                                Df   Sum Sq   Mean Sq F value    Pr(>F)    
factor(YDL200C_427)              1 0.000014 0.0000141  0.0122    0.9122    
cell_cycle                       2 0.051605 0.0258026 22.3711 1.967e-09 ***
factor(YDL200C_427):cell_cycle   2 0.000227 0.0001135  0.0984    0.9063    
Residuals                      186 0.214531 0.0011534                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: in a linear formula in R, factor1 * factor2 is equivalent to factor1 + factor2 + factor1:factor2. The : is used to explicitely specify the interaction between factors.

After doing regular verification for an ANOVA model, interpret the previous results?

Solution

Regular verification (normality and homoskedasticity for the residuals):

A101 depending on SNP YAL069W_1 and cell_cycle (without interaction):

# linear model
mod1 <- lm(A101 ~ factor(YAL069W_1) + cell_cycle, data = yeast_av_data)

# check normality
performance::check_normality(mod1)

Warning: Non-normality of residuals detected (p < .001).

# check homoskedasticity
performance::check_heteroskedasticity(mod1)

Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).

A101 depending on SNP YAL069W_1 and cell_cycle (with interaction):

# linear model
mod2 <- lm(A101 ~ factor(YAL069W_1) * cell_cycle, data = yeast_av_data)

# check normality
performance::check_normality(mod2)

Warning: Non-normality of residuals detected (p < .001).

# check homoskedasticity
performance::check_heteroskedasticity(mod2)

Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).

A101 depending on SNP YDL200C_427 and cell_cycle (without interaction):

# linear model
mod3 <- lm(A101 ~ factor(YDL200C_427) + cell_cycle, data = yeast_av_data)

# check normality
performance::check_normality(mod3)

Warning: Non-normality of residuals detected (p < .001).

# check homoskedasticity
performance::check_heteroskedasticity(mod3)

Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).

A101 depending on SNP YDL200C_427 and cell_cycle (with interaction):

# linear model
mod4 <- lm(A101 ~ factor(YDL200C_427) * cell_cycle, data = yeast_av_data)

# check normality
performance::check_normality(mod4)

Warning: Non-normality of residuals detected (p < .001).

# check homoskedasticity
performance::check_heteroskedasticity(mod4)

Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).

Interpretation and comments:

The prerequisite Gaussian and homoskedasticity assumptions are not verified therefore, in theory, we cannot use the p-values computed by R to assess which factor (or interaction) has an effect or not on the output variable A101.

What can we do in that case?

use variance stabilizing function to transform the output response, like log1p(), i.e. \(x \mapsto \log(x+1)\), or the Anscombe transform
switch to a generalized linear model if relevant, e.g. in case of specific non-Gaussian output variables, like categorical or count variables

We can at least verify the values of the computed coefficients returned by the lm() functions (using the `coefficient) for the different model:

coefficients(mod1)

       (Intercept) factor(YAL069W_1)1      cell_cycleA1B        cell_cycleC 
       0.205166427        0.016131578       -0.028726263        0.009939116

coefficients(mod2)

                     (Intercept)               factor(YAL069W_1)1 
                     0.202608272                      0.020946929 
                   cell_cycleA1B                      cell_cycleC 
                    -0.023125759                      0.012013077 
factor(YAL069W_1)1:cell_cycleA1B   factor(YAL069W_1)1:cell_cycleC 
                    -0.010542125                     -0.003903927

coefficients(mod3)

         (Intercept) factor(YDL200C_427)1        cell_cycleA1B 
        0.2140432349        -0.0005456127        -0.0287262627 
         cell_cycleC 
        0.0099391158

coefficients(mod4)

                       (Intercept)               factor(YDL200C_427)1 
                       0.215756410                       -0.003591258 
                     cell_cycleA1B                        cell_cycleC 
                      -0.031013344                        0.007086671 
factor(YDL200C_427)1:cell_cycleA1B   factor(YDL200C_427)1:cell_cycleC 
                       0.004065923                        0.005071013

The closer to zero coefficient, the lower effect. However, we don’t have any idea of the statistical significance of theses results.

In practice, you need to choose if you are going to illicitly use these p-values or not (knowing that the conditions for the results to be interpretable are not fully verified).

Note on model selection:

When comparing multiple model, e.g. with or without interaction, a simple rule is to start by training with the “richest” model (with all factors and interactions) and then train a more simple model where we remove factors and/or interactions that are not significant (if we can use the p-values).

There exist automatic methods, called model selection approaches to allow to automatically compare and choose the “best” model among a set of candidate models (these types of methods and the comparison criteria they are based on is beyond the scope of the course).

6.7 Dimension reduction

In the following, we want to consider all the SNPs in the dataset and not just one or two.

6.7.1 Genotype data

Here is a heatmap representation of the genotype table regarding all SNPs (in columns) for each strain (in rows): each cell of the image represent the genotype for the corresponding SNP in the corresponding strain.

What can you say about the heatmap ?

Solution

We observe a lot of variability in the genotypes along all SNPs between the different strains. Some SNPs seems to be less variable than other. It appears quite difficult to extract information from this representation. We would need to summarize the information contained in the genotype table to get an overview of these information.

How could we visualize both the information contained in the SNP data and the morphological trait data (we will still focus on a single morphological variable, e.g. A101)?

Solution

We can use a dimension reduction approach, like PCA.

6.7.2 PCA vs Correspondence analysis

Use a PCA dimension reduction on the SNP data and use a visualization of the results colored by the corresponding value of the morphological trait A101 in each strain?

Here is the code to extract the data. Again, we focus on a specific cell cycle.

# extract genotype data regarding all SNP
data_for_pca <- yeast_av_data %>%
    # consider a specific cell cycle
    filter(cell_cycle == "A") %>%
    # select genotype data
    select(c(26:ncol(.))) %>%
    # remove SNPs with too many NAs
    dplyr::select(which(colSums(is.na(.)) == 0))

# extract corresponding value for morphological trait
A101_value <- yeast_av_data %>%
    # consider a specific cell cycle
    filter(cell_cycle == "A") %>%
    # extract result
    select(A101) %>% unlist()

Solution

# run PCA
pca_res <- prcomp(data_for_pca)

# PCA indiv plot
fviz_pca_ind(
    pca_res,
    geom = "point",
    col.ind = A101_value, 
    gradient.cols = c("#2E9FDF", "#FC4E07")
)

We can use a discretized scale to get a better visualization effect:

# discretize the continuous morphological variable
discrete_A101 <- yeast_av_data %>%
    # consider a specific cell cycle
    filter(cell_cycle == "A") %>%
    # discretize
    mutate(A101 = cut(A101, breaks = seq(0,40,by=5)/100)) %>%
    # extract result
    select(A101) %>% unlist() %>% as.character()

# PCA indiv plot
fviz_pca_ind(
    pca_res,
    geom = "point",
    col.ind = discrete_A101
)

What can you say about this representation?

Solution

The structure of the SNP data extracted by the first two components of PCA does not allow to discriminate between the different level of the morphological trait A101. The variability in the data explained by the first two components is not very high though.

Maybe we can use a different dimension reduction approach that is more suitable for the SNP data. If we consider SNP genotype as categorical variables, we can use the multiple correspondence analysis (MCA) method which is another linear dimension reduction approach but adatped to categorical variables.

Here are the results of such an approach:

# extract genotype data regarding all SNP and convert them to factors
data_for_mca <- yeast_av_data %>%
    # consider a specific cell cycle
    filter(cell_cycle == "A") %>%
    # select genotype data
    select(c(26:ncol(.))) %>%
    # remove SNPs with too many NAs
    dplyr::select(which(colSums(is.na(.)) == 0)) %>%
    # transform all columns in factors
    mutate(across(everything(), ~as.factor(.x)))

# run MCA
mca_res <- MCA(data_for_mca, graph=FALSE)

# MCA indiv plot
fviz_mca_ind(
    mca_res,
    geom = "point",
    col.ind = discrete_A101
)

Is the representation obtained by MCA better than the one obtained with PCA?

Solution

Not better, maybe we could try some supervised dimension reduction approaches (c.f. next note) or some non-linear dimension reduction approaches (c.f. previous practical subject).

Note on supervised dimension reduction and mulitple output variables: here we consider a single morphological trait as output variable, however this output variable is not used in the dimension reduction process. One could aim at finding the reduced-dimension representation of the data that best relate to this output variable. To solve this problem, one can use the partial least squares regression (PLS) method which is specifically able to find a latent dimension that relate to an output variable. In addition, there exist specific approaches, such as PLS and also canonical correlation analysis (CCA), that allows to perform this supervised dimension reduction by accouting for multiple output variables (e.g. multiple morphological traits in our example).

6.8 Multiple testing

The exploration of the SNP data by linear dimension reduction approaches did not allow to detect a specific global structure linking the SNP genotypes and the morphological trait measure A101. However, we still want to investigate the possible effect of the genotypes of all the recorded SNP on the morphological trait A101.

To do so, we are going to run an ANOVA for each SNP.

Given the number of SNPs in the data (i.e. 2820), what could be the risk when running such an analysis?

Solution

We are going to do thousands of tests, computing and using thousands of p-values to assess the potential significant effect of each SNP on the considered morphological trait.

We have a non-negligible risk to wrongly reject the null hypothesis for many of the SNPs, and conclude to a non-existing significant effect.

Thus, we have to use p-values correction (or adjustment) procedure adapted to the case of multiple testing.

We run the two-factor ANOVA model for all SNP (accounting for the considered SNP effect and the cell cycle effect).

test_result <- yeast_av_data %>%
    # run all ANOVA tests
    summarize(
        across(
            any_of(gt_data$RQTL_name), 
               ~anova(lm(A101~factor(.x) + cell_cycle))$`Pr(>F)`[1]
        )
    ) %>%
    # reformat the output (to get a data.frame [SNP, p_values])
    t() %>% as.data.frame() %>% dplyr::rename("p_values" = "V1") %>% 
    rownames_to_column("SNP") %>%
    # extract SNP index (to facilitate graphical representation)
    mutate(SNP_index = as.integer(as.factor(SNP)))

What should we verify?

Solution

In theory, we should verify the Gaussian and homoskedasticity assumptions for the residuals in each model (i.e. 2820 models). In practice, it can be cumbersome… (but it would be a flaw in the analysis).

Here is the result of the analysis. We plot the p_values for all SNPs.

ggplot(test_result) + geom_point(aes(x=SNP_index, y=p_values)) +
    geom_hline(yintercept = 0.05, col = "red") +
    annotate("text", label = "alpha = 5%", x=100, y=0.1, col = "red") +
    theme_bw()

What can you say about these results?

Solution

All SNPs for which the p-value is below the horizontal line are found to have a significant effect on the morphological trait A101.

It corresponding to hundred of genes:

sum(test_result$p_values <= 0.05)

[1] 612

But what about the p-value correction?

The p-values should be adjusted in the context of multiple testing. Try different method to adjust the p-values (see the function p.adjust()).

Solution

test_result <- test_result %>%
    mutate(
        bonf_adj_p_values = p.adjust(p_values, method="bonferroni"),
        fdr_adj_p_values = p.adjust(p_values, method="fdr"),
    )

Here is the distribution of the p-values after and before adjustement (we used the Bonferroni and the FDR adjustment approaches):

ggplot(
    test_result %>% pivot_longer(!c(SNP, SNP_index), names_to = "type", values_to = "p_values")
) + geom_histogram(aes(x=p_values)) +
    facet_wrap(~type) + 
    geom_vline(xintercept = 0.05, col = "red") +
    annotate("text", label = "alpha = 5%", x=0.1, y=1000, angle=90, col = "red") +
    theme_bw()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What can you say about the different corrections?

Solution

After the Bonferronni correction, no significant effect is found anymore. It tend to be very conservative.

The FDR correction reduces the number of found significant effects:

sum(test_result$fdr_adj_p_values <= 0.05)

[1] 198

Here is another way to compare the FDR-adjusted and original p-values:

ggplot(test_result) + geom_line(aes(x=p_values, y=fdr_adj_p_values)) + 
    geom_abline(slope=1, intercept=0, col = "red") +
    annotate("text", label = "y=x", x=0.5, y=0.45, angle=45, col = "red") +
    theme_bw()

Eventually, we can also compare the number of significant SNPs found before and after p-value correction depending on the chosen type I risk \(\alpha\).

# compute the number of significant SNPs
signif_result <- data.frame(alpha = seq(0, 1, 0.001)) %>%
    rowwise() %>%
    mutate(
        signif = sum(test_result$p_values <= alpha),
        signif_adj = sum(test_result$fdr_adj_p_values <= alpha)
    )

# graphical representation of the number of significant SNPs depending on alpha
ggplot(
    signif_result %>% pivot_longer(!alpha, names_to = "type", values_to = "number_snp")
) + 
    geom_line(aes(x = alpha, y=number_snp, col=type)) +
    geom_vline(xintercept = 0.05, col = "red") +
    annotate("text", label = "alpha = 5%", x=0.1, y=2000, angle=90, col = "red") +
    theme_bw()

Extract the top50 SNPs with the most significant effects.

Solution

test_result %>%
    arrange(fdr_adj_p_values) %>%
    select(!bonf_adj_p_values) %>%
    head(50)

            SNP     p_values SNP_index fdr_adj_p_values
1   YCL032W_344 1.143022e-04      1051       0.01349748
2   YCL026C_345 1.175937e-04      1050       0.01349748
3   YCL023C_347 1.340176e-04      1048       0.01349748
4   YCL022C_348 9.571504e-05      1047       0.01349748
5     gCL01_350 8.850660e-05        25       0.01349748
6   YCL018W_359 8.683893e-05      1036       0.01349748
7   YCL018W_360 7.954324e-05      1037       0.01349748
8   YCL018W_361 8.683893e-05      1038       0.01349748
9   YCL018W_362 8.683893e-05      1039       0.01349748
10  YCL018W_363 8.683893e-05      1040       0.01349748
11  YCL018W_364 8.683893e-05      1041       0.01349748
12  YCL018W_365 8.683893e-05      1042       0.01349748
13  YCL018W_366 3.156126e-05      1043       0.01349748
14  YCL018W_367 8.683893e-05      1044       0.01349748
15  YCL018W_368 8.683893e-05      1045       0.01349748
16  YCL018W_369 8.683893e-05      1046       0.01349748
17  NCR015C_372 1.338310e-04       404       0.01349748
18  YFR045W_862 6.205391e-05      1430       0.01349748
19 YHR095W_1279 1.338548e-04      1694       0.01349748
20 YHR104W_1282 7.856486e-05      1695       0.01349748
21 YPL242C_2803 3.349920e-05      2795       0.01349748
22 YPL238C_2804 1.035004e-04      2794       0.01349748
23 YPL235W_2805 1.101901e-04      2793       0.01349748
24 YPL234C_2806 5.169574e-05      2791       0.01349748
25 YPL234C_2807 1.101901e-04      2792       0.01349748
26 YPL233W_2808 1.101901e-04      2790       0.01349748
27   gPL03_2809 1.101901e-04       294       0.01349748
28 YPL231W_2810 1.291664e-04      2789       0.01349748
29  YCL025C_346 1.440175e-04      1049       0.01390326
30 YPL246C_2801 1.479070e-04      2798       0.01390326
31    gCL01_349 2.371728e-04        24       0.01529711
32    gCL01_351 2.371728e-04        26       0.01529711
33    gCL01_352 2.371728e-04        27       0.01529711
34    gCL01_353 2.371728e-04        28       0.01529711
35    gCL01_354 2.371728e-04        29       0.01529711
36    gCL01_355 2.371728e-04        30       0.01529711
37    gCL01_357 2.371728e-04        32       0.01529711
38    gCL01_358 2.371728e-04        33       0.01529711
39  NCR015C_373 2.441028e-04       405       0.01529711
40  NCR015C_374 2.441028e-04       406       0.01529711
41  YEL068C_617 1.921348e-04      1288       0.01529711
42  YEL068C_618 1.921348e-04      1289       0.01529711
43   gHR04_1280 2.208442e-04        99       0.01529711
44   gHR04_1281 2.208442e-04       100       0.01529711
45 YLR074C_1972 2.321278e-04      2177       0.01529711
46  YDR251W_550 2.846138e-04      1206       0.01672106
47  YDR252W_551 2.846138e-04      1207       0.01672106
48 YNL220W_2414 2.810908e-04      2504       0.01672106
49 YLR074C_1971 3.013515e-04      2176       0.01734309
50    YAL069W_1 8.225020e-04       880       0.01743952

What can we do with these results?

Solution

investigate the corresponding genomic region containing these SNPs (is it a gene, etc.)
run a confirmation study (for instance by inducing mutation in these SNPs)
etc.

6.9 Full data analysis

Open (and optional) question: run the previous analysis to find SNPs with significant effect on the morphological trait A101 with the full dataset yeast_data, i.e. without the average by strain for the morphological traits.

In this case, you will have to account for the strain_id factor in the ANOVA.

You could also try to consider linear model integrating multiple SNPs (instead of testing each SNP independently). However be aware of potential dimension issues. You will also encounter the question of model selection in this particular setting (which SNP do I integrate in my model?).

6.10 Appendix (optional): Simpon’s paradox and confounding factors

In the palmerpenguins dataset, we have different morphological measurements for numerous individuals of penguins of different species. We want to investigate a possible link between the bill depth vs the bill length for these penguins.

Reminder about the anatomy of a penguin:

Here is a scatter plot representing the bill depth vs the bill length for all individuals in the dataset.

What can you say about this representation? Can you infer any relationship between bill depth and length in penguins?

Solution

There seems to be a negative relationship between bill depth and length in penguins. As the length increases, the depth decreases.

Could we question this result?

Solution

Here is the same representation discriminated by species:

We obtained opposite conclusions : in all species, there seems to be a positive relationship between bill depth and length. As the length increases, the depth also increases.

The species factor is called a confounding factors: a variable (potentially unknown) that influences both variables resulting in a spurious association.

In this particular case, the change of trend when accounting for the groups of individuals is called the Simpon’s paradox.

Here is another example from the following publication:

[2] Appleton, David & French, Joyce & Vanderpump, Mark. (1996). Ignoring a Covariate: An Example of Simpson’s Paradox. American Statistician - AMER STATIST. 50. 340-341. 10.1080/00031305.1996.10473563.

After a first survey in the 1970s, a follow-up study in 1992-94 (in Whickham a mixed urban and rural district near Newcastle upon Tyne, United Kingdom) investigated (among other things) the 20-year after survival of subjects included in the original study.

Here are the results (dataset from [2]) regarding the survival of the women in the study depending on their smoking status (“smoker” or “non smoker”):

smoker_status	dead	alive	survival_rate
non_smoker	230	502	68.579%
smoker	139	443	76.117%

What can you say about these results ?

Solution

The survival rate is larger in the “smoker” sub-group than in the “not smoker” sub-group. Surprising?

What could explain this surprising result?

Solution

A confounding variable: the age of the person in the original study.

There is a sampling bias in this dataset. The number of older non-smoking women (thus with a higher mortality risk) is very high compared to the other groups, hence decreasing the global survival rate in the non-smoking group.

age	smoker_status	dead	alive	survival_rate
18-24	non_smoker	1	61	98.387%
25-34	non_smoker	5	152	96.815%
35-44	non_smoker	7	114	94.215%
45-54	non_smoker	12	66	84.615%
55-64	non_smoker	40	81	66.942%
65-74	non_smoker	101	28	21.705%
75+	non_smoker	64	0	0%
18-24	smoker	2	53	96.364%
25-34	smoker	3	121	97.581%
35-44	smoker	14	95	87.156%
45-54	smoker	27	103	79.231%
55-64	smoker	51	64	55.652%
65-74	smoker	29	7	19.444%
75+	smoker	13	0	0%

Note: in the two previous examples, we did not focus on quantifying any statistical significance of the different effects that we discussed (that would require using statistical models). The point was to give you the intuition about potential issues and misleading results caused by any forgotten confounding factors and the Simpson’s paradox in an analysis.

How to avoid drawing false conclusions because of confounding factors?

Solution

Use randomization to build your experiment, like randomized controlled trial, possibly with double-blinding, to remove the potential effect of confounding variables (should be used for any serious drug trial), or at least control the potential sampling bias caused by confounding factors (like case-control studies, cohort studies, stratification).
If not possible (it is not always possible depending on the design of the experiment and/or the object of the study), measure and log various metadata regarding your subjects/individuals so that you will be able to account for the potential effect of confounding variables in your analysis (c.f. later).

It generally requires a certain level of technical expertise/knowledge in the considered subject to be able to identify potential confounding factors before the experiments (so that you can monitor and log the corresponding quantities during your experiment).

More details here.

The .Rmd file corresponding to this page is available here under the AGPL3 Licence