# required packages
library(tidyverse)
Tutorial 4: Analysis of Variance (ANOVA)
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
\[ \def\BB{{\mathcal{B}}} \def\EE{{\mathbb{E}}} \def\FF{{\mathcal{F}}} \def\NN{{\mathcal{N}}} \def\poi{{\mathcal{P}}} \def\PP{{\mathbb{P}}} \def\TT{{\mathcal{T}}} \def\UU{{\mathcal{U}}} \def\VV{{\mathbb{V}}} \let\epsilon\varepsilon \]
Requirements
We first load the tidyverse
meta-package that we will be needing.
Note: we will need the
ggplot2
,dplyr
,readr
,tidyr
andtibble
packages from thetidyverse
meta-package. Thus, if you encounter issues when installingtidyverse
, you can only install these four packages.
1-factor ANOVA
We will again consider the penguins
dataset.
library(palmerpenguins)
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
We remove individuals with missing values
<- penguins %>% drop_na()
penguins penguins
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
7 Adelie Torgersen 39.2 19.6 195 4675
8 Adelie Torgersen 41.1 17.6 182 3200
9 Adelie Torgersen 38.6 21.2 191 3800
10 Adelie Torgersen 34.6 21.1 198 4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>
We question the potential relationship between the body mass of the penguins and their species. In other words, is the body mass of the penguins generally different in different species?
How could we answer this question? maybe a method that we have seen in the previous tutorials? What could be the limitation?
Solution
We want to compare a continuous feature in different populations, we could use a Student test to verify if the average body mass is different in two populations of penguins from different species.
However, here we have 3 species, hence 3 populations:
table(penguins$species)
Adelie Chinstrap Gentoo
146 68 119
Are we going to consider multiple comparisons between the different possible pairs of species (Adelie vs Chinstrap, Adelie vs Gentoo, Chinstrap vs Gentoo) ?
Solution
No, we are going to use a model that will consider all species together.
What could be this model that would allow us to compare the body mass of the penguins in the 3 different species at the same time?
Solution
We are going to use a linear model and an analysis of variance (a.k.a. ANOVA).
Data exploration
Before building a complex statistical model, we could first verify if the question of the relationship between the penguin species and their body mass is relevant regarding our sampled data with a qualitative and visual study.
How could we get an intuition about the possible difference of body mass between the 3 species of penguins in a very simple way? What can we say about it?
Solution
We can compute the body mass average in the sampled data for each species:
%>%
penguins group_by(species) %>%
summarise(body_mass_av = mean(body_mass_g))
# A tibble: 3 × 2
species body_mass_av
<fct> <dbl>
1 Adelie 3706.
2 Chinstrap 3733.
3 Gentoo 5092.
The penguins body mass seems to be quite different in average across the different species in the sample.
Which graphical representations could we use to get more insights about the previous results?
Solution
We could draw a boxplot of the body mass depending on the species:
ggplot(penguins, aes(x = species, y = body_mass_g)) +
geom_boxplot(aes(fill = species)) +
theme_bw()
We could also check the empirical distribution of the body mass in the sample depending on the species:
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(aes(fill = species), alpha = 0.5, position = "identity") +
theme_bw()
In both representations, the empirical distribution of the body mass in the sample seems to be clearly different across the 3 species, especially for the Gentoo penguins versus the Adelie and Chinstrap penguins that seems to have similar body mass.
The ANOVA model
We are going to focus on the linear model and the analysis of variance, a.k.a the ANOVA, to stastically quantify the apparent difference of body mass between the 3 penguin species.
What is the main difference with all the previous models and methods that we have considered up till now?
Solution
With the linear model and in the particulat case of the ANOVA, we are now considering a statistical method that is directly designed to model the possible relationship between two variables. In particular we are going to question if the values observed for one or more factors, a.k.a. explanatory variables, may be used to explain the values observed for another, a.k.a. the response variable or the variable to be explained.
What type of variables are suitable for the ANOVA model?
Solution
The factor variables will be one or many categorical variables. The response variable will be a continuous variable.
What is the natural model that we are going to consider in the ANOVA?
Solution
We denote by \(Y_{ij}\) the observed body mass of the penguin \(j\) in the group or class \(i\). Here, the different groups \(i=1,...,I\) are the different species (\(I = 3\)), and there are \(j=1,...n_i\) penguins in each group.
Here are the group headcounts in the sample:
%>%
penguins group_by(species) %>%
summarise(n_i = n())
# A tibble: 3 × 2
species n_i
<fct> <int>
1 Adelie 146
2 Chinstrap 68
3 Gentoo 119
We note \(\mu_i\) the average body mass in species \(i\), then the model will be: \[Y_{ij} = \mu_i + \epsilon_{ij}\] where \(\epsilon_{ij}\) is the error term or residual for the body mass of the penguin \(j\) in species \(i\).
Important: what assumptions are we going to make to complete this model and go further to use it?
Solution
We are going to assume that the residuals \(\varepsilon_{ij}\) are independent and identically distributed (i.i.d.) random Gaussian variable of mean 0 and variance \(\sigma^2\), i.e. \[\epsilon_{ij} \sim \NN(0, \sigma^2)\]
Note:
The residuals are assumed to be centered Gaussian variables (i.e. Gaussian variables of mean = 0)
The variance of the residuals is assumed to be constant across all individuals in all groups: which is called homoskedasticity.
Linear model for the ANOVA in practice
In R, we will use the lm()
function to build a linear model and the anova()
function to get the ANOVA result.
We build the linear model of the body mass depending on the species:
<- lm(body_mass_g ~ species, data = penguins)
mod1 mod1
Call:
lm(formula = body_mass_g ~ species, data = penguins)
Coefficients:
(Intercept) speciesChinstrap speciesGentoo
3706.16 26.92 1386.27
The ANOVA model is encoded in R: <response_variable> ~ <factor>
.
What can you say about this model? Does it correspond to the ANOVA model that we wrote earlier?
Solution
It seems that we do not have one coefficients per species, but only an “intercept” and one coefficient for the Chinstrap and Gentoo species.
Note: you can use the following command to build the model \(Y_{ij} = \mu_i + \epsilon_{ij}\) (for the body mass of penguin \(j\) in species \(i\)) that we defined earlier:
<- lm(body_mass_g ~ 0 + species, data = penguins)
mod0 mod0
Call:
lm(formula = body_mass_g ~ 0 + species, data = penguins)
Coefficients:
speciesAdelie speciesChinstrap speciesGentoo
3706 3733 5092
or equivalently:
<- lm(body_mass_g ~ -1 + species, data = penguins)
mod0 mod0
Call:
lm(formula = body_mass_g ~ -1 + species, data = penguins)
Coefficients:
speciesAdelie speciesChinstrap speciesGentoo
3706 3733 5092
We can check that the \(\mu_i\) coefficients are indeed the species body mass empirical averages:
%>%
penguins group_by(species) %>%
summarise(body_mass_av = mean(body_mass_g))
# A tibble: 3 × 2
species body_mass_av
<fct> <dbl>
1 Adelie 3706.
2 Chinstrap 3733.
3 Gentoo 5092.
What ANOVA model is used in R?
Solution
R is using the following model: \[Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\] where \(\mu\) is the global average (a.k.a. the “intercept”) and \(\alpha_i\) is the effect of species \(i\).
Intuitively, \(\alpha_i\) is defined such that \(\alpha_i = \mu_i - \mu\) where \(\mu_i\) is the average body mass in species \(i\) that we defined earlier in the natural model for the ANOVA.
Why is there no \(\alpha_i\) coefficient for the species Adelie in the R output?
mod1
Call:
lm(formula = body_mass_g ~ species, data = penguins)
Coefficients:
(Intercept) speciesChinstrap speciesGentoo
3706.16 26.92 1386.27
Solution
For the model to be identifiable (i.e. for a unique solution to exist), we need to put constraints on the \(\alpha_i\) coefficients, either \(\sum_{i=1}^I \alpha_i = 0\), or set \(\alpha_1 = 0\) (or \(\alpha_I = 0\)) for a specific group/class (which becomes the reference group/class).
The lm()
function is using the second option: \(\alpha_1 = 0\) for the Adelie species.
Note: we can use the lsmeans()
function from the emmeans
package to compute the \(\mu_i\) in the \(Y_{ij} = \mu_i + \epsilon_{ij}\) from the ANOVA model \(Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\).
library(emmeans)
lsmeans(mod1, "species")
species lsmean SE df lower.CL upper.CL
Adelie 3706 38.1 330 3631 3781
Chinstrap 3733 55.9 330 3623 3843
Gentoo 5092 42.2 330 5009 5176
Confidence level used: 0.95
It also gives the estimated standard deviation for each factor, and a 95% Gaussian-based confidence interval for the effect of the modality.
Residuals
To go further with the ANOVA model and statistically quantify the effect of the species on the body mass, we need to verify that the ANOVA model is relevant regarding our data.
What assumptions did we made about the residuals?
Solution
We assume the residuals to be centered Gaussian variables (i.e. with mean = 0) and constant variance (homoskedasticity)?
Here is a command to extract the residuals of the model:
<- residuals(mod1) resid1
And we draw the graph of the observed residuals (or empirical residuals) depending on the group (here the species):
ggplot(
%>% mutate(resid = residuals(mod1)),
penguins aes(x = species, y = resid)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
theme_bw()
What can you say about this representation?
Solution
The observed residuals seems to be centered around 0, and their variance is approximately constant over the different groups/classes, which are good clues that some of the assumptions that are required regarding the residuals are verified here.
How can we check the previous assumptions about the residuals more thoroughly?
Solution
We have to check that the empirical residuals are approximately Gaussian, with mean = 0 and constant variance.
The residuals plot was a first step. We can check the values of the empirical mean and empirical variance of the observed residuals globally and within each species.
mean(resid1)
[1] 7.064346e-15
var(resid1)
[1] 211052.6
%>% mutate(resid = residuals(mod1)) %>%
penguins group_by(species) %>%
summarise(resid_mean = mean(resid), var_resid = var(resid))
# A tibble: 3 × 3
species resid_mean var_resid
<fct> <dbl> <dbl>
1 Adelie -6.61e-15 210332.
2 Chinstrap 1.27e-14 147713.
3 Gentoo 2.06e-14 251478.
The residual mean is approximately 0. The variance seems to vary across the different group, it is approximately \(200000\pm 50000\), however we will see below that these difference are statistically significant (we will see below that we cannot reject the homoskedasticity hypothesis with a 5% risk).
Then, we can verify the Gaussian assumption with a qq-plot. Here we compared the empirical quantiles of the observed standardized residuals (obtained with the function rstandard()
) to the theoretical quantiles of a \(\NN(0,1)\) standard Gaussian distribution:
qqnorm(rstandard(mod1))
abline(a=0, b=1, col = "red") # y=x line
We can say here that the residuals are approximately Gaussian.
What test could we use to verify the assumptions on the residuals?
Solution
To verify the Gaussian assumptions on the residuals, we can use a Kolmogorov-Smirnov test (comparing the observed residuals to Gaussian random values) or the Shapiro-Wilk test of normality
ks.test(rstandard(mod1), pnorm)
Asymptotic one-sample Kolmogorov-Smirnov test
data: rstandard(mod1)
D = 0.041618, p-value = 0.6113
alternative hypothesis: two-sided
We cannot reject the hypothesis that the standardized residuals and some Gaussian data (drawn from the \(\NN(0,1)\) distributions) are drawn from the same distribution at a risk \(\alpha=5\%\) or lower.
shapiro.test(residuals(mod1))
Shapiro-Wilk normality test
data: residuals(mod1)
W = 0.9922, p-value = 0.07835
We cannot the hypothesis that the residuals are Gaussian at a risk \(\alpha=5\%\) or lower.
The performance
package also provides functions to statistically test the normality and homoskedasticity of the residuals of a linear model in R:
- the
check_normality()
uses a Shapiro-Wilk test of normality (\(H_0\): “the data are Gaussian”) - the
check_heteroskedasticity()
uses a Breusch-Pagan test (\(H_0\): “the variance is constant”)
library(performance)
# check normality
check_normality(mod1)
OK: residuals appear as normally distributed (p = 0.079).
# check homoskedasticity
check_heteroskedasticity(mod1)
OK: Error variance appears to be homoscedastic (p = 0.079).
So here, we conclude that the residuals are approximately Gaussian, centered and of constant variance.
As a consequence, we will be able to test the significance of the model and of the coefficients.
Without the validations of the Gaussian (with mean=0) and homoskedasticity assumptions, we cannot use what follows in the next sections (despite R being able to compute these results).
Interpreting the ANOVA results
We have the following model, and we validated the Gaussian (with mean=0) and homoskedasticity assumptions about the residuals.
<- lm(body_mass_g ~ species, data = penguins)
mod1 mod1
Call:
lm(formula = body_mass_g ~ species, data = penguins)
Coefficients:
(Intercept) speciesChinstrap speciesGentoo
3706.16 26.92 1386.27
Geometric goodness-of-fit
We can get information about the ANOVA results with the anova()
function:
anova(mod1)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
species 2 145190219 72595110 341.89 < 2.2e-16 ***
Residuals 330 70069447 212332
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Use the various sums of squares (Sum Sq
) to verify if the factor species has an effect on the body mass?
Solution
We can compute the ratio \(\frac{MSS}{RSS}\) where \(MSS\) quantifies the variance explained by the model (Model Sum of Squares) and \(RSS\) quantifies the residual variance (Residual Sum of Squares).
Here, we have (from the anova()
function output) \(\frac{MSS}{RSS} = \frac{145190219}{70069447} =\) 2.0720903 then \(MSS > RSS\) and we can say that the species factor seems to have an effect on the body mass.
Then compute the \(R^2\) (R square) coefficient of the model?
Solution
The \(R^2\) coefficient is \(R^2 = \frac{MSS}{TSS}\) where \(TSS\) quantifies the total variance (Total Sum of Squares). From the Pythagorean theorem, we have \(TSS = MSS + RSS\), then we also have \(R^2 = 1 - \frac{RSS}{TSS}\)
The \(R^2\) coefficient is also the squared correlation coefficient between the observed values for the ouput variable \(y_{ij}\) and the corresponding values predicted by the model \(\hat{y}_{ij} = \hat{y}_{i\cdot} = \hat{\mu} + \hat{\alpha}_i = \hat{\mu}_i\) (the group wise empirical mean), where \(\hat{\mu}\) and \(\hat{\alpha}_i\) are the estimations of the corresponding coefficients.
Here, we have (from the anova()
function output) \(R^2 = \frac{MSS}{MSS + RSS} = \frac{145190219}{145190219+70069447} =\) 0.6744887.
What can you say about the model from the \(R^2\) value here?
Solution
The \(R^2\) coefficient is closer to 1 than 0, which again suggests an effect of the species on the body mass.
The variance decomposition (or sum of squares decomposition) and \(R^2\) coefficient are geometric quantifications of the model goodness-of-fit (that are not based on the Gaussian nor the homoskedasticity assumptions).
Factor effect
Now, we are going to question and quantify the statistical significance of the model.
Other information are given by the anova()
function, especially, for the considered factor, an \(F\)-value (Fisher value, F value
) and the corresponding \(p\)-value (Pr(>F)
) with a code indicating the level of significance.
anova(mod1)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
species 2 145190219 72595110 341.89 < 2.2e-16 ***
Residuals 330 70069447 212332
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What can you say about this information?
Solution
The \(F\)-value is the observed value for the Fisher statistic related to the following test:
- \(H_0\): \(\alpha_1 = ... = \alpha_i = ... = \alpha_I = 0\)
- \(H_1\): there exist at least one \(i\) such that \(\alpha_i \ne 0\)
Here, we question the global effect of the factor (here the species) on the response variable (here the body mass).
The output also gives us the degrees of freedom (Df
) for the factor, i.e. the number of groups minus 1 (\(I - 1\)), and the degrees of freedom for the residuals, i.e. the number of observations minus the number of coefficients in the model (here \(n - I\)).
The corresponding Fisher statistics follows a \(\FF(I-1, n-I)\) Fisher distribution under the null hypothesis (and under the Gaussian with mean=0 and homoskedasticity assumptions about the residuals).
This test does not give any insight about which modality of the considered factor has an actual effect on the response variable.
What is the conclusion of the Fisher test here regarding the effect of the species on the body mass?
Solution
Here, the \(p\)-value is very small and we can reject the null hypothesis with a risk \(\alpha = 5\%\). There is a significant effect of the species on the body mass.
Important: we can call the anova()
function for any linear model with factors, but we need to first check that the Gaussian (with mean=0) and homoskedasticity assumptions about the residuals are verified to be able to interpret the results about the Fisher test. In case of non Gaussian or heteroskedastic residuals, it makes no sense to interpret this output.
How is computed the column Mean Sq
in the anova()
function output?
Solution
It is the sum of squares (Sum Sq
) divided by the number of degrees of freedom.
Factor modality effect
Now, with the following function, we can get additional information about our model:
summary(mod1)
Call:
lm(formula = body_mass_g ~ species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1142.44 -342.44 -33.09 307.56 1207.56
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3706.16 38.14 97.184 <2e-16 ***
speciesChinstrap 26.92 67.65 0.398 0.691
speciesGentoo 1386.27 56.91 24.359 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 460.8 on 330 degrees of freedom
Multiple R-squared: 0.6745, Adjusted R-squared: 0.6725
F-statistic: 341.9 on 2 and 330 DF, p-value: < 2.2e-16
What can you say about this output?
Solution
We have a lot of information here:
- a reminder of the model
- information about the distribution of the residuals
- a table with the coefficient estimations and related information (c.f. below)
- the standard deviation of the residuals and the corresponding degrees of freedom
Multiple R-squared
gives the \(R^2\) coefficient (check that the value is the same that we computed). TheAdjusted R-squared
is a modified version of the \(R^2\) that has been adjusted for the number of coefficients in the model (we will see this one later).- a Fisher statistic value (
F-statistic
) with the corresponding degrees of freedom and \(p\)-value (we will explain this one later).
Reminder: we are considering the model \(Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\)
The table with the coefficient estimations gives:
- the estimation \(\hat{\mu}\) of the \(\mu\) coefficient (
Intercept
) - the estimation \(\hat{\alpha}_i\) of the \(\alpha_i\) coefficient for each species \(i\)
- the corresponding estimated standard deviation of the coefficient
Std. Error
- the corresponding value of a Student statistic (
t value
) and a corresponding \(p\)-value (Pr(>|t|)
)
Why is there no coefficient for the Adelie species?
Solution
Reminder: we need a constraint on the \(\alpha_i\) for the model to be identifiable and R using the constraint \(\alpha_1 = 0\) which corresponds to the Adelie species coefficient. The Adelie species is the reference group/class.
What is this Student \(T\) statistics? To which test is it related?
Solution
For each coefficient, including the intercept \(\mu\) and the \(\alpha_i\) for each species (except the Adelie species for which the coefficient \(\alpha_1\) is forced to 0), we can run the following test:
- \(H_0\): the coefficient is null
- \(H_1\): the coefficient is not null
e.g. for a given species \(i\), we can test \(H_0\): \(\alpha_i = 0\) versus \(H_1\): \(\alpha_i \ne 0\).
Under the null hypothesis (and assuming that the residuals are Gaussian and homoskedastic), the test statistic follows a \(\TT(n-I)\) Student distribution.
Here, the corresponding \(p\)-value is also computed and the significance level is encoded.
Important: again, if the Gaussian (with mean=0) and homoskedasticity assumptions about the residuals are not verified, we cannot interpret the results about the Student tests.
What do you conclude here?
Solution
We conclude that the Chinstrap and Gentoo species both have a significant effect on the penguin body mass but there is a limitation here: in practice we should use other tools to question the effect of the modality of the factors (c.f. the following questions).
What about the particular effect of the Adelie species?
Solution
The output of the summary()
function for the model \(Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\) is not very useful to question which modality of the factor has an effect on the response because the coefficient \(\alpha_1\) is set to 0 by constraints for identifiability reason.
Therefore, this output cannot be used to assess the effect of all modality of the factor, here all species, individually.
A force workaround is to use the following model to estimate the effect of all classes of the factor (without considering an intercept):
<- lm(body_mass_g ~ 0 + species, data = penguins)
mod0 check_normality(mod0)
OK: residuals appear as normally distributed (p = 0.079).
check_heteroscedasticity(mod0)
OK: Error variance appears to be homoscedastic (p = 0.079).
anova(mod0)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
species 3 6039066803 2013022268 9480.6 < 2.2e-16 ***
Residuals 330 70069447 212332
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod0)
Call:
lm(formula = body_mass_g ~ 0 + species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1142.44 -342.44 -33.09 307.56 1207.56
Coefficients:
Estimate Std. Error t value Pr(>|t|)
speciesAdelie 3706.16 38.14 97.18 <2e-16 ***
speciesChinstrap 3733.09 55.88 66.81 <2e-16 ***
speciesGentoo 5092.44 42.24 120.56 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 460.8 on 330 degrees of freedom
Multiple R-squared: 0.9885, Adjusted R-squared: 0.9884
F-statistic: 9481 on 3 and 330 DF, p-value: < 2.2e-16
Here, we can conclude that all species have a significant effect on the body mass.
Note: in that case, there is no intercept and there are \(I = 3\) degrees of freedom for the species factor.
Or we can use a contrast test (c.f. next section).
Contrast test
With the following commands, we can get the estimations (with a 95% interval) of each species effect \(\mu_i\) directly from the model \(Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\) (despite the constraint \(\alpha_1=0\)).
lsmeans(mod1, "species")
species lsmean SE df lower.CL upper.CL
Adelie 3706 38.1 330 3631 3781
Chinstrap 3733 55.9 330 3623 3843
Gentoo 5092 42.2 330 5009 5176
Confidence level used: 0.95
plot(lsmeans(mod1, "species")) + theme_bw() + coord_flip()
And we can get the result of a contrast test with the following command.
pairs(lsmeans(mod1, "species"))
contrast estimate SE df t.ratio p.value
Adelie - Chinstrap -26.9 67.7 330 -0.398 0.9164
Adelie - Gentoo -1386.3 56.9 330 -24.359 <.0001
Chinstrap - Gentoo -1359.3 70.0 330 -19.406 <.0001
P value adjustment: tukey method for comparing a family of 3 estimates
What do you think that we are doing here?
Solution
We are testing the pairwise difference of the effect between the modalities of the factor.
In particular,for each possible pair of the factor modalities \((i,i')\) (here each pair of species), we test the following:
- \(H_0\): \(\mu_i = \mu_{i'}\)
- \(H_1\): \(\mu_i \ne \mu_{i'}\)
The estimate
is giving the estimation \(\hat{\mu}_i - \hat{\mu}_{i'}\) and the corresponding standard deviation (SE
). In this case, under the null hypothesis (and assuming the normality with mean=0 and homoskedasticity assumptions on the residuals), the test statistics follows as \(\TT(n-I)\) Student distribution. The observed value for this statistics is given by t.ratio
and the corresponding \(p\)-value by p.value
.
Here, the only significant difference of effect is between the Gentoo and any of the other two species (Adelie or Chinstrap). There is no significant difference between the Adelie and Chinstrap species.
Important: again, if the Gaussian (with mean=0) and homoskedasticity assumptions about the residuals are not verified, we cannot interpret the results about the Student tests.
Perform your own analysis
We now question the potential effect of the island of origin (variable island
in the penguin
table) on the body mass.
Run an analysis on your own answering this question? (build the statistical model, verify the assumptions, interpret the results).
Solution
# boxplot
ggplot(penguins, aes(x = island, y = body_mass_g)) +
geom_boxplot(aes(fill = island)) +
theme_bw()
# linear model
<- lm(body_mass_g ~ island, data = penguins)
mod2
# residual plot
ggplot(
%>% mutate(resid = residuals(mod2)),
penguins aes(x = island, y = resid)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
theme_bw()
# residual qq-plot
qqnorm(rstandard(mod2))
abline(a=0, b=1, col = "red") # y=x line
# check normality and homoskedasticity of residuals
check_normality(mod2)
Warning: Non-normality of residuals detected (p = 0.002).
check_heteroskedasticity(mod2)
Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
# information about the model
anova(mod2)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
island 2 83740680 41870340 105.06 < 2.2e-16 ***
Residuals 330 131518986 398542
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod2)
Call:
lm(formula = body_mass_g ~ island, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1869.17 -368.90 5.83 431.10 1580.83
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4719.17 49.45 95.438 <2e-16 ***
islandDream -1000.27 75.40 -13.266 <2e-16 ***
islandTorgersen -1010.66 104.52 -9.669 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 631.3 on 330 degrees of freedom
Multiple R-squared: 0.389, Adjusted R-squared: 0.3853
F-statistic: 105.1 on 2 and 330 DF, p-value: < 2.2e-16
lsmeans(mod2, "island")
island lsmean SE df lower.CL upper.CL
Biscoe 4719 49.4 330 4622 4816
Dream 3719 56.9 330 3607 3831
Torgersen 3709 92.1 330 3527 3890
Confidence level used: 0.95
pairs(lsmeans(mod2, "island"))
contrast estimate SE df t.ratio p.value
Biscoe - Dream 1000.3 75.4 330 13.266 <.0001
Biscoe - Torgersen 1010.7 104.5 330 9.669 <.0001
Dream - Torgersen 10.4 108.3 330 0.096 0.9949
P value adjustment: tukey method for comparing a family of 3 estimates
# non-parametric test
kruskal.test(body_mass_g ~ island, data = penguins)
Kruskal-Wallis rank sum test
data: body_mass_g by island
Kruskal-Wallis chi-squared = 124.35, df = 2, p-value < 2.2e-16
Here, the boxplot suggests that the island factor has an effect on the body mass. We use an ANOVA model. However, the residuals do not verify the normality (c.f. qq-plot and normalality test) and homoskedasticity (c.f. residual plot and homoskedasticity test) assumptions, thus we cannot use the Fisher and Student tests on the significance of the factor effect and factor modality effect (respectively).
We can however interpret the geometric result about the variance decomposition. The residual sum of squares is higher than the model sum of squares, and the \(R^2\) coefficient is closer to 0 than 1, so the model does not seem to fit the data so well.
Nonetheless, to get a statistical result, since we cannot use the ANOVA model, we use a non-parametric Kruskal-Wallis test: null hypothesis that all samples (here from the 3 different islands) came from the same population (i.e. were sampled from the same distribution).
Here the Kruskal-Wallis concludes that the island of origin has an effect on the penguin body mass.
We will dispute this conclusion in the next section.
Getting started with 2-factor ANOVA
Reminder: do not forget to remove the missing values in the penguin table (in general, missing values can be handled, but here we just remove them for simplicity sake).
<- penguins %>% drop_na() penguins
Now, we are going to consider multiple factors in the same model in order to jointly estimate their effects on a response variable.
Here is the distribution of the penguin body mass depending on their species and on their island of origin:
ggplot(penguins, aes(x = island, y = body_mass_g)) +
geom_boxplot(aes(fill = species)) +
theme_bw()
What can you say about this?
Solution
In the final analysis of the previous section, we found that the island of origin had an effect on the body mass of the penguins.
However, it appears that this difference is mainly related to the species of the penguins and not their island of origin. Indeed, except for the Gentoo penguins, the distribution of the penguin body mass seems to be quite similar between the species Adelie and Chinstrap in the 3 islands.
Perform an ANOVA analysis to model the body mass depending on both the species and the island of origin of the penguin?
Hint: when using multiple factors, the formula to build the linear model in R becomes body_mass_g ~ species + island
.
Solution
# model
<- lm(body_mass_g ~ species + island, data = penguins)
mod3 # residual plot
ggplot(
%>% mutate(resid = residuals(mod3)),
penguins aes(x = interaction(species, island), y = resid)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
theme_bw()
# qq-plot
qqnorm(rstandard(mod3))
abline(a=0, b=1, col = "red") # y=x line
# check assumptions on residuals
check_normality(mod3)
OK: residuals appear as normally distributed (p = 0.077).
check_heteroskedasticity(mod3)
OK: Error variance appears to be homoscedastic (p = 0.078).
# model coefficient estimation and ANOVA table
lsmeans(mod3, "species")
species lsmean SE df lower.CL upper.CL
Adelie 3707 38.4 328 3631 3782
Chinstrap 3738 76.9 328 3587 3889
Gentoo 5089 69.9 328 4952 5227
Results are averaged over the levels of: island
Confidence level used: 0.95
lsmeans(mod3, "island")
island lsmean SE df lower.CL upper.CL
Biscoe 4181 56.0 328 4071 4291
Dream 4173 53.0 328 4069 4277
Torgersen 4180 77.9 328 4027 4333
Results are averaged over the levels of: species
Confidence level used: 0.95
anova(mod3)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
species 2 145190219 72595110 339.8328 <2e-16 ***
island 2 2064 1032 0.0048 0.9952
Residuals 328 70067383 213620
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What can you say about the ANOVA results?
Solution
The result of the ANOVA confirms our intuition from the boxplot that when accounting for the species effect, the island of origin has no effect anymore on the penguins body mass.
Since, the normality (with mean=0) and homoskedasticity assumptions on the residuals are verified (qualitatively from the residual plot and qq-plot, and quantitatively from the normality and homoskadasticity test with a risk \(\alpha=5\%\)), we can also interpret the Fisher significance test for the factors. We conclude that there is a significant effect of the species factor but not for the island factor.
Another illustration of the absence of island effect is that the estimations of the 3 modality effects for the island factor are very close.
Important: The species is called a confounding factor for the model considering the body mass depending on the island, because, when not accounted for in the model, the species factor creates an island effect when there is none.
In other words, a confounding variable is a (potentially unknown) variable that influences other variables resulting in a spurious association.
In an experiment, it is crucial to control and account for all potential sources of variability. Otherwise there is a huge risk to draw false conclusion (as we did in the model body_mass_g ~ island
).
How to avoid drawing false conclusions because of confounding factors?
Solution
Use randomization to build your experiment, like randomized controlled trial, possibly with double-blinding, to remove the potential effect of confounding variables (should be used for any serious drug trial), or at least control the potential sampling bias caused by confounding factors (like case-control studies, cohort studies, stratification).
If not possible (it is not always possible depending on the design of the experiment and/or the object of the study), measure and log various metadata regarding your subjects/individuals so that you will be able to account for the potential effect of confounding variables in your analysis.
It generally requires a certain level of technical expertise/knowledge in the considered field of study to be able to identify potential confounding factors before the experiments (so that you can monitor and log the corresponding quantities during your experiment).
More details here.
Multiple factor ANOVA and factor interaction
Two-way ANOVA provides a powerful framework to quantify the effect of 2 factors on the variability of an output variable in the presence of interactions. The framework allows the decomposition of the observed variance with respect to:
- the effect of the first and second factors
- the effect of the interaction of the two factors
- the residual part, the variance that is not explained by any factor in the model
Why would that be important to consider the effect of the interaction of the 2 factors on the response, and not just the effect of each factor?
Solution
There could be some experiments/phenomenon where the effect of one factor on the response depends on the effect of another factor, c.f. next example.
For instance, in our penguins data, we have seen that the species has an effect on the body mass:
ggplot(penguins, aes(x = species, y = body_mass_g)) +
geom_boxplot(aes(fill = species)) +
theme_bw()
We could also question the effect of the penguin sex on their body mass:
ggplot(penguins, aes(x = sex, y = body_mass_g)) +
geom_boxplot(aes(fill = sex)) +
theme_bw()
Here is the ratio of the average body mass of female and male in the different species:
# A tibble: 3 × 2
species `female/male average body mass ratio`
<fct> <dbl>
1 Adelie 0.833
2 Chinstrap 0.895
3 Gentoo 0.853
What can you say about this?
Solution
The relative difference between the body mass of males and females seems to be different for the different species of penguins.
Therefore, it suggests that not only the species and the sex both have an effect on the body mass of the penguins but also their interaction.
We will now see how to account for such an interaction in the ANOVA model with multiple factors.
(Un)orthogonal design?
Sums of Squares constitute the central ingredient to quantify the effect of each source of variability. The core property of linear models is that if the design is orthogonal, the decomposition of the variability with respect to the different sources of variability is unique. Otherwise, there is some ambiguity because the design does not allow the complete identification of the different sources of variations independently. Hence the central information that should be checked is whether the experimental design is orthogonal or not. One quick way is to check that the design is balanced (i.e. if the headcounts are the same in all crossed modalities (between the two factors).
We now study the joint influence of the sex and the species of the different penguins on the body mass of the animals.
Here is the headcounts regarding sex and species in the data:
table( penguins$sex,penguins$species)
Adelie Chinstrap Gentoo
female 73 34 58
male 73 34 61
Is the design balanced?
Solution
No, the design is not balanced, because the headcounts in each crossed categories are not the same.
The model
We consider the following model:
<- lm(body_mass_g ~ sex * species, data = penguins) mod4
Note: the notation sex * species
means that we consider both the effect of each factor and of their interaction. It is equivalent to the following model (we use :
to explicitly indicate an interaction):
<- lm(body_mass_g ~ sex + species + sex:species, data = penguins) mod4
Write the ANOVA model in case of two factors with interaction?
Solution
We denote by \(Y_{ijk}\) the observed body mass of the penguin \(k\) in the class \(i\) for the first factor and in the class \(j\) for the second factor.
Here, the different groups \(i=1,...,I\) for the first factor are the different species (\(I = 3\)), the different groups \(j=1,...,J\) for the second factor are the different sex (\(J = 2\)), and there are \(k=1,...n_{ij}\) penguins that have the species \(i\) and the sex \(j\).
The natural model for the multiple factor ANOVA is \[Y_{ijk} = \mu_{ij} + \epsilon_{ijk}\] where
- \(\mu_{ij}\) is the body mass average for species \(i\) and sex \(j\)
- \(\epsilon_{ijk}\) is the error term
However, in practice, we generally use the following model \[Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}\] where
- \(\mu\) is the global mean
- \(\alpha_i\) the principal effect for the species \(i\)
- \(\beta_j\) the principal effect for sex \(j\)
- \(\gamma_{ij}\) is the joint effect of species \(i\) and sex \(j\), quantifying the interaction between the two factors
- \(\epsilon_{ijk}\) is the error term
And we will also assume \(\epsilon_{ijk} \sim \NN(0, \sigma^2)\).
To illustrate the variations of the average body mass wrt1 sex and species :
emmip(mod4, ~ sex | species) + theme_bw()
emmip(mod4, ~ species | sex) + theme_bw()
What can you say here?
Solution
The average body mass on the animals is greater for males compared with females (whatever the species). However, the comparison of masses of the different species seems to depend on the sex of the animals. Even if Gentoo individuals are the heaviest for both sexes, in Males, Adelie are heavier than ChinStrap, whereas in Female, Chinstrap individuals are heavier than Adelie. Then the preliminary analyses illustrate the potential presence of interactions.
Accounting for interactions is mandatory when comparing the levels of each factors.
Comparing different species without considering differences due to the sex factor implies that in the comparison, we do not account for the fact that differences between species may depend on differences between males and females.
lsmeans(mod4, ~species)
species lsmean SE df lower.CL upper.CL
Adelie 3706 25.6 327 3656 3757
Chinstrap 3733 37.5 327 3659 3807
Gentoo 5082 28.4 327 5026 5138
Results are averaged over the levels of: sex
Confidence level used: 0.95
pairs(lsmeans(mod4, ~species))
contrast estimate SE df t.ratio p.value
Adelie - Chinstrap -26.9 45.4 327 -0.593 0.8241
Adelie - Gentoo -1376.1 38.2 327 -36.007 <.0001
Chinstrap - Gentoo -1349.2 47.0 327 -28.682 <.0001
Results are averaged over the levels of: sex
P value adjustment: tukey method for comparing a family of 3 estimates
Comparing both sexes without considering differences due to the species factor implies that in the comparison, we do not account for the fact that differences between sexes may depend on differences between species.
Note that the lsmeans
package proposes a warning when interaction is present.
lsmeans(mod4, ~sex)
sex lsmean SE df lower.CL upper.CL
female 3859 25.3 327 3809 3908
male 4489 25.2 327 4440 4539
Results are averaged over the levels of: species
Confidence level used: 0.95
pairs(lsmeans(mod4, ~sex))
contrast estimate SE df t.ratio p.value
female - male -631 35.7 327 -17.659 <.0001
Results are averaged over the levels of: species
The case of the unbalanced design
<- lm(body_mass_g ~ sex * species, data = penguins)
mod4 anova(mod4)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 38878897 38878897 406.145 < 2.2e-16 ***
species 2 143401584 71700792 749.016 < 2.2e-16 ***
sex:species 2 1676557 838278 8.757 0.0001973 ***
Residuals 327 31302628 95727
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
<- lm(body_mass_g ~ species * sex, data = penguins)
mod5 anova(mod5)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
species 2 145190219 72595110 758.358 < 2.2e-16 ***
sex 1 37090262 37090262 387.460 < 2.2e-16 ***
species:sex 2 1676557 838278 8.757 0.0001973 ***
Residuals 327 31302628 95727
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What can you say about the variance decomposition (sums of squares) in the two previous models?
Solution
Depending on the order of the factors in the formula, the corresponding sums of squares are different.
When the design is not orthogonal, there exists 3 main strategies to compare the different sources of variability (Type I, II, III). The most important message here is that when the interaction is strong and/or significant, main effects can not / should not be interpreted separately. So the first thing to check in any ANOVA analysis is to check if the interaction is significant
Type-I sum of squares
This is the default test that is implemented in the anova()
framework. The Type-I strategy consists in comparing models that grow in a nested way. Results of Type-I analysis depend on the order of the factors in the model, hence is not adapted to unbalanced multiple ANOVAs.
If the model is specified in the following way body_mass_g ~ species * sex
, the tested models are (in this order):
- “species effect”:
body_mass_g ~ 1
vsbody_mass_g ~ 1 + species
, i.e. is there a species effect whatever the other factors? - “sex effect”:
body_mass_g ~ 1 + species
vsbody_mass_g ~ 1 + species + sex
, i.e. does the sex effect carry much information after the species effects has been specified? is there a sex effect in the residuals when we already accounted for the species effect? - “interaction”:
body_mass_g ~ 1 + species + sex
vsbody_mass_g ~ 1 + species + sex + species:sex
, i.e. does the interaction carry much information after the species and the sex effects have been specified? is there an interaction in the residuals when we already accounted for the main effects?
anova(lm(body_mass_g ~ species * sex, data = penguins))
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
species 2 145190219 72595110 758.358 < 2.2e-16 ***
sex 1 37090262 37090262 387.460 < 2.2e-16 ***
species:sex 2 1676557 838278 8.757 0.0001973 ***
Residuals 327 31302628 95727
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
By deduction, if the model is specified in the other way: body_mass_g ~ sex * species
, what are the tested models in the following output?
anova(lm(body_mass_g~sex*species, data= penguins))
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 38878897 38878897 406.145 < 2.2e-16 ***
species 2 143401584 71700792 749.016 < 2.2e-16 ***
sex:species 2 1676557 838278 8.757 0.0001973 ***
Residuals 327 31302628 95727
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Solution
- “sex effect”:
body_mass_g ~ 1
vsbody_mass_g ~ 1 + species
, i.e. is there a species effect whatever the other factors? - “species effect”:
body_mass_g ~ 1 + species
vsbody_mass_g ~ 1 + species + sex
, i.e. does the sex effect carry much information after the species effects has been specified? is there a sex effect in the residuals when we already accounted for the species effect? - “interaction”:
body_mass_g ~ 1 + species + sex
vsbody_mass_g ~ 1 + species + sex + species:sex
, i.e. does the interaction carry much information after the species and the sex effects have been specified? is there an interaction in the residuals when we already accounted for the main effects?
What should we have verified before interpreting the Fisher test results?
Solution
We should have verified the Gaussian (with mean=0) and homoskedasticity assumptions on the residuals.
<- lm(body_mass_g ~ species * sex, data = penguins)
mod4
ggplot(
%>% mutate(resid = residuals(mod4)),
penguins aes(x = interaction(species, sex), y = resid)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
theme_bw()
check_normality(mod4)
OK: residuals appear as normally distributed (p = 0.945).
check_heteroskedasticity(mod4)
OK: Error variance appears to be homoscedastic (p = 0.629).
Factor order matters in the formula defining the model when using Type-I sum of squares.
Type-II sum of squares
The Type-II strategy consists in considering that unfortunately, there is always some kind of aliasing. This strategy is valid if the interaction is not “too strong”. Type-II Sums of Squares can be computed by using the aov()
function and then the Anova()
(with an uppercase A
) function from the car
package:
<- aov(body_mass_g~sex*species, data=penguins)
mod5 mod5
Call:
aov(formula = body_mass_g ~ sex * species, data = penguins)
Terms:
sex species sex:species Residuals
Sum of Squares 38878897 143401584 1676557 31302628
Deg. of Freedom 1 2 2 327
Residual standard error: 309.3973
Estimated effects may be unbalanced
library(car)
Anova(mod5)
Anova Table (Type II tests)
Response: body_mass_g
Sum Sq Df F value Pr(>F)
sex 37090262 1 387.460 < 2.2e-16 ***
species 143401584 2 749.016 < 2.2e-16 ***
sex:species 1676557 2 8.757 0.0001973 ***
Residuals 31302628 327
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, the testing strategy is the following
- “sex effect”:
body_mass_g ~ 1 + species
vsbody_mass_g ~ 1 + sex
, i.e. is there a sex effect or rather a species effect? - “species effect”:
body_mass_g ~ 1 + sex
vsbody_mass_g ~ 1 + species
, i.e. is there a species effect or rather a sex effect? - “interaction”:
body_mass_g ~ 1 + species + sex
vsbody_mass_g ~ 1 + species + sex + species:sex
, i.e. does the interaction carry much information after the species and the sex effects have been specified? is there an interaction in the residuals when we already accounted for the main effects ?
The idea of the Type-II strategy is that if the factors are very confounded, then the species factor carries much information about the sex factor. Hence, when confronting body_mass_g ~ 1 + species
vs body_mass_g ~ 1 + sex
, the information will be equivalent, and the null hypothesis will be accepted because of aliasing.
When the interaction is strong, the only proper way to compare the levels of the main effects is to do the comparison conditionally to the other factor, such as:
pairs(lsmeans(mod5, ~sex | species))
species = Adelie:
contrast estimate SE df t.ratio p.value
female - male -675 51.2 327 -13.174 <.0001
species = Chinstrap:
contrast estimate SE df t.ratio p.value
female - male -412 75.0 327 -5.487 <.0001
species = Gentoo:
contrast estimate SE df t.ratio p.value
female - male -805 56.7 327 -14.188 <.0001
plot(pairs(lsmeans(mod5, ~ sex | species)))
pairs(lsmeans(mod5, ~species | sex))
sex = female:
contrast estimate SE df t.ratio p.value
Adelie - Chinstrap -158 64.2 327 -2.465 0.0377
Adelie - Gentoo -1311 54.4 327 -24.088 <.0001
Chinstrap - Gentoo -1153 66.8 327 -17.246 <.0001
sex = male:
contrast estimate SE df t.ratio p.value
Adelie - Chinstrap 105 64.2 327 1.627 0.2357
Adelie - Gentoo -1441 53.7 327 -26.855 <.0001
Chinstrap - Gentoo -1546 66.2 327 -23.345 <.0001
P value adjustment: tukey method for comparing a family of 3 estimates
plot(pairs(lsmeans(mod5, ~species | sex)))
Type-III sum of squares
We will not see that one.
Coming back to the species/island model
Reminder: we considered the following model body_mass_g ~ species + island
.
Is the design balanced?
Solution
No, it is not:
table(penguins$species, penguins$island)
Biscoe Dream Torgersen
Adelie 44 55 47
Chinstrap 0 68 0
Gentoo 119 0 0
Here are the two models (with interactions) body_mass_g ~ species * island
and body_mass_g ~ island * species
<- lm(body_mass_g ~ species * island, data = penguins)
mod3a anova(mod3a)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
species 2 145190219 72595110 339.8328 <2e-16 ***
island 2 2064 1032 0.0048 0.9952
Residuals 328 70067383 213620
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
<- lm(body_mass_g ~ island * species, data = penguins)
mod3b anova(mod3b)
Analysis of Variance Table
Response: body_mass_g
Df Sum Sq Mean Sq F value Pr(>F)
island 2 83740680 41870340 196.00 < 2.2e-16 ***
species 2 61451603 30725801 143.83 < 2.2e-16 ***
Residuals 328 70067383 213620
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What can you say about these two models? Why is there no interaction term?
Solution
First, there is no interaction term because some combination species/island are not represented in the data, so it was not considered.
Second, the model body_mass_g ~ species * island
(or equivalenty2 and body_mass_g ~ species + island
) considers the following:
- “species effect”:
body_mass_g ~ 1
vsbody_mass_g ~ 1 + species
, i.e. is there a species effect whatever the other factors? - “island effect”:
body_mass_g ~ 1 + species
vsbody_mass_g ~ 1 + species + island
, i.e. does the island effect carry much information after the species effects has been specified?
whereas the model body_mass_g ~ island * species
(or equivalenty3 and body_mass_g ~ island + species
) considers the following:
- “island effect”:
body_mass_g ~ 1
vsbody_mass_g ~ 1 + island
, i.e. is there a island effect whatever the other factors? - “species effect”:
body_mass_g ~ 1 + island
vsbody_mass_g ~ 1 + island + species
, i.e. does the species effect carry much information after the island effects has been specified?
Therefore, we have to be very cautious with the order of the factors, because we could find an island effect and a species effect, instead of just a species effect (as suggested by the boxplot).
Balanced design
We will consider another dataset giving the survival time of some animals when being contaminated by different poisons and given different treatments (used to counteract the poison) (more info here).
You can load the dataset with the following command:
<- read_csv("https://gitbio.ens-lyon.fr/LBMC/hub/formations/ens_l3_stats/-/raw/main/tutorial_4_anova/poisons.csv?inline=false") poisons
Note: the
read_csv()
function is loaded intidyverse
but comes from thereadr
package.
poisons
# A tibble: 48 × 3
time poison treat
<dbl> <dbl> <chr>
1 0.31 1 A
2 0.45 1 A
3 0.46 1 A
4 0.43 1 A
5 0.36 2 A
6 0.29 2 A
7 0.4 2 A
8 0.23 2 A
9 0.22 3 A
10 0.21 3 A
# ℹ 38 more rows
Check if the design is balanced?
Solution
Yes, it is:
table(poisons$poison, poisons$treat)
A B C D
1 4 4 4 4
2 4 4 4 4
3 4 4 4 4
We consider the model time ~ poison * treat
and the model time ~ treat * poison
:
<- lm(time ~ factor(poison) * treat, data = poisons)
mod6 anova(mod6)
Analysis of Variance Table
Response: time
Df Sum Sq Mean Sq F value Pr(>F)
factor(poison) 2 1.03301 0.51651 23.2217 3.331e-07 ***
treat 3 0.92121 0.30707 13.8056 3.777e-06 ***
factor(poison):treat 6 0.25014 0.04169 1.8743 0.1123
Residuals 36 0.80072 0.02224
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod6)$r.squared # R2
[1] 0.733543
<- lm(time ~ treat * factor(poison), data = poisons)
mod7 anova(mod7)
Analysis of Variance Table
Response: time
Df Sum Sq Mean Sq F value Pr(>F)
treat 3 0.92121 0.30707 13.8056 3.777e-06 ***
factor(poison) 2 1.03301 0.51651 23.2217 3.331e-07 ***
treat:factor(poison) 6 0.25014 0.04169 1.8743 0.1123
Residuals 36 0.80073 0.02224
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod7)$r.squared # R2
[1] 0.733543
Note: we use factor(poison)
because the poison
variable is an integer and R understand it as a numeric variable in the lm()
function.
What can you say regarding the variance decomposition in this balanced case?
Solution
Now, in this balanced design setting, the sums of square are the same whatever the order of the factors in the formula.
When the design is balanced, it is orthogonal and it is possible to unambigously separe all variability sources. In that cases, the sums of squares are equivalent and we can independently interpret the different variability sources, whatever the order of the factors.