Tutorial 4: Analysis of Variance (ANOVA)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

Requirements

We first load the tidyverse meta-package that we will be needing.

# required packages
library(tidyverse)

Note: we will need the ggplot2, dplyr, readr, tidyr and tibble packages from the tidyverse meta-package. Thus, if you encounter issues when installing tidyverse, you can install only these five packages.

1-factor ANOVA

We will again consider the penguins dataset.

library(palmerpenguins)
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

We remove individuals with missing values:

penguins <- penguins %>% drop_na()
penguins
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>

We question the potential relationship between the body mass of the penguins and their species. In other words, is the body mass of the penguins generally different in different species?

How could we answer this question? Maybe with a method that we have seen in the previous tutorials? What could be its limitation?

Solution

We want to compare a continuous feature across different populations: we could use a Student test to verify whether the average body mass differs between two populations of penguins from different species.
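
For instance, a two-sample Student test restricted to two of the species could look like this (a sketch, not part of the analysis below):

# Student test comparing the body mass of Adelie and Gentoo penguins only
t.test(
    body_mass_g ~ species,
    data = penguins %>% filter(species %in% c("Adelie", "Gentoo")) %>% droplevels()
)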

However, here we have 3 species, hence 3 populations:

table(penguins$species)

   Adelie Chinstrap    Gentoo 
      146        68       119 

Are we going to consider multiple comparisons between the different possible pairs of species (Adelie vs Chinstrap, Adelie vs Gentoo, Chinstrap vs Gentoo)?

Solution

No, we are going to use a model that will consider all species together.


What could be this model that would allow us to compare the body mass of the penguins in the 3 different species at the same time?

Solution

We are going to use a linear model and an analysis of variance (a.k.a. ANOVA).


Data exploration

Before building a complex statistical model, we can first check, with a qualitative and visual study of the sampled data, whether the question of a relationship between the penguin species and their body mass is relevant.

How could we get an intuition about the possible difference of body mass between the 3 species of penguins in a very simple way? What can we say about it?

Solution

We can compute the body mass average in the sampled data for each species:

penguins %>% 
    group_by(species) %>%
    summarise(body_mass_av = mean(body_mass_g))
# A tibble: 3 × 2
  species   body_mass_av
  <fct>            <dbl>
1 Adelie           3706.
2 Chinstrap        3733.
3 Gentoo           5092.

The penguin body mass seems to be quite different on average across the different species in the sample.


Which graphical representations could we use to get more insights about the previous results?

Solution

We could draw a boxplot of the body mass depending on the species:

ggplot(penguins, aes(x = species, y = body_mass_g)) +
    geom_boxplot(aes(fill = species)) + 
    theme_bw()

We could also check the empirical distribution of the body mass in the sample depending on the species:

ggplot(penguins, aes(x = body_mass_g)) +
    geom_histogram(aes(fill = species), alpha = 0.5, position = "identity") + 
    theme_bw()

In both representations, the empirical distribution of the body mass in the sample seems to be clearly different across the 3 species, especially for the Gentoo penguins versus the Adelie and Chinstrap penguins, which seem to have similar body masses.


The ANOVA model

We are going to focus on the linear model and the analysis of variance, a.k.a. the ANOVA, to statistically quantify the apparent difference of body mass between the 3 penguin species.

What is the main difference with all the previous models and methods that we have considered up till now?

Solution

With the linear model and in the particular case of the ANOVA, we are now considering a statistical method that is directly designed to model the possible relationship between two variables. In particular, we are going to question whether the values observed for one or more factors, a.k.a. explanatory variables, may be used to explain the values observed for another variable, a.k.a. the response variable or the variable to be explained.

What type of variables are suitable for the ANOVA model?

Solution

The factors will be one or more categorical variables. The response variable will be a continuous variable.

What is the natural model that we are going to consider in the ANOVA?

Solution

We denote by \(Y_{ij}\) the observed body mass of the penguin \(j\) in the group or class \(i\). Here, the different groups \(i=1,...,I\) are the different species (\(I = 3\)), and there are \(j=1,...,n_i\) penguins in each group.

Here are the group headcounts in the sample:

penguins %>% 
    group_by(species) %>%
    summarise(n_i = n())
# A tibble: 3 × 2
  species     n_i
  <fct>     <int>
1 Adelie      146
2 Chinstrap    68
3 Gentoo      119

We denote by \(\mu_i\) the average body mass in species \(i\); the model is then: \[Y_{ij} = \mu_i + \epsilon_{ij}\] where \(\epsilon_{ij}\) is the error term or residual for the body mass of the penguin \(j\) in species \(i\).


Important: what assumptions are we going to make to complete this model and go further to use it?

Solution

We are going to assume that the residuals \(\epsilon_{ij}\) are independent and identically distributed (i.i.d.) Gaussian random variables of mean 0 and variance \(\sigma^2\), i.e. \[\epsilon_{ij} \sim \NN(0, \sigma^2)\]

Note:

  1. The residuals are assumed to be centered Gaussian variables (i.e. Gaussian variables of mean = 0)

  2. The variance of the residuals is assumed to be constant across all individuals in all groups: which is called homoskedasticity.


Linear model for the ANOVA in practice

In R, we will use the lm() function to build a linear model and the anova() function to get the ANOVA result.

We build the linear model of the body mass depending on the species:

mod1 <- lm(body_mass_g ~ species, data = penguins)
mod1

Call:
lm(formula = body_mass_g ~ species, data = penguins)

Coefficients:
     (Intercept)  speciesChinstrap     speciesGentoo  
         3706.16             26.92           1386.27  

The ANOVA model is encoded in R: <response_variable> ~ <factor>.

What can you say about this model? Does it correspond to the ANOVA model that we wrote earlier?

Solution

It seems that we do not have one coefficient per species, but only an “intercept” and one coefficient each for the Chinstrap and Gentoo species.

Note: you can use the following command to build the model \(Y_{ij} = \mu_i + \epsilon_{ij}\) (for the body mass of penguin \(j\) in species \(i\)) that we defined earlier:

mod0 <- lm(body_mass_g ~ 0 + species, data = penguins)
mod0

Call:
lm(formula = body_mass_g ~ 0 + species, data = penguins)

Coefficients:
   speciesAdelie  speciesChinstrap     speciesGentoo  
            3706              3733              5092  

or equivalently:

mod0 <- lm(body_mass_g ~ -1 + species, data = penguins)
mod0

Call:
lm(formula = body_mass_g ~ -1 + species, data = penguins)

Coefficients:
   speciesAdelie  speciesChinstrap     speciesGentoo  
            3706              3733              5092  

We can check that the \(\mu_i\) coefficients are indeed the species body mass empirical averages:

penguins %>% 
    group_by(species) %>%
    summarise(body_mass_av = mean(body_mass_g))
# A tibble: 3 × 2
  species   body_mass_av
  <fct>            <dbl>
1 Adelie           3706.
2 Chinstrap        3733.
3 Gentoo           5092.

What ANOVA model is used in R?

Solution

R is using the following model: \[Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\] where \(\mu\) is a reference or baseline level (a.k.a. the “intercept”) and \(\alpha_i\) is the effect of species \(i\).

Intuitively, \(\alpha_i\) is defined such that \(\alpha_i = \mu_i - \mu\) where \(\mu_i\) is the average body mass in species \(i\) that we defined earlier in the natural model for the ANOVA.
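
We can check this relationship on the fitted models (a quick sketch using mod0 and mod1 defined above):

# the mu_i are the coefficients of mod0 (model without intercept)
coef(mod0)
# mod1 gives the intercept and the alpha_i
coef(mod1)
# e.g. intercept + speciesGentoo coefficient should equal the Gentoo group mean
coef(mod1)["(Intercept)"] + coef(mod1)["speciesGentoo"]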


Why is there no \(\alpha_i\) coefficient for the species Adelie in the R output?

mod1

Call:
lm(formula = body_mass_g ~ species, data = penguins)

Coefficients:
     (Intercept)  speciesChinstrap     speciesGentoo  
         3706.16             26.92           1386.27  
Solution

For the model to be identifiable (i.e. for a unique solution to exist), we need to put constraints on the \(\alpha_i\) coefficients, either \(\sum_{i=1}^I \alpha_i = 0\), or set \(\alpha_1 = 0\) (or \(\alpha_I = 0\)) for a specific group/class (which becomes the reference group/class).

The lm() function is using the second option: \(\alpha_1 = 0\) for the Adelie species.
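
As a side note, the first constraint (\(\sum_{i=1}^I \alpha_i = 0\)) can also be requested in R by changing the contrasts; a minimal sketch:

# same ANOVA model fitted with the sum-to-zero constraint on the alpha_i
mod1_sum <- lm(body_mass_g ~ species, data = penguins,
               contrasts = list(species = contr.sum))
coef(mod1_sum) # the intercept is now the (unweighted) mean of the group means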


Note: we can use the lsmeans() function from the emmeans package to compute the \(\mu_i\) of the model \(Y_{ij} = \mu_i + \epsilon_{ij}\) directly from the ANOVA model \(Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\).

library(emmeans)
lsmeans(mod1, "species")
 species   lsmean   SE  df lower.CL upper.CL
 Adelie      3706 38.1 330     3631     3781
 Chinstrap   3733 55.9 330     3623     3843
 Gentoo      5092 42.2 330     5009     5176

Confidence level used: 0.95 

It also gives the estimated standard error (SE) of each estimated group mean, and a 95% confidence interval for each of them.


Residuals

To go further with the ANOVA model and statistically quantify the effect of the species on the body mass, we need to verify that the ANOVA model is relevant regarding our data.

What assumptions did we make about the residuals?

Solution

We assume the residuals to be centered Gaussian variables (i.e. with mean = 0) with constant variance (homoskedasticity).


Here is a command to extract the residuals of the model:

resid1 <- residuals(mod1)

And we draw the graph of the observed residuals (or empirical residuals) depending on the group (here the species):

ggplot(
    penguins %>% mutate(resid = residuals(mod1)),
    aes(x = species, y = resid)) +
    geom_point() + 
    geom_hline(yintercept = 0, color = "red") +
    theme_bw()

What can you say about this representation?

Solution

The observed residuals seem to be centered around 0, and their variance is approximately constant over the different groups/classes, which are good clues that some of the assumptions required regarding the residuals are satisfied here.


How can we check the previous assumptions about the residuals more thoroughly?

Solution

We have to check that the empirical residuals are approximately Gaussian, with mean = 0 and constant variance.

The residuals plot was a first step. We can check the values of the empirical mean and empirical variance of the observed residuals globally and within each species.

mean(resid1)
[1] 7.064346e-15
var(resid1)
[1] 211052.6
penguins %>% mutate(resid = residuals(mod1)) %>%
    group_by(species) %>%
    summarise(resid_mean = mean(resid), var_resid = var(resid))
# A tibble: 3 × 3
  species   resid_mean var_resid
  <fct>          <dbl>     <dbl>
1 Adelie     -6.61e-15   210332.
2 Chinstrap   1.27e-14   147713.
3 Gentoo      2.06e-14   251478.

The residual mean is approximately 0. The variance seems to vary across the different groups (it is approximately \(200000 \pm 50000\)); however, we will see below that these differences are not statistically significant (we cannot reject the homoskedasticity hypothesis at a 5% risk).

Then, we can verify the Gaussian assumption with a qq-plot. Here we compare the empirical quantiles of the observed standardized residuals (obtained with the rstandard() function) to the theoretical quantiles of the \(\NN(0,1)\) standard Gaussian distribution:

qqnorm(rstandard(mod1))
abline(a=0, b=1, col = "red") # y=x line

We can say here that the residuals are approximately Gaussian.
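
For information, an equivalent qq-plot can be drawn with ggplot2 (a sketch; note that stat_qq_line() draws a reference line through the quartiles rather than exactly the y = x line):

# qq-plot of the standardized residuals with ggplot2
ggplot(data.frame(std_resid = rstandard(mod1)), aes(sample = std_resid)) +
    stat_qq() +
    stat_qq_line(color = "red") +
    theme_bw()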


What test could we use to verify the assumptions on the residuals?

Solution

To verify the Gaussian assumption on the residuals, we can use a Kolmogorov-Smirnov test (comparing the distribution of the observed standardized residuals to the standard Gaussian distribution) or the Shapiro-Wilk test of normality.

ks.test(rstandard(mod1), pnorm)

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  rstandard(mod1)
D = 0.041618, p-value = 0.6113
alternative hypothesis: two-sided

We cannot reject the hypothesis that the standardized residuals are drawn from the \(\NN(0,1)\) standard Gaussian distribution at a risk \(\alpha=5\%\) or lower.

shapiro.test(residuals(mod1))

    Shapiro-Wilk normality test

data:  residuals(mod1)
W = 0.9922, p-value = 0.07835

We cannot reject the hypothesis that the residuals are Gaussian at a risk \(\alpha=5\%\) or lower.

The performance package also provides functions to statistically test the normality and homoskedasticity of the residuals of a linear model in R:

  • the check_normality() function uses a Shapiro-Wilk test of normality (\(H_0\): “the data are Gaussian”)
  • the check_heteroskedasticity() function uses a Breusch-Pagan test (\(H_0\): “the variance is constant”)

library(performance)
# check normality
check_normality(mod1)
OK: residuals appear as normally distributed (p = 0.079).
# check homoskedasticity
check_heteroskedasticity(mod1)
OK: Error variance appears to be homoscedastic (p = 0.079).

So here, we conclude that the residuals are approximately Gaussian, centered and of constant variance.

As a consequence, we will be able to test the significance of the model and of the coefficients.

Without validating the Gaussian (with mean = 0) and homoskedasticity assumptions, we cannot use what follows in the next sections (even though R is still able to compute these results).


Interpreting the ANOVA results

We have the following model, and we validated the Gaussian (with mean=0) and homoskedasticity assumptions about the residuals.

mod1 <- lm(body_mass_g ~ species, data = penguins)
mod1

Call:
lm(formula = body_mass_g ~ species, data = penguins)

Coefficients:
     (Intercept)  speciesChinstrap     speciesGentoo  
         3706.16             26.92           1386.27  

Geometric goodness-of-fit

We can get information about the ANOVA results with the anova() function:

anova(mod1)
Analysis of Variance Table

Response: body_mass_g
           Df    Sum Sq  Mean Sq F value    Pr(>F)    
species     2 145190219 72595110  341.89 < 2.2e-16 ***
Residuals 330  70069447   212332                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Use the various sums of squares (Sum Sq) to assess whether the species factor has an effect on the body mass.

Solution

We can compute the ratio \(\frac{MSS}{RSS}\) where \(MSS\) quantifies the variance explained by the model (Model Sum of Squares) and \(RSS\) quantifies the residual variance (Residual Sum of Squares).

Here, we have (from the anova() function output) \(\frac{MSS}{RSS} = \frac{145190219}{70069447} =\) 2.0720903, hence \(MSS > RSS\) and we can say that the species factor seems to have an effect on the body mass.


Then compute the \(R^2\) (R square) coefficient of the model?

Solution

The \(R^2\) coefficient is \(R^2 = \frac{MSS}{TSS}\) where \(TSS\) quantifies the total variance (Total Sum of Squares). From the Pythagorean theorem, we have \(TSS = MSS + RSS\), hence we also have \(R^2 = 1 - \frac{RSS}{TSS}\).

The \(R^2\) coefficient is also the squared correlation coefficient between the observed values for the output variable \(y_{ij}\) and the corresponding values predicted by the model \(\hat{y}_{ij} = \hat{y}_{i\cdot} = \hat{\mu} + \hat{\alpha}_i = \hat{\mu}_i\) (the group-wise empirical mean), where \(\hat{\mu}\) and \(\hat{\alpha}_i\) are the estimations of the corresponding coefficients.

Here, we have (from the anova() function output) \(R^2 = \frac{MSS}{MSS + RSS} = \frac{145190219}{145190219+70069447} =\) 0.6744887.
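
These quantities can also be recomputed directly in R (a quick sketch based on the anova() table and the fitted values):

# recompute MSS, RSS and R^2 from the ANOVA table
ss  <- anova(mod1)[["Sum Sq"]]
mss <- ss[1]       # model (species) sum of squares
rss <- ss[2]       # residual sum of squares
mss / rss          # ratio MSS / RSS
mss / (mss + rss)  # R^2
# R^2 as the squared correlation between observed and fitted values
cor(penguins$body_mass_g, fitted(mod1))^2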


What can you say about the model from the \(R^2\) value here?

Solution

The \(R^2\) coefficient is closer to 1 than 0, which again suggests an effect of the species on the body mass.


The variance decomposition (or sum of squares decomposition) and \(R^2\) coefficient are geometric quantifications of the model goodness-of-fit (that are not based on the Gaussian nor the homoskedasticity assumptions).

Factor effect

Now, we are going to question and quantify the statistical significance of the model.

Other information is given by the anova() function, in particular, for the considered factor, an \(F\)-value (Fisher value, F value) and the corresponding \(p\)-value (Pr(>F)) with a code indicating the level of significance.

anova(mod1)
Analysis of Variance Table

Response: body_mass_g
           Df    Sum Sq  Mean Sq F value    Pr(>F)    
species     2 145190219 72595110  341.89 < 2.2e-16 ***
Residuals 330  70069447   212332                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What can you say about this information?

Solution

The \(F\)-value is the observed value for the Fisher statistic related to the following test:

  • \(H_0\): \(\alpha_1 = ... = \alpha_i = ... = \alpha_I = 0\)
  • \(H_1\): there exists at least one \(i\) such that \(\alpha_i \ne 0\)

Here, we question the global effect of the factor (here the species) on the response variable (here the body mass).

The output also gives us the degrees of freedom (Df) for the factor, i.e. the number of groups minus 1 (\(I - 1\)), and the degrees of freedom for the residuals, i.e. the number of observations minus the number of coefficients in the model (here \(n - I\)).

The corresponding Fisher statistic follows a \(\FF(I-1, n-I)\) Fisher distribution under the null hypothesis (and under the Gaussian with mean=0 and homoskedasticity assumptions about the residuals).

This test does not give any insight about which modality of the considered factor has an actual effect on the response variable.
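
As a quick numerical check (a sketch), the reported \(p\)-value corresponds to the upper tail of the \(\FF(2, 330)\) distribution at the observed \(F\)-value:

# p-value of the Fisher test: P(F(2, 330) > 341.89), a very small value
pf(341.89, df1 = 2, df2 = 330, lower.tail = FALSE)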


What is the conclusion of the Fisher test here regarding the effect of the species on the body mass?

Solution

Here, the \(p\)-value is very small and we can reject the null hypothesis with a risk \(\alpha = 5\%\). There is a significant effect of the species on the body mass.


Important: we can call the anova() function for any linear model with factors, but we need to first check that the Gaussian (with mean=0) and homoskedasticity assumptions about the residuals are verified to be able to interpret the results about the Fisher test. In case of non Gaussian or heteroskedastic residuals, it makes no sense to interpret this output.

How is the Mean Sq column computed in the anova() function output?

Solution

It is the sum of squares (Sum Sq) divided by the number of degrees of freedom.
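
For instance, from the ANOVA table above (a quick check of the arithmetic):

# Mean Sq = Sum Sq / Df
145190219 / 2    # 72595110  (species)
70069447 / 330   # ~212332   (residuals)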


Factor modality effect

Now, with the following function, we can get additional information about our model:

summary(mod1)

Call:
lm(formula = body_mass_g ~ species, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1142.44  -342.44   -33.09   307.56  1207.56 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3706.16      38.14  97.184   <2e-16 ***
speciesChinstrap    26.92      67.65   0.398    0.691    
speciesGentoo     1386.27      56.91  24.359   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 460.8 on 330 degrees of freedom
Multiple R-squared:  0.6745,    Adjusted R-squared:  0.6725 
F-statistic: 341.9 on 2 and 330 DF,  p-value: < 2.2e-16

What can you say about this output?

Solution

We have a lot of information here:

  • a reminder of the model
  • information about the distribution of the residuals
  • a table with the coefficient estimations and related information (c.f. below)
  • the standard deviation of the residuals and the corresponding degrees of freedom
  • Multiple R-squared gives the \(R^2\) coefficient (check that the value is the same as the one we computed). The Adjusted R-squared is a modified version of the \(R^2\) that has been adjusted for the number of coefficients in the model (we will see this one later).
  • a Fisher statistic value (F-statistic) with the corresponding degrees of freedom and \(p\)-value (we will explain this one later).

Reminder: we are considering the model \(Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\)

The table with the coefficient estimations gives:

  • the estimation \(\hat{\mu}\) of the \(\mu\) coefficient (Intercept)
  • the estimation \(\hat{\alpha}_i\) of the \(\alpha_i\) coefficient for each species \(i\)
  • the corresponding estimated standard deviation of the coefficient Std. Error
  • the corresponding value of a Student statistic (t value) and a corresponding \(p\)-value (Pr(>|t|))

Why is there no coefficient for the Adelie species?

Solution

Reminder: we need a constraint on the \(\alpha_i\) for the model to be identifiable, and R uses the constraint \(\alpha_1 = 0\), which corresponds to the Adelie species coefficient. The Adelie species is the reference group/class.


What is this Student \(T\) statistic? To which test is it related?

Solution

For each coefficient, including the intercept \(\mu\) and the \(\alpha_i\) for each species (except the Adelie species for which the coefficient \(\alpha_1\) is forced to 0), we can run the following test:

  • \(H_0\): the coefficient is null
  • \(H_1\): the coefficient is not null

e.g. for a given species \(i\), we can test \(H_0\): \(\alpha_i = 0\) versus \(H_1\): \(\alpha_i \ne 0\).

Under the null hypothesis (and assuming that the residuals are Gaussian and homoskedastic), the test statistic follows a \(\TT(n-I)\) Student distribution.

Here, the corresponding \(p\)-value is also computed and the significance level is encoded.
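
For instance, the t value and \(p\)-value of the speciesChinstrap coefficient can be recovered by hand (a sketch based on the estimates printed above):

# t value = estimate / standard error, compared to a Student T(n - I) distribution
t_obs <- 26.92 / 67.65
t_obs                                             # ~0.398
2 * pt(abs(t_obs), df = 330, lower.tail = FALSE)  # two-sided p-value, ~0.69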


Important: again, if the Gaussian (with mean=0) and homoskedasticity assumptions about the residuals are not verified, we cannot interpret the results about the Student tests.

What do you conclude here?

Solution

From this output, the Gentoo coefficient is significantly different from 0 while the Chinstrap coefficient is not. But there is a limitation here: these coefficients quantify differences with respect to the reference group (Adelie), so in practice we should use other tools to question the effect of each modality of the factor (c.f. the following questions).


What about the particular effect of the Adelie species?

Solution

The output of the summary() function for the model \(Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\) is not very useful to question which modality of the factor has an effect on the response, because the coefficient \(\alpha_1\) is set to 0 by the identifiability constraint.

Therefore, this output cannot be used to assess the effect of every modality of the factor, here every species, individually.

A possible workaround is to use the following model to estimate the effect of all classes of the factor (without considering an intercept):

mod0 <- lm(body_mass_g ~ 0 + species, data = penguins)
check_normality(mod0)
OK: residuals appear as normally distributed (p = 0.079).
check_heteroscedasticity(mod0)
OK: Error variance appears to be homoscedastic (p = 0.079).
anova(mod0)
Analysis of Variance Table

Response: body_mass_g
           Df     Sum Sq    Mean Sq F value    Pr(>F)    
species     3 6039066803 2013022268  9480.6 < 2.2e-16 ***
Residuals 330   70069447     212332                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod0)

Call:
lm(formula = body_mass_g ~ 0 + species, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1142.44  -342.44   -33.09   307.56  1207.56 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
speciesAdelie     3706.16      38.14   97.18   <2e-16 ***
speciesChinstrap  3733.09      55.88   66.81   <2e-16 ***
speciesGentoo     5092.44      42.24  120.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 460.8 on 330 degrees of freedom
Multiple R-squared:  0.9885,    Adjusted R-squared:  0.9884 
F-statistic:  9481 on 3 and 330 DF,  p-value: < 2.2e-16

Here, we can conclude that all species have a significant effect on the body mass.

Note: in that case, there is no intercept and there are \(I = 3\) degrees of freedom for the species factor.

Or we can use a contrast test (c.f. next section).


Contrast test

With the following commands, we can get the estimations (with a 95% interval) of each species effect \(\mu_i\) directly from the model \(Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\) (despite the constraint \(\alpha_1=0\)).

lsmeans(mod1, "species")
 species   lsmean   SE  df lower.CL upper.CL
 Adelie      3706 38.1 330     3631     3781
 Chinstrap   3733 55.9 330     3623     3843
 Gentoo      5092 42.2 330     5009     5176

Confidence level used: 0.95 
plot(lsmeans(mod1, "species")) + theme_bw() + coord_flip()

And we can get the result of a contrast test with the following command.

pairs(lsmeans(mod1, "species"))
 contrast           estimate   SE  df t.ratio p.value
 Adelie - Chinstrap    -26.9 67.7 330  -0.398  0.9164
 Adelie - Gentoo     -1386.3 56.9 330 -24.359  <.0001
 Chinstrap - Gentoo  -1359.3 70.0 330 -19.406  <.0001

P value adjustment: tukey method for comparing a family of 3 estimates 

What do you think that we are doing here?

Solution

We are testing the pairwise difference of the effect between the modalities of the factor.

In particular, for each possible pair of factor modalities \((i,i')\) (here each pair of species), we test the following:

  • \(H_0\): \(\mu_i = \mu_{i'}\)
  • \(H_1\): \(\mu_i \ne \mu_{i'}\)

The estimate column gives the estimation \(\hat{\mu}_i - \hat{\mu}_{i'}\) and the corresponding standard error (SE). In this case, under the null hypothesis (and assuming the normality with mean=0 and homoskedasticity assumptions on the residuals), the test statistic follows a \(\TT(n-I)\) Student distribution. The observed value of this statistic is given by t.ratio and the corresponding \(p\)-value by p.value.

Here, the only significant differences of effect are between the Gentoo species and each of the other two species (Adelie and Chinstrap). There is no significant difference between the Adelie and Chinstrap species.
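
A rough cross-check can be obtained with base R (a sketch; the p-values are not identical to the Tukey-adjusted contrasts above since a Bonferroni correction is used here):

# pairwise t-tests with a pooled standard deviation and Bonferroni adjustment
pairwise.t.test(penguins$body_mass_g, penguins$species,
                p.adjust.method = "bonferroni")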


Important: again, if the Gaussian (with mean=0) and homoskedasticity assumptions about the residuals are not verified, we cannot interpret the results about the Student tests.

Perform your own analysis

We now question the potential effect of the island of origin (variable island in the penguin table) on the body mass.

Run an analysis on your own to answer this question (build the statistical model, verify the assumptions, interpret the results).

Solution

# boxplot
ggplot(penguins, aes(x = island, y = body_mass_g)) +
    geom_boxplot(aes(fill = island)) + 
    theme_bw()

# linear model
mod2 <- lm(body_mass_g ~ island, data = penguins)

# residual plot
ggplot(
    penguins %>% mutate(resid = residuals(mod2)),
    aes(x = island, y = resid)) +
    geom_point() + 
    geom_hline(yintercept = 0, color = "red") +
    theme_bw()

# residual qq-plot
qqnorm(rstandard(mod2))
abline(a=0, b=1, col = "red") # y=x line

# check normality and homoskedasticity of residuals
check_normality(mod2)
Warning: Non-normality of residuals detected (p = 0.002).
check_heteroskedasticity(mod2)
Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
# information about the model
anova(mod2)
Analysis of Variance Table

Response: body_mass_g
           Df    Sum Sq  Mean Sq F value    Pr(>F)    
island      2  83740680 41870340  105.06 < 2.2e-16 ***
Residuals 330 131518986   398542                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod2)

Call:
lm(formula = body_mass_g ~ island, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1869.17  -368.90     5.83   431.10  1580.83 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4719.17      49.45  95.438   <2e-16 ***
islandDream     -1000.27      75.40 -13.266   <2e-16 ***
islandTorgersen -1010.66     104.52  -9.669   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 631.3 on 330 degrees of freedom
Multiple R-squared:  0.389, Adjusted R-squared:  0.3853 
F-statistic: 105.1 on 2 and 330 DF,  p-value: < 2.2e-16
lsmeans(mod2, "island")
 island    lsmean   SE  df lower.CL upper.CL
 Biscoe      4719 49.4 330     4622     4816
 Dream       3719 56.9 330     3607     3831
 Torgersen   3709 92.1 330     3527     3890

Confidence level used: 0.95 
pairs(lsmeans(mod2, "island"))
 contrast           estimate    SE  df t.ratio p.value
 Biscoe - Dream       1000.3  75.4 330  13.266  <.0001
 Biscoe - Torgersen   1010.7 104.5 330   9.669  <.0001
 Dream - Torgersen      10.4 108.3 330   0.096  0.9949

P value adjustment: tukey method for comparing a family of 3 estimates 
# non-parametric test
kruskal.test(body_mass_g ~ island, data = penguins)

    Kruskal-Wallis rank sum test

data:  body_mass_g by island
Kruskal-Wallis chi-squared = 124.35, df = 2, p-value < 2.2e-16

Here, the boxplot suggests that the island factor has an effect on the body mass. We use an ANOVA model. However, the residuals do not verify the normality (c.f. qq-plot and normality test) and homoskedasticity (c.f. residual plot and homoskedasticity test) assumptions, thus we cannot use the Fisher and Student tests on the significance of the factor effect and factor modality effect (respectively).

We can however interpret the geometric result about the variance decomposition. The residual sum of squares is higher than the model sum of squares, and the \(R^2\) coefficient is closer to 0 than to 1, so the model does not seem to fit the data very well.

Nonetheless, to get a statistical result, since we cannot use the ANOVA tests, we use a non-parametric Kruskal-Wallis test, whose null hypothesis is that all samples (here from the 3 different islands) come from the same population (i.e. were sampled from the same distribution).

Here the Kruskal-Wallis test concludes that the island of origin has an effect on the penguin body mass.

We will dispute this conclusion in the next section.
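
Side note: a significant Kruskal-Wallis test is sometimes followed by non-parametric pairwise comparisons; a possible sketch (not part of the original analysis):

# pairwise Wilcoxon rank-sum tests with a Benjamini-Hochberg adjustment
pairwise.wilcox.test(penguins$body_mass_g, penguins$island,
                     p.adjust.method = "BH")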


Getting started with 2-factor ANOVA

Reminder: do not forget to remove the missing values in the penguin table (in general, missing values can be handled, but here we just remove them for simplicity's sake).

penguins <- penguins %>% drop_na()

Now, we are going to consider multiple factors in the same model in order to jointly estimate their effects on a response variable.

Here is the distribution of the penguin body mass depending on their species and on their island of origin:

ggplot(penguins, aes(x = island, y = body_mass_g)) +
    geom_boxplot(aes(fill = species)) + 
    theme_bw()

What can you say about this?

Solution

In the final analysis of the previous section, we found that the island of origin had an effect on the body mass of the penguins.

However, it appears that this difference is mainly related to the species of the penguins and not their island of origin. Indeed, except for the Gentoo penguins, the distribution of the penguin body mass seems to be quite similar between the Adelie and Chinstrap species across the 3 islands.


Perform an ANOVA analysis to model the body mass depending on both the species and the island of origin of the penguins.

Hint: when using multiple factors, the formula to build the linear model in R becomes body_mass_g ~ species + island.

Solution

# model
mod3 <- lm(body_mass_g ~ species + island, data = penguins)
# residual plot
ggplot(
    penguins %>% mutate(resid = residuals(mod3)),
    aes(x = interaction(species, island), y = resid)) +
    geom_point() + 
    geom_hline(yintercept = 0, color = "red") +
    theme_bw()

# qq-plot
qqnorm(rstandard(mod3))
abline(a=0, b=1, col = "red") # y=x line

# check assumptions on residuals
check_normality(mod3)
OK: residuals appear as normally distributed (p = 0.077).
check_heteroskedasticity(mod3)
OK: Error variance appears to be homoscedastic (p = 0.078).
# model coefficient estimation and ANOVA table
lsmeans(mod3, "species")
 species   lsmean   SE  df lower.CL upper.CL
 Adelie      3707 38.4 328     3631     3782
 Chinstrap   3738 76.9 328     3587     3889
 Gentoo      5089 69.9 328     4952     5227

Results are averaged over the levels of: island 
Confidence level used: 0.95 
lsmeans(mod3, "island")
 island    lsmean   SE  df lower.CL upper.CL
 Biscoe      4181 56.0 328     4071     4291
 Dream       4173 53.0 328     4069     4277
 Torgersen   4180 77.9 328     4027     4333

Results are averaged over the levels of: species 
Confidence level used: 0.95 
anova(mod3)
Analysis of Variance Table

Response: body_mass_g
           Df    Sum Sq  Mean Sq  F value Pr(>F)    
species     2 145190219 72595110 339.8328 <2e-16 ***
island      2      2064     1032   0.0048 0.9952    
Residuals 328  70067383   213620                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What can you say about the ANOVA results?

Solution

The result of the ANOVA confirms our intuition from the boxplot: when accounting for the species effect, the island of origin no longer has an effect on the penguin body mass.

Since the normality (with mean=0) and homoskedasticity assumptions on the residuals are verified (qualitatively from the residual plot and qq-plot, and quantitatively from the normality and homoskedasticity tests with a risk \(\alpha=5\%\)), we can also interpret the Fisher significance tests for the factors. We conclude that there is a significant effect of the species factor but not of the island factor.

Another illustration of the absence of island effect is that the estimations of the 3 modality effects for the island factor are very close.


Important: The species is called a confounding factor for the model considering the body mass depending on the island because, when not accounted for in the model, the species factor creates an apparent island effect when there is none.

In other words, a confounding variable is a (potentially unknown) variable that influences other variables resulting in a spurious association.

In an experiment, it is crucial to control and account for all potential sources of variability. Otherwise, there is a high risk of drawing false conclusions (as we did with the model body_mass_g ~ island).

How to avoid drawing false conclusions because of confounding factors?

Solution

  • Use randomization to design your experiment, like a randomized controlled trial, possibly with double-blinding, to remove the potential effect of confounding variables (this should be used for any serious drug trial), or at least control the potential sampling bias caused by confounding factors (e.g. case-control studies, cohort studies, stratification).

  • If not possible (it is not always possible depending on the design of the experiment and/or the object of the study), measure and log various metadata regarding your subjects/individuals so that you will be able to account for the potential effect of confounding variables in your analysis.

It generally requires a certain level of technical expertise/knowledge in the considered field of study to be able to identify potential confounding factors before the experiments (so that you can monitor and log the corresponding quantities during your experiment).

More details here.


Multiple factor ANOVA and factor interaction

Two-way ANOVA provides a powerful framework to quantify the effect of 2 factors on the variability of an output variable in the presence of interactions. The framework allows the decomposition of the observed variance with respect to:

  • the effect of the first and second factors
  • the effect of the interaction of the two factors
  • the residual part, the variance that is not explained by any factor in the model

Why would it be important to consider the effect of the interaction between the 2 factors on the response, and not just the effect of each factor?

Solution

There could be some experiments/phenomena where the effect of one factor on the response depends on the level of another factor, c.f. the next example.


For instance, in our penguins data, we have seen that the species has an effect on the body mass:

ggplot(penguins, aes(x = species, y = body_mass_g)) +
    geom_boxplot(aes(fill = species)) + 
    theme_bw()

We could also question the effect of the penguin sex on their body mass:

ggplot(penguins, aes(x = sex, y = body_mass_g)) +
    geom_boxplot(aes(fill = sex)) + 
    theme_bw()

Here is the ratio of the average body mass of females to males in the different species:

# A tibble: 3 × 2
  species   `female/male average body mass ratio`
  <fct>                                     <dbl>
1 Adelie                                    0.833
2 Chinstrap                                 0.895
3 Gentoo                                    0.853
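
Such a table can be reproduced with dplyr and tidyr (a sketch; the exact code used to produce the table above is not shown in the tutorial):

# average body mass per species and sex, then the female/male ratio
penguins %>%
    group_by(species, sex) %>%
    summarise(mean_mass = mean(body_mass_g), .groups = "drop") %>%
    pivot_wider(names_from = sex, values_from = mean_mass) %>%
    mutate(`female/male average body mass ratio` = female / male) %>%
    select(species, `female/male average body mass ratio`)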

What can you say about this?

Solution

The relative difference between the body mass of males and females seems to be different for the different species of penguins.

Therefore, it suggests that not only do the species and the sex both have an effect on the body mass of the penguins, but so does their interaction.

We will now see how to account for such an interaction in the ANOVA model with multiple factors.


(Un)orthogonal design?

Sums of squares constitute the central ingredient to quantify the effect of each source of variability. A core property of linear models is that, if the design is orthogonal, the decomposition of the variability with respect to the different sources of variability is unique. Otherwise, there is some ambiguity because the design does not allow the different sources of variation to be identified independently. Hence, the central piece of information that should be checked is whether the experimental design is orthogonal or not. One quick way is to check whether the design is balanced, i.e. whether the headcounts are the same in all crossed modalities of the two factors.

We now study the joint influence of the sex and the species of the different penguins on the body mass of the animals.

Here are the headcounts regarding sex and species in the data:

table(penguins$sex, penguins$species)
        
         Adelie Chinstrap Gentoo
  female     73        34     58
  male       73        34     61

Is the design balanced?

Solution

No, the design is not balanced, because the headcounts in the crossed categories are not all the same.
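
A quick programmatic check (a sketch): the design is balanced only if all crossed headcounts are equal.

# TRUE only if all sex/species headcounts are identical (FALSE here)
length(unique(as.vector(table(penguins$sex, penguins$species)))) == 1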


The model

We consider the following model:

mod4 <- lm(body_mass_g ~ sex * species, data = penguins)

Note: the notation sex * species means that we consider both the effect of each factor and of their interaction. It is equivalent to the following model (we use : to explicitly indicate an interaction):

mod4 <- lm(body_mass_g ~ sex + species + sex:species, data = penguins)

Write the ANOVA model in case of two factors with interaction?

Solution

We denote by \(Y_{ijk}\) the observed body mass of the penguin \(k\) in the class \(i\) for the first factor and in the class \(j\) for the second factor.

Here, the different groups \(i=1,...,I\) for the first factor are the different species (\(I = 3\)), the different groups \(j=1,...,J\) for the second factor are the different sexes (\(J = 2\)), and there are \(k=1,...,n_{ij}\) penguins of species \(i\) and sex \(j\).

The natural model for the multiple factor ANOVA is \[Y_{ijk} = \mu_{ij} + \epsilon_{ijk}\] where

  • \(\mu_{ij}\) is the body mass average for species \(i\) and sex \(j\)
  • \(\epsilon_{ijk}\) is the error term

However, in practice, we generally use the following model \[Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}\] where

  • \(\mu\) is the global mean
  • \(\alpha_i\) the principal effect for the species \(i\)
  • \(\beta_j\) the principal effect for sex \(j\)
  • \(\gamma_{ij}\) is the joint effect of species \(i\) and sex \(j\), quantifying the interaction between the two factors
  • \(\epsilon_{ijk}\) is the error term

And we will also assume \(\epsilon_{ijk} \sim \NN(0, \sigma^2)\).


To illustrate the variations of the average body mass wrt1 sex and species:

emmip(mod4, ~ sex | species) + theme_bw()

emmip(mod4, ~ species | sex) + theme_bw()

What can you say here?

Solution

The average body mass of the animals is greater for males than for females (whatever the species). However, the comparison of the masses of the different species seems to depend on the sex of the animals: even if Gentoo individuals are the heaviest for both sexes, in males, Adelie are heavier than Chinstrap, whereas in females, Chinstrap individuals are heavier than Adelie. Thus, these preliminary analyses illustrate the potential presence of an interaction.


Accounting for interactions is mandatory when comparing the levels of each factor.

Comparing different species without considering differences due to the sex factor implies that in the comparison, we do not account for the fact that differences between species may depend on differences between males and females.

lsmeans(mod4, ~species)
 species   lsmean   SE  df lower.CL upper.CL
 Adelie      3706 25.6 327     3656     3757
 Chinstrap   3733 37.5 327     3659     3807
 Gentoo      5082 28.4 327     5026     5138

Results are averaged over the levels of: sex 
Confidence level used: 0.95 
pairs(lsmeans(mod4, ~species))
 contrast           estimate   SE  df t.ratio p.value
 Adelie - Chinstrap    -26.9 45.4 327  -0.593  0.8241
 Adelie - Gentoo     -1376.1 38.2 327 -36.007  <.0001
 Chinstrap - Gentoo  -1349.2 47.0 327 -28.682  <.0001

Results are averaged over the levels of: sex 
P value adjustment: tukey method for comparing a family of 3 estimates 

Comparing both sexes without considering differences due to the species factor implies that in the comparison, we do not account for the fact that differences between sexes may depend on differences between species.

Note that the lsmeans() function warns when results are averaged over the levels of a factor involved in an interaction.

lsmeans(mod4, ~sex)
 sex    lsmean   SE  df lower.CL upper.CL
 female   3859 25.3 327     3809     3908
 male     4489 25.2 327     4440     4539

Results are averaged over the levels of: species 
Confidence level used: 0.95 
pairs(lsmeans(mod4, ~sex))
 contrast      estimate   SE  df t.ratio p.value
 female - male     -631 35.7 327 -17.659  <.0001

Results are averaged over the levels of: species 

The case of the unbalanced design

mod4 <- lm(body_mass_g ~ sex * species, data = penguins)
anova(mod4)
Analysis of Variance Table

Response: body_mass_g
             Df    Sum Sq  Mean Sq F value    Pr(>F)    
sex           1  38878897 38878897 406.145 < 2.2e-16 ***
species       2 143401584 71700792 749.016 < 2.2e-16 ***
sex:species   2   1676557   838278   8.757 0.0001973 ***
Residuals   327  31302628    95727                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod5 <- lm(body_mass_g ~ species * sex, data = penguins)
anova(mod5)
Analysis of Variance Table

Response: body_mass_g
             Df    Sum Sq  Mean Sq F value    Pr(>F)    
species       2 145190219 72595110 758.358 < 2.2e-16 ***
sex           1  37090262 37090262 387.460 < 2.2e-16 ***
species:sex   2   1676557   838278   8.757 0.0001973 ***
Residuals   327  31302628    95727                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What can you say about the variance decomposition (sums of squares) in the two previous models?

Solution

Depending on the order of the factors in the formula, the corresponding sums of squares are different.


When the design is not orthogonal, there exist 3 main strategies to compare the different sources of variability (Type I, II, III). The most important message here is that when the interaction is strong and/or significant, the main effects cannot / should not be interpreted separately. So the first thing to do in any ANOVA analysis is to check whether the interaction is significant.

Type-I sum of squares

This is the default test implemented in the anova() framework. The Type-I strategy consists in comparing models that grow in a nested way. The results of a Type-I analysis depend on the order of the factors in the model, hence it is not adapted to unbalanced multiple-factor ANOVAs.

If the model is specified in the following way body_mass_g ~ species * sex, the tested models are (in this order):

  • “species effect”: body_mass_g ~ 1 vs body_mass_g ~ 1 + species, i.e. is there a species effect whatever the other factors?
  • “sex effect”: body_mass_g ~ 1 + species vs body_mass_g ~ 1 + species + sex, i.e. does the sex effect carry much information after the species effect has been specified? is there a sex effect in the residuals when we already accounted for the species effect?
  • “interaction”: body_mass_g ~ 1 + species + sex vs body_mass_g ~ 1 + species + sex + species:sex, i.e. does the interaction carry much information after the species and the sex effects have been specified? is there an interaction in the residuals when we already accounted for the main effects?
anova(lm(body_mass_g ~ species * sex, data = penguins))
Analysis of Variance Table

Response: body_mass_g
             Df    Sum Sq  Mean Sq F value    Pr(>F)    
species       2 145190219 72595110 758.358 < 2.2e-16 ***
sex           1  37090262 37090262 387.460 < 2.2e-16 ***
species:sex   2   1676557   838278   8.757 0.0001973 ***
Residuals   327  31302628    95727                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
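
For information, the sequential (Type-I) test of the sex effect can be related to an explicit comparison of nested models (a sketch: the extra sum of squares matches the sex line above, while the F value differs slightly because anova() on the full model uses the residual mean square of the model including the interaction):

# nested model comparison: does adding sex improve the model with species only?
m_species     <- lm(body_mass_g ~ species,       data = penguins)
m_species_sex <- lm(body_mass_g ~ species + sex, data = penguins)
anova(m_species, m_species_sex)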

By deduction, if the model is specified in the other way: body_mass_g ~ sex * species, what are the tested models in the following output?

anova(lm(body_mass_g~sex*species, data= penguins))
Analysis of Variance Table

Response: body_mass_g
             Df    Sum Sq  Mean Sq F value    Pr(>F)    
sex           1  38878897 38878897 406.145 < 2.2e-16 ***
species       2 143401584 71700792 749.016 < 2.2e-16 ***
sex:species   2   1676557   838278   8.757 0.0001973 ***
Residuals   327  31302628    95727                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Solution

  • “sex effect”: body_mass_g ~ 1 vs body_mass_g ~ 1 + sex, i.e. is there a sex effect whatever the other factors?
  • “species effect”: body_mass_g ~ 1 + sex vs body_mass_g ~ 1 + sex + species, i.e. does the species effect carry much information after the sex effect has been specified? is there a species effect in the residuals when we already accounted for the sex effect?
  • “interaction”: body_mass_g ~ 1 + sex + species vs body_mass_g ~ 1 + sex + species + sex:species, i.e. does the interaction carry much information after the sex and the species effects have been specified? is there an interaction in the residuals when we already accounted for the main effects?

What should we have verified before interpreting the Fisher test results?

Solution

We should have verified the Gaussian (with mean=0) and homoskedasticity assumptions on the residuals.

mod4 <- lm(body_mass_g ~ species * sex, data = penguins)

ggplot(
    penguins %>% mutate(resid = residuals(mod4)),
    aes(x = interaction(species, sex), y = resid)) +
    geom_point() + 
    geom_hline(yintercept = 0, color = "red") +
    theme_bw()

check_normality(mod4)
OK: residuals appear as normally distributed (p = 0.945).
check_heteroskedasticity(mod4)
OK: Error variance appears to be homoscedastic (p = 0.629).

Factor order matters in the formula defining the model when using Type-I sum of squares.

Type-II sum of squares

The Type-II strategy consists in considering that, unfortunately, there is always some kind of aliasing. This strategy is valid if the interaction is not “too strong”. Type-II sums of squares can be computed by using the aov() function and then the Anova() function (with an uppercase A) from the car package:

mod5 <- aov(body_mass_g~sex*species, data=penguins)
mod5
Call:
   aov(formula = body_mass_g ~ sex * species, data = penguins)

Terms:
                      sex   species sex:species Residuals
Sum of Squares   38878897 143401584     1676557  31302628
Deg. of Freedom         1         2           2       327

Residual standard error: 309.3973
Estimated effects may be unbalanced
library(car)
Anova(mod5)
Anova Table (Type II tests)

Response: body_mass_g
               Sum Sq  Df F value    Pr(>F)    
sex          37090262   1 387.460 < 2.2e-16 ***
species     143401584   2 749.016 < 2.2e-16 ***
sex:species   1676557   2   8.757 0.0001973 ***
Residuals    31302628 327                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, the testing strategy is the following

  • “sex effect”: body_mass_g ~ 1 + species vs body_mass_g ~ 1 + species + sex, i.e. is there a sex effect once the species effect has been accounted for (ignoring the interaction)?
  • “species effect”: body_mass_g ~ 1 + sex vs body_mass_g ~ 1 + sex + species, i.e. is there a species effect once the sex effect has been accounted for (ignoring the interaction)?
  • “interaction”: body_mass_g ~ 1 + species + sex vs body_mass_g ~ 1 + species + sex + species:sex, i.e. does the interaction carry much information after the species and the sex effects have been specified? is there an interaction in the residuals when we already accounted for the main effects?

The idea of the Type-II strategy is that each main effect is tested after accounting for the other main effect (ignoring the interaction). Hence, if the factors are very confounded, the species factor already carries much of the information of the sex factor: when confronting body_mass_g ~ 1 + species with body_mass_g ~ 1 + species + sex, adding sex brings little extra information and the null hypothesis will not be rejected, because of aliasing.

When the interaction is strong, the only proper way to compare the levels of the main effects is to do the comparison conditionally on the other factor, such as:

pairs(lsmeans(mod5, ~sex | species))
species = Adelie:
 contrast      estimate   SE  df t.ratio p.value
 female - male     -675 51.2 327 -13.174  <.0001

species = Chinstrap:
 contrast      estimate   SE  df t.ratio p.value
 female - male     -412 75.0 327  -5.487  <.0001

species = Gentoo:
 contrast      estimate   SE  df t.ratio p.value
 female - male     -805 56.7 327 -14.188  <.0001
plot(pairs(lsmeans(mod5, ~ sex | species)))

pairs(lsmeans(mod5, ~species | sex))
sex = female:
 contrast           estimate   SE  df t.ratio p.value
 Adelie - Chinstrap     -158 64.2 327  -2.465  0.0377
 Adelie - Gentoo       -1311 54.4 327 -24.088  <.0001
 Chinstrap - Gentoo    -1153 66.8 327 -17.246  <.0001

sex = male:
 contrast           estimate   SE  df t.ratio p.value
 Adelie - Chinstrap      105 64.2 327   1.627  0.2357
 Adelie - Gentoo       -1441 53.7 327 -26.855  <.0001
 Chinstrap - Gentoo    -1546 66.2 327 -23.345  <.0001

P value adjustment: tukey method for comparing a family of 3 estimates 
plot(pairs(lsmeans(mod5, ~species | sex)))

Type-III sum of squares

We will not cover this one.

Coming back to the species/island model

Reminder: we considered the following model body_mass_g ~ species + island.

Is the design balanced?

Solution

No, it is not:

table(penguins$species, penguins$island)
           
            Biscoe Dream Torgersen
  Adelie        44    55        47
  Chinstrap      0    68         0
  Gentoo       119     0         0

Here are the two models (with interactions) body_mass_g ~ species * island and body_mass_g ~ island * species

mod3a <- lm(body_mass_g ~ species * island, data = penguins)
anova(mod3a)
Analysis of Variance Table

Response: body_mass_g
           Df    Sum Sq  Mean Sq  F value Pr(>F)    
species     2 145190219 72595110 339.8328 <2e-16 ***
island      2      2064     1032   0.0048 0.9952    
Residuals 328  70067383   213620                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod3b <- lm(body_mass_g ~ island * species, data = penguins)
anova(mod3b)
Analysis of Variance Table

Response: body_mass_g
           Df   Sum Sq  Mean Sq F value    Pr(>F)    
island      2 83740680 41870340  196.00 < 2.2e-16 ***
species     2 61451603 30725801  143.83 < 2.2e-16 ***
Residuals 328 70067383   213620                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What can you say about these two models? Why is there no interaction term?

Solution

First, there is no interaction term because some species/island combinations are not represented in the data, so the interaction could not be estimated and was not considered.

Second, the model body_mass_g ~ species * island (or equivalently2 body_mass_g ~ species + island) considers the following:

  • “species effect”: body_mass_g ~ 1 vs body_mass_g ~ 1 + species, i.e. is there a species effect whatever the other factors?
  • “island effect”: body_mass_g ~ 1 + species vs body_mass_g ~ 1 + species + island, i.e. does the island effect carry much information after the species effect has been specified?

whereas the model body_mass_g ~ island * species (or equivalently3 body_mass_g ~ island + species) considers the following:

  • “island effect”: body_mass_g ~ 1 vs body_mass_g ~ 1 + island, i.e. is there an island effect whatever the other factors?
  • “species effect”: body_mass_g ~ 1 + island vs body_mass_g ~ 1 + island + species, i.e. does the species effect carry much information after the island effect has been specified?

Therefore, we have to be very cautious with the order of the factors, because we could find an island effect and a species effect, instead of just a species effect (as suggested by the boxplot).


Balanced design

We will consider another dataset giving the survival time of animals contaminated with different poisons and given different treatments (used to counteract the poison) (more info here).

You can load the dataset with the following command:

poisons <- read_csv("https://gitbio.ens-lyon.fr/LBMC/hub/formations/ens_l3_stats/-/raw/main/tutorial_4_anova/poisons.csv?inline=false")

Note: the read_csv() function is loaded in tidyverse but comes from the readr package.

poisons
# A tibble: 48 × 3
    time poison treat
   <dbl>  <dbl> <chr>
 1  0.31      1 A    
 2  0.45      1 A    
 3  0.46      1 A    
 4  0.43      1 A    
 5  0.36      2 A    
 6  0.29      2 A    
 7  0.4       2 A    
 8  0.23      2 A    
 9  0.22      3 A    
10  0.21      3 A    
# ℹ 38 more rows

Check whether the design is balanced.

Solution

Yes, it is:

table(poisons$poison, poisons$treat)
   
    A B C D
  1 4 4 4 4
  2 4 4 4 4
  3 4 4 4 4

We consider the model time ~ poison * treat and the model time ~ treat * poison:

mod6 <- lm(time ~ factor(poison) * treat, data = poisons)
anova(mod6)
Analysis of Variance Table

Response: time
                     Df  Sum Sq Mean Sq F value    Pr(>F)    
factor(poison)        2 1.03301 0.51651 23.2217 3.331e-07 ***
treat                 3 0.92121 0.30707 13.8056 3.777e-06 ***
factor(poison):treat  6 0.25014 0.04169  1.8743    0.1123    
Residuals            36 0.80072 0.02224                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod6)$r.squared # R2
[1] 0.733543
mod7 <- lm(time ~ treat * factor(poison), data = poisons)
anova(mod7)
Analysis of Variance Table

Response: time
                     Df  Sum Sq Mean Sq F value    Pr(>F)    
treat                 3 0.92121 0.30707 13.8056 3.777e-06 ***
factor(poison)        2 1.03301 0.51651 23.2217 3.331e-07 ***
treat:factor(poison)  6 0.25014 0.04169  1.8743    0.1123    
Residuals            36 0.80073 0.02224                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod7)$r.squared # R2
[1] 0.733543

Note: we use factor(poison) because the poison variable is encoded as a number, and R would otherwise treat it as a continuous numeric variable in the lm() function.

What can you say regarding the variance decomposition in this balanced case?

Solution

Now, in this balanced design setting, the sums of squares are the same whatever the order of the factors in the formula.

When the design is balanced, it is orthogonal and it is possible to unambiguously separate all the variability sources. In that case, the sums of squares are equivalent and we can independently interpret the different variability sources, whatever the order of the factors.
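
As a cross-check (a sketch), since the design is balanced and hence orthogonal, the Type-II table from the car package should give the same sums of squares as the sequential anova() tables above:

# in a balanced (orthogonal) design, Type-I and Type-II sums of squares coincide
library(car)
Anova(mod6)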


Footnotes

  1. “with respect to”↩︎

  2. only because the interaction is not accounted for↩︎

  3. only because the interaction is not accounted for↩︎