Tutorial 6: Linear Regression

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This tutorial is directly adapted from Lieven Clement and Milan Malfait's tutorial on linear regression in their “Practical Statistics for the Life Sciences” course.

Requirements

We first load the required packages:

# required packages
library(tidyverse)

library(Rmisc)  # functions useful for data analysis and utility operations
library(GGally) # ggplot2 extension
library(performance)  # evaluate models

Note: we will need the ggplot2, dplyr, readr, tidyr and tibble packages from the tidyverse meta-package. Thus, if you encounter issues when installing tidyverse, you can install only these five packages.

Breast cancer dataset

This practical uses a subset of data from a more general study [1].

[1] Christos Sotiriou, Pratyaksha Wirapati, Sherene Loi, Adrian Harris, Steve Fox, Johanna Smeds, Hans Nordgren, Pierre Farmer, Viviane Praz, Benjamin Haibe-Kains, Christine Desmedt, Denis Larsimont, Fatima Cardoso, Hans Peterse, Dimitry Nuyten, Marc Buyse, Marc J. Van de Vijver, Jonas Bergh, Martine Piccart, Mauro Delorenzi, Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis, JNCI: Journal of the National Cancer Institute, Volume 98, Issue 4, 15 February 2006, Pages 262–272, https://doi.org/10.1093/jnci/djj052

We consider 32 breast cancer patients with estrogen-receptor-positive tumors. They received tamoxifen chemotherapy.

Loading and introducing the data

The data are available in this repository and can be loaded with the following command:

brca <- read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/breastcancer.csv")
brca
# A tibble: 32 × 10
   sample_name filename     treatment    er grade  node  size   age  ESR1 S100A8
   <chr>       <chr>        <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 OXFT_209    gsm65344.ce… tamoxifen     1     3     1   2.5    66 1939.  207. 
 2 OXFT_1769   gsm65345.ce… tamoxifen     1     1     1   3.5    86 2752.   37.0
 3 OXFT_2093   gsm65347.ce… tamoxifen     1     1     1   2.2    74  379. 2364. 
 4 OXFT_1770   gsm65348.ce… tamoxifen     1     1     1   1.7    69 2532.   23.6
 5 OXFT_1342   gsm65350.ce… tamoxifen     1     3     0   2.5    62  141. 3219. 
 6 OXFT_2338   gsm65352.ce… tamoxifen     1     3     1   1.4    63 1495.  108. 
 7 OXFT_2341   gsm65353.ce… tamoxifen     1     1     1   3.3    76 3406.   14.0
 8 OXFT_1902   gsm65354.ce… tamoxifen     1     3     0   2.4    61 2813.   68.4
 9 OXFT_1982   gsm65355.ce… tamoxifen     1     1     0   1.7    62  950.   74.2
10 OXFT_5210   gsm65356.ce… tamoxifen     1     3     0   3.5    65 1053.  182. 
# ℹ 22 more rows

The recorded variables include:

  • grade: histological grade of tumor (grade 1 vs 3),
  • node: lymph node status (0: not affected, 1: lymph nodes affected and removed),
  • size: tumor size in cm,
  • ESR1 and S100A8: gene expression measurement for 2 genes in tumor biopsy (microarray technology)

Notes:

  • Gene ESR1 (estrogen receptor 1) is expressed in ± 75% of breast cancer tumors.
  • Patients who express ESR1 respond better to treatment by hormone therapy.
  • Tamoxifen interacts with ER and modulates its expression.
  • Proteins of the S100 family are low-molecular-weight proteins with calcium-binding sites. Their expression is often dysregulated in cancer. The expression of S100A8 represses the immune system in the tumor and creates an environment of inflammation that promotes tumor growth.

Here is a representation of the distribution of the S100A8 gene expression:

brca %>% ggplot(aes(x = "", y = S100A8)) +
    geom_boxplot() +
    xlab("") + theme_bw()

What can you say about the observed values for S100A8 gene expression?

Solution

The measurement of the expression of gene S100A8 apparently failed for 3 individuals. These individuals can be considered outliers.


We remove the outliers:

brcaSubset <- brca %>% filter(S100A8 < 2000)

Here is the new boxplot representation of S100A8 gene expression after removing the outliers:

brcaSubset %>% ggplot(aes(x = "", y = S100A8)) +
    geom_boxplot() +
    xlab("") + theme_bw()

Association between ESR1 and S100A8 expression

We will study the possible relationship between the expression of the ESR1 and S100A8 genes:

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    theme_bw()

What kind of method are we going to use?

Solution

We want to investigate the possible relationship between two quantitative variables; this is a regression problem.


Regression

How do we formalize the regression framework?

Solution

We start with a linear regression between two variables $(X_i, Y_i)$, measured on each subject $i = 1, \ldots, n$.

For instance, we may consider the following setting:

  • Response Y: S100A8 expression
  • Predictor X: ESR1 expression

The regression model

What is the principle of the regression model?

Solution

The principle of regression (not necessarily linear) is to model the variations of the output variable Y (observations) by a contribution of fixed covariates X in the standard additive framework: observation = signal + noise

With mathematical notation: $Y_i = g(x_i) + \varepsilon_i$

$g(x)$ is called the regression function; it models the expected outcome $Y_i$ depending on the input variable $x_i$.


Linear regression

What is the principle of the linear regression model?

Solution

To obtain an accurate and interpretable model, the simplest choice for $g(x)$ is a linear function with unknown parameters:

$g(x) = \beta_0 + \beta_1 x$ with unknown intercept $\beta_0$ and slope $\beta_1$.

The linear model is based on an assumption regarding the form of the relationship between Y and X, and on the i.i.d. assumption for the error terms.

We can also add a Gaussian assumption on the error terms.

These assumptions must be checked afterwards.

The linear model is generally written $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ where $\varepsilon_i$ is a noise/error term. We generally assume that all $\varepsilon_i$ are independent and identically distributed with mean 0 and constant variance $\sigma^2$ (homoscedasticity).
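To make this concrete, here is a minimal simulation sketch (the parameter values are arbitrary choices for illustration): we generate data from $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ and check that lm() approximately recovers the parameters.

# simulate from the linear model (illustrative values: beta0 = 1, beta1 = 0.5, sigma = 2)
set.seed(42)
n <- 50
x <- runif(n, min = 0, max = 10)
eps <- rnorm(n, mean = 0, sd = 2)  # i.i.d. Gaussian errors
y <- 1 + 0.5 * x + eps
coef(lm(y ~ x))  # estimates should be close to (1, 0.5)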


Use

What are the use cases of the linear regression model?

Solution

  • Prediction: when Y is unknown but $X = x$ is known, we can predict Y using $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

  • Association: biological relation between variable X and response Y


What is the interpretation of the coefficients in the linear regression model, in particular when we draw the following regression line?

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) + 
    theme_bw()

Solution

Intercept: $\beta_0$ is the value of Y predicted by the model when $x = 0$.

Slope: it quantifies how much Y will increase/decrease when x increases slightly according to the model, i.e. $g(x+\delta) - g(x) = \beta_0 + \beta_1(x+\delta) - \beta_0 - \beta_1 x = \beta_1 \delta$

$\beta_1$ quantifies the strength of the linear association between Y and X.


Parameter estimation

(optional) Least squares

Parameters $\beta_0$ and $\beta_1$ are unknown.

They are estimated using all the data points by the least-squares method.

How does the least-squares method work?

Solution

To determine the best fitting line:

  • Use a sample $(x_i, y_i)_{i=1,\ldots,n}$ containing pairs of observations for X and Y
  • Purpose: for each $x_i$, we want the predicted value $\hat{y}_i = \beta_0 + \beta_1 x_i$ to be as close as possible to the observed value $y_i$
  • Choose $\beta_0$ and $\beta_1$ so that the sum of squared distances between predicted and observed points becomes as small as possible, i.e. the residual sum of squares (RSS): $\text{RSS} = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^n e_i^2$ with residuals $e_i$ the vertical distances from the observations to the fitted regression line

On the following graph, in blue is the linear regression line estimated by the least-squares method, and in red are the corresponding observed residuals $\hat{e}_i$ in the fitted model.


Estimators that minimise RSS

What are the values of $\beta_0$ and $\beta_1$ that minimize the RSS?

Solution

$\hat{\beta}_1 = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\text{cov}(x,y)}{\text{var}(x)} = \text{cor}(x,y) \frac{s_y}{s_x}$

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where $\text{cov}(x,y)$ is the empirical covariance between X and Y, $\text{var}(x)$ the empirical variance of X, $\text{cor}(x,y)$ the empirical correlation between X and Y, $s_x$ and $s_y$ respectively the empirical standard deviations of X and Y, and $\bar{x}$ and $\bar{y}$ respectively the empirical means of X and Y.

Note: the slope of the least squares fit is proportional to the correlation between the response and the predictor.
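As a sanity check, we can compute these closed-form estimators by hand in R and compare them with the output of lm() (a sketch, assuming the brcaSubset data from above):

# least-squares estimates from the closed-form formulas
x <- brcaSubset$ESR1
y <- brcaSubset$S100A8
beta1_hat <- cov(x, y) / var(x)             # slope
beta0_hat <- mean(y) - beta1_hat * mean(x)  # intercept
c(beta0_hat, beta1_hat)                     # ~ 208.47 and -0.059
coef(lm(S100A8 ~ ESR1, data = brcaSubset))  # same values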


How can we use the fitted model (i.e. the estimates of $\beta_0$ and $\beta_1$)?

Solution

The fitted model allows us to:

  • predict the response for subjects with a given value x of the predictor: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

  • assess how the mean response differs between two groups of subjects that differ by $\delta$ units in the predictor: $\hat{g}(x+\delta) - \hat{g}(x) = \hat{\beta}_1 \delta$


Breast cancer example

Here is the linear regression model fitted in R:

lm1 <- lm(S100A8 ~ ESR1, data = brcaSubset)
summary(lm1)

Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Residuals:
   Min     1Q Median     3Q    Max 
-95.43 -34.81  -6.79  34.23 145.21 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 208.47145   28.57207   7.296 7.56e-08 ***
ESR1         -0.05926    0.01212  -4.891 4.08e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.91 on 27 degrees of freedom
Multiple R-squared:  0.4698,    Adjusted R-squared:  0.4502 
F-statistic: 23.93 on 1 and 27 DF,  p-value: 4.078e-05

What is the equation of the linear regression line?

Solution

$\hat{y} = 208.47 - 0.059 x$


How do we interpret the evolution of S100A8 gene expression as a function of ESR1 expression?

Solution

The expected expression of S100A8 is on average 59 units lower for patients with an ESR1 expression level that is 1000 units higher.


What would be the predicted S100A8 expression level for patients with an ESR1 expression level of 2000? and an ESR1 expression level of 4000?

Solution

Expected S100A8 expression level for patients with an ESR1 expression level of 2000: $208.47 - 0.059 \times 2000 = 89.94$

Expected S100A8 expression level for patients with an ESR1 expression level of 4000: $208.47 - 0.059 \times 4000 = -28.58$ (!!!)
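We can check these values with predict(), using the lm1 model fitted above:

# predictions at ESR1 = 2000 and ESR1 = 4000
predict(lm1, newdata = data.frame(ESR1 = c(2000, 4000)))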


Reminder: here is a graphical representation of the data and the fitted linear regression line:

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) + 
    theme_bw()

What is the problem with the predicted S100A8 expression level for patients with an ESR1 expression level of 4000?

Solution

WARNING when you extrapolate! The assumption of linearity can only be assessed within the range of the available and observed data, and the linear regression model may predict unrealistic values for your response (here a negative expression level).

Note:

  • interpolation: prediction within the range of the available and observed data
  • extrapolation: prediction outside the range of the available and observed data

Statistical inference

To draw conclusions based on the regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, we need to know

  • how the least squares parameter estimators vary from sample to sample,
  • how they deviate under the null hypothesis that there is no association between predictor and response.

How could we proceed?

Solution

This requires a statistical model.

We need to model the distribution of Y given X explicitly (conditional distribution).


Assumptions

For the moment, we have made the following assumption:

  1. There is a linear relationship between the response and the predictor.

Besides linearity we need additional assumptions!

What other assumptions are we going to make?

Solution

  2. Independence: observations $(X_1,Y_1), \ldots, (X_n,Y_n)$ are made on n independent subjects (which is required to estimate the variance)

  3. Homoscedasticity or equal variances: observations vary with equal variance around the regression line

  • Residuals $\varepsilon_i$ have equal variance for each individual i (and thus for any $X_i = x$)
  • $\text{var}(\varepsilon_i) = \sigma^2$ for any x value
  • $\sigma$ is referred to as the residual standard deviation

  4. Normality: the residuals $\varepsilon_i$ are normally distributed with mean 0

(optional) Modeling the distribution of Y

What can you derive from assumptions 2, 3 and 4?

Solution

$\varepsilon_i \text{ i.i.d. } N(0, \sigma^2)$


And from assumptions 1, 2, 3, 4?

Solution

  • The conditional distribution of the response is Gaussian: $Y_i | X_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)$

  • Using these assumptions we can derive the variance of the estimators: $\sigma^2_{\hat{\beta}_0} = \frac{\sum_{i=1}^n X_i^2}{\sum_{i=1}^n (X_i - \bar{X})^2} \times \frac{\sigma^2}{n}$ and $\sigma^2_{\hat{\beta}_1} = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}$

  • Using the Gaussian assumption we can show that the parameter estimators are normally distributed: $\hat{\beta}_0 \sim N(\beta_0, \sigma^2_{\hat{\beta}_0})$ and $\hat{\beta}_1 \sim N(\beta_1, \sigma^2_{\hat{\beta}_1})$


(optional) High spread of X improves the precision

We recall the variance of the estimator of $\beta_1$: $\sigma^2_{\hat{\beta}_1} = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}$

What can you say about the two regression examples in the next plot regarding the variance of the estimate $\hat{\beta}_1$?

Solution

In the first case, the observations for X are widely spread, so the sum of squared deviations from the empirical mean $\sum_{i=1}^n (X_i - \bar{X})^2$ will be larger and the variance of $\hat{\beta}_1$ will be lower, hence a better precision when estimating $\beta_1$.

In the second case, the observations for X are not widely spread, so the sum of squared deviations from the empirical mean $\sum_{i=1}^n (X_i - \bar{X})^2$ will be smaller and the variance of $\hat{\beta}_1$ will be higher, hence a lower precision.


The conditional variance $\sigma^2$ is unknown. How can we estimate it?

Solution

  • The estimate $\hat{\sigma}^2$ of $\sigma^2$ is based on the mean squared error (MSE)

  • The MSE is the average squared error: $\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = \frac{1}{n} \sum_{i=1}^n e_i^2$

  • The estimate of $\sigma^2$ is given by: $\hat{\sigma}^2 = \frac{n}{n-2} \text{MSE} = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = \frac{1}{n-2} \sum_{i=1}^n e_i^2$

  • This estimator is based on independence (assumption 2) and equality of the variance (assumption 3).

  • Note the division by $n - 2$.
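As an illustration, we can recompute $\hat{\sigma}$ from the residuals of the lm1 model fitted above and compare it with the "Residual standard error" reported by summary():

# estimate sigma from the residuals (n - 2 degrees of freedom)
e <- residuals(lm1)
n <- length(e)
sqrt(sum(e^2) / (n - 2))  # ~ 59.91
sigma(lm1)                # the same value, extracted directly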


Upon estimation of $\sigma^2$, what are the standard errors (i.e. estimates of the standard deviations) of $\hat{\beta}_0$ and $\hat{\beta}_1$?

Solution

Upon estimation of $\sigma^2$ we obtain the following standard errors:

$SE_{\hat{\beta}_0} = \hat{\sigma}_{\hat{\beta}_0} = \hat{\sigma} \sqrt{\frac{\sum_{i=1}^n X_i^2}{n \sum_{i=1}^n (X_i - \bar{X})^2}}$

$SE_{\hat{\beta}_1} = \hat{\sigma}_{\hat{\beta}_1} = \hat{\sigma} \sqrt{\frac{1}{\sum_{i=1}^n (X_i - \bar{X})^2}}$

Reminder: $\hat{\sigma}^2 = \frac{n}{n-2} \text{MSE}$
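A quick check of these formulas in R, assuming lm1 and brcaSubset from above:

# standard errors of the intercept and slope from the closed-form formulas
x <- brcaSubset$ESR1
n <- length(x)
sig <- sigma(lm1)
SE_b0 <- sig * sqrt(sum(x^2) / (n * sum((x - mean(x))^2)))
SE_b1 <- sig * sqrt(1 / sum((x - mean(x))^2))
c(SE_b0, SE_b1)  # ~ 28.57 and 0.0121, as in the summary() table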


How can we use the previous results on the distribution of the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ to build confidence intervals and tests?

Solution

  • Thanks to the distribution of the parameter estimators, we can construct confidence intervals and tests using $T = \frac{\hat{\beta}_k - \beta_k}{SE(\hat{\beta}_k)}$ with $k = 0, 1$

  • If all assumptions are valid, T follows a Student $t_{n-2}$ distribution with $n - 2$ degrees of freedom

  • If normality does not hold, but independence, linearity and equality of variance do and the dataset is large, we can rely on the Central Limit Theorem


Breast cancer example

Reminder:

lm1 <- lm(S100A8 ~ ESR1, data = brcaSubset)
lm1

Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Coefficients:
(Intercept)         ESR1  
  208.47145     -0.05926  

The association between S100A8 and ESR1 gene expression is negative.

How can we generalize the effect observed in the sample to the population using the confidence interval on $\hat{\beta}_1$?

Solution

$[\hat{\beta}_1 - t_{n-2,\alpha/2} SE_{\hat{\beta}_1},\ \hat{\beta}_1 + t_{n-2,\alpha/2} SE_{\hat{\beta}_1}]$

confint(lm1)
                   2.5 %       97.5 %
(Intercept) 149.84639096 267.09649989
ESR1         -0.08412397  -0.03440378
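We can reproduce this interval by hand from the formula above (a sketch, assuming lm1 from above):

# 95% confidence interval for the slope
est <- coef(summary(lm1))
tcrit <- qt(0.975, df = df.residual(lm1))
est["ESR1", "Estimate"] + c(-1, 1) * tcrit * est["ESR1", "Std. Error"]  # ~ [-0.0841, -0.0344]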

(optional) Hypothesis test

How could we statistically test the significance of the effect of ESR1 gene expression on S100A8 gene expression?

Solution

We can restate the hypothesis of association between the S100A8 and ESR1 gene expression in the form of a null hypothesis on the parameters of the model.

  • Under the null hypothesis of the absence of an association between the expression of both genes: $H_0: \beta_1 = 0$
  • Under the alternative hypothesis, there is an association between the expression of both genes: $H_1: \beta_1 \neq 0$

Test statistic: $T = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$

Under $H_0$, the statistic follows a t-distribution with $n - 2$ degrees of freedom.
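The t-statistic and the two-sided p-value can be recomputed by hand (a sketch, assuming lm1 from above):

# t-test for H0: beta1 = 0
est <- coef(summary(lm1))
tstat <- est["ESR1", "Estimate"] / est["ESR1", "Std. Error"]   # ~ -4.891
2 * pt(abs(tstat), df = df.residual(lm1), lower.tail = FALSE)  # ~ 4.08e-05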


Breast cancer example

summary(lm1)

Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Residuals:
   Min     1Q Median     3Q    Max 
-95.43 -34.81  -6.79  34.23 145.21 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 208.47145   28.57207   7.296 7.56e-08 ***
ESR1         -0.05926    0.01212  -4.891 4.08e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.91 on 27 degrees of freedom
Multiple R-squared:  0.4698,    Adjusted R-squared:  0.4502 
F-statistic: 23.93 on 1 and 27 DF,  p-value: 4.078e-05

What can you say about association between the S100A8 and ESR1 gene expression?

Solution

The association between the S100A8 and ESR1 expression is extremely significant ($p \ll 0.001$).


What did we forget to check?

Solution

  • But, we first have to check all assumptions!
  • Otherwise the conclusions based on the statistical test and the CI can be incorrect.

Assumption assessment

What assumptions should we check?

Solution

  • Independence: design
  • Linearity: inference is useless if the association is not linear
  • Homoscedasticity: inference/p-value is incorrect if data are heteroscedastic
  • Normality: inference/p-value is incorrect if data are not normally distributed (especially in small samples)

Linearity

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    theme_bw()

Residual analysis

The assumption of linearity is typically assessed using a residual plot:

  • predictions $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ on the X-axis
  • residuals $\hat{e}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$ on the Y-axis

plot(lm1, which = 1)

With the residual plot, we can also check the assumption that the average of the residuals is around 0.

Homoskedasticity (equal variances)

Residuals and squared residuals carry information on the residual variability:

  • An association with the predictors is an indication of heteroscedasticity
  • Scatterplot of $\hat{e}_i$ versus $x_i$ or the predictions $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
  • Scatterplot of the standardized residuals versus $x_i$ or the predictions

performance::check_heteroskedasticity(lm1)
Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.009).

(caution with small sample size)

Normality

  • If the sample size is large, the estimators are normally distributed even if the observations are not normally distributed: central limit theorem
  • How many observations are needed? It depends on the shape and magnitude of the deviations
  • Assumption: data are normally distributed conditional on X: $Y_i | X_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)$
  • QQ-plot of the residuals $\hat{e}_i$

Example of non-Gaussian data and approximately Gaussian residuals:

QQ-plot for our model:

plot(lm1, which = 2)

performance::check_normality(lm1)
OK: residuals appear as normally distributed (p = 0.212).

(caution with small sample size)

Model evaluation

You can use the default R plot() function to get a visual summary of your model residuals:

plot(lm1)

or you can use the check_model() function from the performance package:

performance::check_model(lm1)

Invalid assumptions?

If the assumptions regarding the model are not verified, what can we do?

Solution

We can either:

  • use the fitted model to do prediction but without statistical inference (so no test, nor uncertainty quantification); we would then need to rely on geometric model evaluation metrics like the $R^2$ coefficient and analysis of variance (cf. below), or on prediction error quantification;

  • apply a transformation to the data.

Transformation of the predictor X does not change the distribution of Y for given X:

  • it is not useful to obtain homoscedasticity or Normal distribution
  • it is useful to improve linearity when normality and homoscedasticity are valid
  • it is also possible to include higher-order terms $X^2, X^3, \ldots$: $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \ldots + \varepsilon_i$

Transformation of the response Y can be useful to obtain normality and homoscedasticity:

  • $\sqrt{Y}$
  • $\log(Y)$
  • $1/Y$
  • etc.

Breast cancer example

What are the issues with the linear regression model in the case of our breast cancer data?

Solution

Problems with

  • heteroscedasticity
  • possibly deviations from normality (skewed to the right)
  • negative concentration predictions are theoretically impossible
  • non-linearity

This is often the case for concentration and intensity measurements:

  • These are often log-normally distributed (normal distribution upon log-transformation)
  • We also observed a kind of exponential relation with the smoother
  • In the gene expression literature, the log2 transformation is often adopted
  • gene expression on log2 scale: differences on the log scale are fold changes on the original scale!

We now go back to the original data, including the outliers that we removed earlier.

Here is a scatterplot of the original data:

brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth() +
    theme_bw()

And here is a scatterplot of the log2-transformed data:

brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
    geom_point() +
    geom_smooth() +
    theme_bw()

Note: the blue line represents a non-linear smoother with its confidence band.

What can you say?

Solution

The linear relationship between S100A8 and ESR1 gene expression is not clear at all when using the original measurements, but appears more clearly when using the log2-transformed measurements.


Here is the linear regression model for the transformed gene expression:

lm2 <- lm(log2(S100A8) ~ log2(ESR1), data = brca)
performance::check_model(lm2)

performance::check_normality(lm2)
OK: residuals appear as normally distributed (p = 0.888).
performance::check_heteroskedasticity(lm2)
OK: Error variance appears to be homoscedastic (p = 0.430).
summary(lm2)

Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.94279 -0.66537  0.08124  0.68468  1.92714 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,    Adjusted R-squared:  0.7874 
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12
confint(lm2)
                2.5 %    97.5 %
(Intercept) 20.128645 26.674023
log2(ESR1)  -1.921047 -1.308185

What can you say about it?

Solution

The linearity, normality and homoscedasticity assumptions now seem to be satisfied (at least we cannot significantly reject them).


Exploration of the model with log2 transformed data

How do we interpret the slope in the model with log2-transformed data?

Solution

A patient with an ESR1 expression that is one unit higher on the log2 scale than that of another patient has on average a log2 expression for S100A8 that is 1.61 units lower (95% CI [-1.92, -1.31]).

We denote by $(x_1, y_1)$ the respective gene expressions for ESR1 and S100A8 in patient 1, and similarly $(x_2, y_2)$ for patient 2.

And we assume that $\log_2(x_2) = \log_2(x_1) + 1$, i.e. $x_2 = 2 x_1$.

$\log_2(\hat{y}_1) = 23.401 - 1.615 \times \log_2(x_1)$

$\log_2(\hat{y}_2) = 23.401 - 1.615 \times \log_2(x_2)$

$\log_2\left(\frac{\hat{y}_2}{\hat{y}_1}\right) = \log_2(\hat{y}_2) - \log_2(\hat{y}_1) = -1.615 \times (\log_2(x_2) - \log_2(x_1)) = -1.615 \times 1 = -1.615$
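Equivalently, on the original scale the slope acts as a fold change (a sketch, assuming lm2 from above):

# a doubling of ESR1 multiplies the expected S100A8 by 2^beta1
2^coef(lm2)["log2(ESR1)"]  # ~ 0.33, i.e. roughly a 3-fold decrease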


Prediction

We can compute the predicted values $\hat{y}_i$ for each data point $x_i$ with the fitted model:

brca$pred_y <- predict(lm2)

We can compare the predicted and observed values:

ggplot(brca, aes(x = S100A8, y = pred_y)) +
    geom_point() +
    theme_bw()

What is happening? Are the predicted $\hat{y}_i$ close to the original $y_i$?

Solution

The predicted values are in the log-transformed space whereas the original Y (S100A8) values are not log-transformed.

If we log-transform the response S100A8, we can compare it with the predicted values in the same log-transformed space:

ggplot(brca, aes(x = log2(S100A8), y = pred_y)) +
    geom_point() +
    geom_abline(intercept = 0, slope = 1, linetype="dotted") + 
    coord_fixed() +
    theme_bw()

Note: the dotted line represents the line of equation $y = x$, and we see that the observed and predicted values are close.


We can also draw the regression line in the log-transformed space and represent the predicted values (blue crosses):

# extract the coefficients
hat_beta <- coef(lm2)
hat_beta_0 <- hat_beta[1]
hat_beta_1 <- hat_beta[2]

# plot
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
    geom_point() +
    geom_point(aes(y = pred_y), col = "blue", shape = 3) +
    geom_function(
        fun = function(x) return(hat_beta_0 + hat_beta_1*x), 
        col = "red") +
    theme_bw()

Back-transformation

We can transform back the predicted values from the model on the log2-transformed data with the inverse transformation ($z \mapsto 2^z$) and represent the regression line in the original data space:

brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_point(aes(y = 2^pred_y), col = "blue", shape = 3) +
    geom_function(
        fun = function(x) return(2^(hat_beta_0 + hat_beta_1*log2(x))), 
        col = "red") +
    theme_bw()

(optional) Inference on the mean outcome

A regression model can also be used for prediction: in particular, for the inference of the average outcome (i.e. the mean response) for a given value $X = x$, i.e. $\hat{g}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$

  • $\hat{g}(x)$ is an estimator of the conditional mean $E[Y | X = x]$
  • Since the parameter estimators are normally distributed and unbiased, the estimator $\hat{g}(x)$ is also normally distributed and unbiased.

$SE_{\hat{g}(x)} = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x - \bar{X})^2}{\sum_{i=1}^n (X_i - \bar{X})^2}}$

$T = \frac{\hat{g}(x) - g(x)}{SE_{\hat{g}(x)}} \sim t_{n-2}$

We can use the predict() function in R to compute the inferred average outcome for the response and the related confidence intervals.

  • newdata argument: predictor values (x-values) at which we want to calculate the mean response
  • interval="confidence" argument to obtain CI.
  • without newdata argument we perform predictions for all predictor values in the dataset used to fit the model
grid <- 140:4000
g <- predict(lm2, newdata = data.frame(ESR1 = grid), interval = "confidence")
head(g)
       fit      lwr      upr
1 11.89028 10.76082 13.01974
2 11.87370 10.74721 13.00019
3 11.85724 10.73370 12.98078
4 11.84089 10.72028 12.96151
5 11.82466 10.70696 12.94237
6 11.80854 10.69372 12.92336

Note: we do not have to transform the new data that we specified for the ESR1 expression because we fitted the model with a call to the lm() function and specified the transformation within the formula.

We represent the inferred mean outcome (in red) with the confidence interval upper and lower bounds (in grey) in the log2-transformed data space:

newdata <- data.frame(cbind(grid = log2(grid), g))
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata, col = "red") +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    theme_bw()

Back-transformation

We can transform back the inferred mean outcome from the model on the log2-transformed data with the inverse transformation ($z \mapsto 2^z$) and represent the inferred mean outcome (in red) with the confidence interval upper and lower bounds (in grey) in the original data space:

newdata <- data.frame(cbind(grid, 2^g))
brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata, col = "red") +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    theme_bw()

(optional) Prediction intervals

We can also make a prediction for the location of a new observation that would be collected in a new experiment for a patient with a particular value for their ESR1 expression.

It is important to notice that this experiment still has to be conducted. So we want to predict the non-observed individual expression value for a novel patient.

For a novel independent observation Y: $Y = g(x) + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$ and $\varepsilon$ independent of the observations in the sample $Y_1, \ldots, Y_n$.

We predict a new log2-S100A8 value for a patient with a known log2-ESR1 expression level x: $\hat{y}(x) = \hat{\beta}_0 + \hat{\beta}_1 \times x$

What is the relationship between the mean outcome and the predicted value for the response?

Solution

The estimated mean outcome and the prediction for a new observation are equal.

But their sampling distributions are different!

  • Uncertainty on the estimated mean outcome comes from the uncertainty on the estimated model parameters $\hat{\beta}_0$ and $\hat{\beta}_1$.
  • Uncertainty on a new observation $\hat{Y}$ comes from the uncertainty on the estimated mean plus additional uncertainty because the new observation will deviate around the mean!

$SE_{\hat{Y}(x)} = \sqrt{\hat{\sigma}^2 + \hat{\sigma}^2_{\hat{g}(x)}} = \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{X})^2}{\sum_{i=1}^n (X_i - \bar{X})^2}}$

$\frac{\hat{Y}(x) - Y}{SE_{\hat{Y}(x)}} \sim t_{n-2}$

  • Note that a prediction interval (PI) is an improved version of a reference interval when the model parameters are unknown: uncertainty on model parameters + t-distribution.

pred <- predict(lm2, newdata = data.frame(ESR1 = grid), interval = "prediction")
head(pred)
       fit      lwr      upr
1 11.89028 9.510524 14.27004
2 11.87370 9.495354 14.25205
3 11.85724 9.480288 14.23419
4 11.84089 9.465324 14.21646
5 11.82466 9.450461 14.19886
6 11.80854 9.435698 14.18138

We represent the predicted values with the prediction interval upper and lower bounds (in blue) and the confidence interval upper and lower bounds for the mean outcome (in grey) in the log2-transformed data space:

newdata <- data.frame(cbind(grid = log2(grid), g))
preddata <- data.frame(cbind(grid = log2(grid), pred))
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata) +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = lwr), preddata, color = "blue") +
    geom_line(aes(x = grid, y = upr), preddata, color = "blue") +
    theme_bw()

We can transform back the predicted values from the model on the log2-transformed data with the inverse transformation ($z \mapsto 2^z$) and represent the predicted values with the prediction interval upper and lower bounds (in blue) and the confidence interval upper and lower bounds for the mean outcome (in grey) in the original data space:

newdata <- data.frame(cbind(grid, 2^g))
preddata <- data.frame(cbind(grid, 2^pred))
brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata) +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = lwr), preddata, color = "blue") +
    geom_line(aes(x = grid, y = upr), preddata, color = "blue") +
    theme_bw()

Sum of squares and ANOVA table

Total sum of squares

What is the total sum of squares?

Solution

$\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2$


  • TSS can be used to estimate the variance of the marginal distribution of the response.

Sum of squares of the model

$\text{MSS} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \sum_{i=1}^n (\hat{g}(x_i) - \bar{y})^2$

  • It is a measure of the deviation between the predictions on the regression line and the marginal mean of the response.

  • Another interpretation: difference between two models

    • Estimated model: $\hat{g}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$
    • Estimated model without predictor (only intercept): $g(x) = \beta_0$; the estimate of $\beta_0$ will then be equal to $\bar{y}$.
  • MSS measures the size of the effect of the predictor

Sum of squares of the residuals

$\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \{y_i - \hat{g}(x_i)\}^2$

  • The smaller the RSS, the better the fit

  • Least squares method!

Sum of squares decomposition

We can show that TSS can be decomposed as $\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \text{RSS} + \text{MSS}$ (the cross term vanishes for the least-squares fit).

  • Total variability in the data (TSS) is partially explained by the predictor (MSS).
  • Variability that we cannot explain with the regression model is the residual variability (RSS).
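We can verify this decomposition numerically (a sketch, assuming brca and lm2 from above):

# sums of squares for lm2
y <- log2(brca$S100A8)
y_hat <- fitted(lm2)
TSS <- sum((y - mean(y))^2)
MSS <- sum((y_hat - mean(y))^2)
RSS <- sum((y - y_hat)^2)
c(TSS = TSS, "MSS + RSS" = MSS + RSS)  # both ~ 153.37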

Determination coefficient

$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = \frac{\text{MSS}}{\text{TSS}}$

  • Fraction of the total variability of the sample outcomes explained by the model.

  • A large $R^2$ indicates that the model has the potential to make good predictions (small RSS).

  • It is not very informative about the p-value of the test $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$.

    • The p-value is largely determined by RSS and the sample size n, but not by TSS.
    • $R^2$ is determined by RSS and TSS, but not by the sample size n.
  • A model with a low $R^2$ is still useful to study associations, as long as the association is modeled correctly!
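Using the sums of squares computed in the previous sketch, we can recover the $R^2$ reported by summary():

# determination coefficient, two equivalent ways
1 - RSS / TSS           # ~ 0.7942
MSS / TSS               # same value
summary(lm2)$r.squared  # matches the "Multiple R-squared" below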

Breast cancer example

summary(lm2)

Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.94279 -0.66537  0.08124  0.68468  1.92714 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,    Adjusted R-squared:  0.7874 
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12

(optional) F-Test in simple linear model

  • Sums of squares are the basis for F-tests: $F = \frac{\text{MMS}}{\text{RMS}}$

with $\text{MMS} = \frac{\text{MSS}}{1}$ and $\text{RMS} = \frac{\text{RSS}}{n-2}$.

  • MMS = model mean of squares,

  • RMS = residual mean of squares,

  • the denominators 1 and $n - 2$ are the degrees of freedom of MSS and RSS.

  • Under $H_0: \beta_1 = 0$: $F = \frac{\text{MMS}}{\text{RMS}} \sim F_{1, n-2}$

  • The F-test is always two-sided! For $H_1: \beta_1 \neq 0$: $p = P_{H_0}[F \geq f] = 1 - F_F(f; 1, n-2)$
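We can recompute the F-statistic and its p-value from the ANOVA sums of squares (a sketch, assuming lm2 from above; the anova(lm2) table is shown further below):

# F-test by hand
a <- anova(lm2)
f <- a["log2(ESR1)", "Mean Sq"] / a["Residuals", "Mean Sq"]  # MMS / RMS, ~ 115.8
pf(f, df1 = 1, df2 = df.residual(lm2), lower.tail = FALSE)   # ~ 8.07e-12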

summary(lm2)

Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.94279 -0.66537  0.08124  0.68468  1.92714 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,    Adjusted R-squared:  0.7874 
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12

Anova Table

               Df             Sum Sq   Mean Sq   F value       Pr(>F)
Regression     regression df  MSS      MMS       f-statistic   p-value
Error          error df       RSS      RMS
anova(lm2)
Analysis of Variance Table

Response: log2(S100A8)
           Df  Sum Sq Mean Sq F value   Pr(>F)    
log2(ESR1)  1 121.814 121.814   115.8 8.07e-12 ***
Residuals  30  31.559   1.052                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercises

Exercise 1

Bivariate data relating the number of classroom absences and the number of points earned prior to the final exam in the course were gathered from a random sample of 15 students.

Here is the table gathering the collected data:

absences points
1 552
9 210
6 315
4 454
0 619
5 378
13 108
2 534
5 398
3 462
0 593
1 645
0 602
7 315
10 307

Here is the scatter plot representing the data:

We also give the following quantities (where x = the number of absences and y = the number of points):

  • $\bar{x} = 4.4$
  • $\bar{y} = 432.8$
  • $\text{cov}(x,y) = -625.63$
  • $\text{sd}(x) = 4.01$
  • $\text{sd}(y) = 160.95$
  1. What is the correlation between the number of absences and the number of points earned by the students?

  2. From your answer to the previous question and from the scatter plot figures, what could be the link between the number of absences and the number of points earned by the students?

  3. How would you model this relationship?

  4. Can you compute the estimation of the parameters in this model?

  5. What assumptions are generally done in this context?

  6. How many points would a student with 7 absences probably earn in the course? How many points would a student with 20 absences probably earn? Do you need any of the previous assumptions to make these predictions?
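To check your answers, you can enter the data in R (a sketch; the object name ex1 is just an illustrative choice):

# exercise 1 data
ex1 <- tibble::tibble(
    absences = c(1, 9, 6, 4, 0, 5, 13, 2, 5, 3, 0, 1, 0, 7, 10),
    points   = c(552, 210, 315, 454, 619, 378, 108, 534, 398, 462, 593, 645, 602, 315, 307)
)
cor(ex1$absences, ex1$points)      # question 1
lm(points ~ absences, data = ex1)  # questions 3 and 4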

Exercise 2

Physicians feel that there is a relationship between a person’s age and the number of days they are sick that year. A random sample of 20 patients was obtained to this end comparing age and days sick.

Here is the table gathering the collected data:

age sickDays
12 0
19 2
35 8
4 10
68 21
42 16
21 18
24 24
32 29
75 15
14 3
8 7
25 12
39 17
56 15
18 11
72 31
64 26
45 20
87 30

We also give the following quantities (where x = the age and y = the number of sick days):

  • $\bar{x} = 38$
  • $\bar{y} = 15.75$
  • $\text{cov}(x,y) = 157.89$
  • $\text{sd}(x) = 24.77$
  • $\text{sd}(y) = 9.3$

We fit a linear model and we get the following results:

  • $\sum_{i=1}^n (y_i - \bar{y})^2 = 1643.75$
  • $\sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = 771.87$
  • $\sum_{i=1}^n (\hat{y}_i - y_i)^2 = 871.88$

Here is the scatter plot representing the data and we also represent the fitted linear model:

  1. What can you say about this situation? How is the linear model suitable to investigate the physicians' hypothesis?

  2. How was the linear model fitted?

  3. What can you say about this model?

  4. How many days in a year would you predict that a 60 year old person will be sick?
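As for exercise 1, you can enter the data in R to check your answers (a sketch; ex2 is an illustrative name):

# exercise 2 data
ex2 <- tibble::tibble(
    age      = c(12, 19, 35, 4, 68, 42, 21, 24, 32, 75, 14, 8, 25, 39, 56, 18, 72, 64, 45, 87),
    sickDays = c(0, 2, 8, 10, 21, 16, 18, 24, 29, 15, 3, 7, 12, 17, 15, 11, 31, 26, 20, 30)
)
lm_ex2 <- lm(sickDays ~ age, data = ex2)
summary(lm_ex2)                                  # questions 2 and 3
predict(lm_ex2, newdata = data.frame(age = 60))  # question 4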