Tutorial 5: Linear Regression

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This tutorial is directly adapted from Lieven Clement and Milan Malfait's tutorial on linear regression in their “Practical Statistics for the Life Sciences” course.

Requirements

We first load the required packages:

# required packages
library(tidyverse)

library(Rmisc)  # functions useful for data analysis and utility operations
library(GGally) # ggplot2 extension
library(performance)  # evaluate models

Note: we will only need the ggplot2, dplyr, readr, tidyr and tibble packages from the tidyverse meta-package. Thus, if you encounter issues when installing tidyverse, you can install only these five packages instead.
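If you prefer to install only what is needed, here is a minimal sketch of the corresponding installation commands (assuming none of these packages is installed yet):

# subset of the tidyverse used in this tutorial
install.packages(c("ggplot2", "dplyr", "readr", "tidyr", "tibble"))
# other packages loaded above
install.packages(c("Rmisc", "GGally", "performance"))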

Breast cancer dataset

This practical uses a subset of the data from a more general study [1].

[1] Christos Sotiriou, Pratyaksha Wirapati, Sherene Loi, Adrian Harris, Steve Fox, Johanna Smeds, Hans Nordgren, Pierre Farmer, Viviane Praz, Benjamin Haibe-Kains, Christine Desmedt, Denis Larsimont, Fatima Cardoso, Hans Peterse, Dimitry Nuyten, Marc Buyse, Marc J. Van de Vijver, Jonas Bergh, Martine Piccart, Mauro Delorenzi, Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis, JNCI: Journal of the National Cancer Institute, Volume 98, Issue 4, 15 February 2006, Pages 262–272, https://doi.org/10.1093/jnci/djj052

We consider 32 breast cancer patients with estrogen-receptor-positive tumors. They received tamoxifen chemotherapy.

Loading and introducing the data

The data are available in this repository and can be loaded with the following command:

brca <- read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/breastcancer.csv")
brca
# A tibble: 32 × 10
   sample_name filename     treatment    er grade  node  size   age  ESR1 S100A8
   <chr>       <chr>        <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 OXFT_209    gsm65344.ce… tamoxifen     1     3     1   2.5    66 1939.  207. 
 2 OXFT_1769   gsm65345.ce… tamoxifen     1     1     1   3.5    86 2752.   37.0
 3 OXFT_2093   gsm65347.ce… tamoxifen     1     1     1   2.2    74  379. 2364. 
 4 OXFT_1770   gsm65348.ce… tamoxifen     1     1     1   1.7    69 2532.   23.6
 5 OXFT_1342   gsm65350.ce… tamoxifen     1     3     0   2.5    62  141. 3219. 
 6 OXFT_2338   gsm65352.ce… tamoxifen     1     3     1   1.4    63 1495.  108. 
 7 OXFT_2341   gsm65353.ce… tamoxifen     1     1     1   3.3    76 3406.   14.0
 8 OXFT_1902   gsm65354.ce… tamoxifen     1     3     0   2.4    61 2813.   68.4
 9 OXFT_1982   gsm65355.ce… tamoxifen     1     1     0   1.7    62  950.   74.2
10 OXFT_5210   gsm65356.ce… tamoxifen     1     3     0   3.5    65 1053.  182. 
# ℹ 22 more rows

The recorded variables include:

  • grade: histological grade of tumor (grade 1 vs 3),
  • node: lymph node status (0: not affected, 1: lymph nodes affected and removed),
  • size: tumor size in cm,
  • ESR1 and S100A8: gene expression measurement for 2 genes in tumor biopsy (microarray technology)

Notes:

  • Gene ESR1 (estrogen receptor 1) is expressed in \(\pm\) 75% of breast cancer tumors.
  • Patients who express ESR1 respond better to hormone therapy treatment.
  • Tamoxifen interacts with ER and modulates its expression.
  • Proteins of the S100 family are low-molecular-weight proteins with calcium binding sites. Their expression is often dysregulated in cancer. The expression of S100A8 represses the immune system in the tumor and creates an environment of inflammation that promotes tumor growth.

Here is a representation of the distribution of the S100A8 gene expression:

brca %>% ggplot(aes(x = "", y = S100A8)) +
    geom_boxplot() +
    xlab("") + theme_bw()

What can you say about the observed values for S100A8 gene expression?

Solution

Apparently the measurement of the expression of gene S100A8 failed for 3 individuals. These individuals can be considered as outliers.


We remove the outliers:

brcaSubset <- brca %>% filter(S100A8 < 2000)

Here is the new boxplot representation of S100A8 gene expression after removing the outliers:

brcaSubset %>% ggplot(aes(x = "", y = S100A8)) +
    geom_boxplot() +
    xlab("") + theme_bw()

Association between ESR1 and S100A8 expression

In this practice we will study the association between the expression of the ESR1 and S100A8 genes:

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    theme_bw()

What kind of method are we going to use?

Solution

We want to investigate the possible relationship between two quantitative variables: this is a regression problem.


Regression

How can we formalize the regression framework?

Solution

We first start with a linear regression between two variables \((X_i, Y_i)\), measured on each subject \(i = 1, ..., n\).

For instance, we may consider the following setting:

  • Response Y : S100A8 expression
  • Predictor X: ESR1 expression

The regression model

What is the principle of the regression model?

Solution

The principle of regression (not necessarily linear) is to model the variations of the output variable \(Y\) (observations) by a contribution of fixed covariates \(X\) in the standard additive framework: \[\text{observation = signal + noise}\]

with mathematical notations: \[Y_i=g(X_i)+\epsilon_i\]

\(g(x)\) is called the regression function that models the expected outcome \(Y_i\) for subjects with \(X_i=x\)

\[E[Y_i|X_i=x]=g(x)\]

Hence, \(\epsilon_i\) is on average 0 for subjects with same \(X_i\): \[E[\epsilon_i|X_i]=0\]


Linear regression

What is the principle of the linear regression model?

Solution

To obtain an accurate and interpretable model, the simplest choice for \(g(x)\) is a linear function with unknown parameters.

\[\EE(Y|X=x)=\beta_0 + \beta_1 x\] with unknown intercept \(\beta_0\) and slope \(\beta_1\).

The linear model is based on an assumption about the form of the relationship between \(Y\) and \(X\), and on the i.i.d. Gaussian assumption for the error term. These assumptions must be checked afterwards.


Use

What are the use cases of the linear regression model?

Solution

  • Prediction: when \(Y\) is unknown but \(X\) is known we can predict \(Y\) using \[\EE(Y|X=x)=\beta_0 + \beta_1 x\]

  • Association: biological relation between variable \(X\) and response \(Y\)


What is the interpretation of the coefficients in the linear regression model, in particular when we draw the following regression line?

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) + 
    theme_bw()

Solution

Intercept: \(\EE(Y|X=0)=\beta_0\)

Slope: \[\begin{eqnarray*} \EE(Y|X=x+\delta)-\EE(Y|X=x)&=&\beta_0 + \beta_1 (x+\delta) -\beta_0-\beta_1 x\\ &=& \beta_1\delta \end{eqnarray*}\]

\(\beta_1\) quantifies the strength of the linear association between \(Y\) and \(X\).


Parameter estimation

Least squares

Parameters \(\beta_0\) and \(\beta_1\) are unknown.

They are estimated using all the data points by the least-squares method.

How does the least-squares method work?

Solution

To determine the best fitting line:

  • The point on the regression line for a given \(x_i\), \((x_i, \beta_0 + \beta_1 x_i)\), should be as close as possible to \((x_i, y_i)\)
  • Choose \(\beta_0\) and \(\beta_1\) so that the sum of squared distances between predicted and observed points becomes as small as possible, i.e. the sum of squared errors (SSE): \[SSE=\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2=\sum_{i=1}^n e_i^2\] with residuals \(e_i\) the vertical distances from the observations to the fitted regression line

On the following graph, the blue line is the linear regression line estimated by the least-squares method, and in red are the corresponding observed values \(\hat{e}_i\) of the residuals in the fitted model.


Estimators that minimise SSE

What are the values of \(\beta_0\) and \(\beta_1\) that minimize the SSE?

Solution

\[\hat{\beta}_1= \frac{\sum_{i=1}^n (y_i-\bar y)(x_i-\bar x)}{\sum_{i=1}^n (x_i-\bar x)^2}=\frac{\mbox{cor}(x,y)s_y}{s_x} \]

\[\hat{\beta}_0=\bar y - \hat{\beta}_1 \bar x \]

Note: the slope of the least squares fit is proportional to the correlation between the response and the predictor.
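As an illustration, here is a minimal sketch (assuming the brcaSubset data defined above) that computes these estimates by hand; it can be compared with the output of lm() further below:

# least-squares estimates computed from the formulas above
x <- brcaSubset$ESR1
y <- brcaSubset$S100A8
beta1_hat <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(intercept = beta0_hat, slope = beta1_hat)
# the slope can equivalently be computed as cor(x, y) * sd(y) / sd(x)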


How can we use the fitted model (i.e. the estimates of \(\beta_0\) and \(\beta_1\))?

Solution

The fitted model allows us to:

  • predict the response for subjects with a given value \(x\) for the predictor: \[\EE [ Y | X = x]=\hat{\beta}_0+\hat{\beta}_1x\]

  • assess how the mean response differs between two groups of subjects that differ by \(\delta\) units in the predictor: \[\EE\left[Y|X=x+\delta\right]-\EE\left[Y|X=x\right]= \hat{\beta}_1\delta\]


Breast cancer example

Here is the linear regression model fitted in R:

lm1 <- lm(S100A8 ~ ESR1, data = brcaSubset)
summary(lm1)

Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Residuals:
   Min     1Q Median     3Q    Max 
-95.43 -34.81  -6.79  34.23 145.21 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 208.47145   28.57207   7.296 7.56e-08 ***
ESR1         -0.05926    0.01212  -4.891 4.08e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.91 on 27 degrees of freedom
Multiple R-squared:  0.4698,    Adjusted R-squared:  0.4502 
F-statistic: 23.93 on 1 and 27 DF,  p-value: 4.078e-05

What is the equation of the linear regression line?

Solution

\[\EE(Y|X=x)=208.47-0.059 x\]


How do we interpret the evolution of the S100A8 gene expression as a function of the ESR1 expression?

Solution

The expected expression of S100A8 is on average 59 units lower for patients with an ESR1 expression level that is 1000 units higher.


What would be the predicted S100A8 expression level for patients with an ESR1 expression level of 2000? and an ESR1 expression level of 4000?

Solution

Expected S100A8 expression level for patients with an ESR1 expression level of 2000: \[208.47-0.059\times 2000=89.94\]

Expected S100A8 expression level for patients with an ESR1 expression level of 4000: \[208.47-0.059\times 4000=-28.58\] (!!!)
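These hand computations can be checked with the fitted model, for instance with the following short sketch (values may differ slightly because of the rounding of the coefficients above):

# predictions from the fitted model lm1 for ESR1 = 2000 and 4000
predict(lm1, newdata = data.frame(ESR1 = c(2000, 4000)))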


Reminder: here is a graphical representation of the data and the fitted linear regression line:

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) + 
    theme_bw()

What is the problem with the predicted S100A8 expression level for patients with an ESR1 expression level of 4000?

Solution

Warning when you extrapolate! The assumption of linearity can only be assessed within the range of the observed data, and outside this range the linear regression model may predict unrealistic values for the response (here a negative expression level).

Note:

  • interpolation: prediction within the range of the available and observed data
  • extrapolation: prediction outside the range of the available and observed data

Statistical inference

To draw conclusions based on the regression model \(\EE(Y|X)=\beta_0+\beta_1 X\), we need to know

  • how the least squares parameter estimators vary from sample to sample?
  • how they deviate under the null hypothesis that there is no association between predictor and response?

How could we proceed?

Solution

This requires a statistical model.

We need to model the distribution of \(Y\) given \(X\) explicitly (conditional distribution).


Modeling the distribution of Y?

For the moment, we made the following assumptions:

  1. There is a linear relationship between the response and the predictor.

Besides linearity we need additional assumptions!

What other assumptions are we going to make?

Solution

  2. Independence: observations \((X_1,Y_1), ..., (X_n,Y_n)\) are made on n independent subjects (this is required to estimate the variance)

  3. Homoscedasticity or equal variances: observations vary with equal variance around the regression line

  • Residuals \(\epsilon_i\) have equal variance for each individual \(i\) (and thus for any \(X_i=x\))
  • \(\text{var}(Y\vert X=x) = \sigma^2\) for each \(X=x\)
  • \(\sigma\) is referred to as the residual standard deviation
  4. Normality: the residuals \(\epsilon_i\) are normally distributed with mean 0

What can you derive from assumptions 2, 3 and 4?

Solution

\[\epsilon_i \text{ i.i.d. } \NN(0,\sigma^2)\]


And from assumptions 1, 2, 3, 4?

Solution

  • The conditional distribution on the response is Gaussian: \[Y_i\vert X_i\sim \NN(\beta_0+\beta_1 X_i,\sigma^2)\]

  • Using these assumptions we can derive the variance of the estimators: \[\sigma^2_{\hat{\beta}_0}=\frac{\sum_{i=1}^n X^2_i}{\sum_{i=1}^n (X_i-\bar X)^2} \times\frac{\sigma^2}{n} \text{ and } \sigma^2_{\hat{\beta}_1}=\frac{\sigma^2}{\sum_{i=1}^n (X_i-\bar X)^2}\]

  • Using the Gaussian assumption we can show that the parameter estimators are normally distributed \[\hat\beta_0 \sim \NN\left(\beta_0,\sigma^2_{\hat \beta_0}\right) \text{ and } \hat\beta_1 \sim \NN\left(\beta_1,\sigma^2_{\hat \beta_1}\right)\]
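To make this concrete, here is a small simulation sketch (with hypothetical values for \(\beta_0\), \(\beta_1\) and \(\sigma\)) showing that the empirical standard deviation of \(\hat{\beta}_1\) over repeated samples matches the theoretical formula:

# simulate the sampling distribution of the slope estimator (hypothetical parameters)
set.seed(1)
n <- 30; beta0 <- 1; beta1 <- 2; sigma <- 3
x <- runif(n, 0, 10)                       # fixed design
b1_hat <- replicate(1000, {
    y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
    coef(lm(y ~ x))[2]
})
sd(b1_hat)                                 # empirical standard deviation of the estimates
sigma / sqrt(sum((x - mean(x))^2))         # theoretical standard deviation of beta1-hat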


High spread of \(X\) improves the precision (super important)

We remind the value of the variance for the estimator of \(\beta_1\): \[\sigma^2_{\hat{\beta}_1}=\frac{\sigma^2}{\sum_{i=1}^n (X_i-\bar X)^2}\]

What can you say about the two regression examples in the next plot regarding the variance of the estimate \(\hat{\beta}_1\)?

Solution

In the first case, the observations for X are widely spread, so the sum of squared deviations from the empirical mean \(\sum_{i=1}^n (X_i-\bar X)^2\) is larger and the variance of \(\hat{\beta}_1\) is lower, hence a better precision when estimating \(\beta_1\).

In the second case, the observations for X are less widely spread, so the sum of squared deviations from the empirical mean \(\sum_{i=1}^n (X_i-\bar X)^2\) is smaller and the variance of \(\hat{\beta}_1\) is higher, hence a worse precision when estimating \(\beta_1\).


The conditional variance \(\sigma^2\) is unknown. How can we estimate it?

Solution

  • The \(\hat\sigma^2\) estimate of \(\sigma^2\) uses the mean squared error (MSE)

  • The MSE is the average squared error: \[\text{MSE} = \frac{1}{n} \sum_{i=1}^n \left(y_i-\hat\beta_0-\hat\beta_1\times x_i\right)^2 = \frac{1}{n} \sum_{i=1}^n e^2_i\]

  • The estimate of \(\sigma^2\) is given by: \[\hat\sigma^2 = \frac{n}{n-2} \text{MSE} = \frac{1}{n-2} \sum_{i=1}^n \left(y_i-\hat\beta_0-\hat\beta_1\times x_i\right)^2 = \frac{1}{n-2} \sum_{i=1}^n e^2_i\]

  • This estimator is based on independence (assumption 2) and equality of the variance (assumption 3).

  • Note the division by \(n-2\)


Upon the estimation of \(\sigma^2\), what are the standard errors (i.e. estimates of the standard deviations) of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)?

Solution

Upon the estimation of \(\sigma^2\) we obtain the following standard errors:

\[\text{SE}_{\hat{\beta}_0} = \hat\sigma_{\hat{\beta}_0} = \hat{\sigma} \sqrt{\frac{\sum_{i=1}^n X^2_i}{n\sum_{i=1}^n (X_i-\bar X)^2}}\]

\[\text{SE}_{\hat{\beta}_1}=\hat\sigma_{\hat{\beta}_1} = \hat{\sigma} \sqrt{ \frac{1}{\sum_{i=1}^n (X_i-\bar X)^2}}\]

Reminder: \(\hat{\sigma}^2 = \frac{n}{n-2} \text{MSE}\)
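As a sanity check, the following sketch (assuming lm1 and brcaSubset as above) computes these standard errors by hand; they can be compared with the "Std. Error" column of summary(lm1):

# standard errors of the intercept and slope computed from the formulas above
x <- brcaSubset$ESR1
n <- length(x)
sigma_hat <- sqrt(sum(residuals(lm1)^2) / (n - 2))   # residual standard deviation
se_beta1 <- sigma_hat / sqrt(sum((x - mean(x))^2))
se_beta0 <- sigma_hat * sqrt(sum(x^2) / (n * sum((x - mean(x))^2)))
c(SE_intercept = se_beta0, SE_slope = se_beta1)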


How can we use the previous results regarding the distribution of the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to build confidence intervals and tests?

Solution

  • Thanks to the distribution of the parameter estimators, we can construct confidence intervals and tests using \[T=\frac{\hat{\beta}_k-\beta_k}{SE(\hat{\beta}_k)} \text{ with } k=0,1\]

  • If all assumptions are valid, \(T\) follows a Student \(\TT(n-2)\) distribution with \(n-2\) degrees of freedom

  • If normality does not hold, but independence, linearity and equal variances do and the dataset is large \[\rightarrow \text{Central Limit Theorem}\]


Breast cancer example

Reminder:

lm1 <- lm(S100A8 ~ ESR1, data = brcaSubset)
lm1

Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Coefficients:
(Intercept)         ESR1  
  208.47145     -0.05926  

\(\rightarrow\) The association between S100A8 and ESR1 gene expression is negative.

How can we generalize the effect observed in the sample to the population using a confidence interval on \(\beta_1\)?

Solution

\[[\hat\beta_1 - t_{n-2,\alpha/2} \text{SE}_{\hat\beta_1},\hat\beta_1 + t_{n-2,\alpha/2} \text{SE}_{\hat\beta_1}]\].

confint(lm1)
                   2.5 %       97.5 %
(Intercept) 149.84639096 267.09649989
ESR1         -0.08412397  -0.03440378
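The same interval for the slope can be reconstructed by hand from the formula above; a minimal sketch:

# 95% CI for the slope: estimate ± t-quantile × standard error
cf <- summary(lm1)$coefficients
cf["ESR1", "Estimate"] + c(-1, 1) * qt(0.975, df = df.residual(lm1)) * cf["ESR1", "Std. Error"]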

Hypothesis test

How could we statistically test the significance of the effect of ESR1 gene expression on S100A8 gene expression?

Solution

We can restate the hypothesis of an association between the S100A8 and ESR1 gene expression in the form of a null hypothesis on the parameters of the model.

  • Under the null hypothesis of the absence of an association in the expression of both genes: \[H_0: \beta_1=0\]
  • Under the alternative hypothesis, there is an association between the expression of both genes: \[H_1: \beta_1\neq0\]

Test statistic: \[T=\frac{\hat{\beta}_1-0}{SE(\hat{\beta}_1)}\]

Under \(H_0\) the statistic follows a t-distribution with \(n-2\) degrees of freedom.
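For illustration, here is a short sketch computing this test statistic and the two-sided p-value by hand from the fitted model; the values should match the ESR1 row of summary(lm1) below:

# t-statistic and two-sided p-value for H0: beta1 = 0
cf <- summary(lm1)$coefficients
t_stat <- cf["ESR1", "Estimate"] / cf["ESR1", "Std. Error"]
p_val <- 2 * pt(abs(t_stat), df = df.residual(lm1), lower.tail = FALSE)
c(t = t_stat, p = p_val)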


Breast cancer example

summary(lm1)

Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Residuals:
   Min     1Q Median     3Q    Max 
-95.43 -34.81  -6.79  34.23 145.21 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 208.47145   28.57207   7.296 7.56e-08 ***
ESR1         -0.05926    0.01212  -4.891 4.08e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.91 on 27 degrees of freedom
Multiple R-squared:  0.4698,    Adjusted R-squared:  0.4502 
F-statistic: 23.93 on 1 and 27 DF,  p-value: 4.078e-05

What can you say about association between the S100A8 and ESR1 gene expression?

Solution

The association between the S100A8 and ESR1 expression is extremely significant (\(p\ll 0.001\))


What did we forget to check?

Solution

  • But, we first have to check all assumptions!
  • Otherwise the conclusions based on the statistical test and the CI can be incorrect.

Assess assumptions

What assumptions should we check?

Solution

  • Independence: design
  • Linearity: inference is useless if the association is not linear
  • Homoscedasticity: inference/p-value is incorrect if data are heteroscedastic
  • Normality: inference/p-value is incorrect if data are not normally distributed in small samples

Linearity

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    theme_bw()

Residual analysis

Assumption of linearity is typically assessed using a residual plot:

  • predictions \(\hat{y}_i = \hat\beta_0+\hat\beta_1 x_i\) on the \(X\)-axis
  • residuals \(\hat{e}_i=y_i-\hat{g}(x_i)=y_i-\hat\beta_0-\hat\beta_1\times x_i\) on the \(Y\)-axis
plot(lm1, which = 1)

With the residual plot, we can also check the assumption that the average of the residuals is 0.

Homoskedasticity (equal variances)

Residuals and squared residuals carry information on the residual variability

  • Association with predictors \(\rightarrow\) indication of heteroskedasticity
  • Scatterplot of \(\hat{e}_i\) vs \(x_i\) or predictions \(\hat y_i = \hat \beta_0+ \hat \beta_1 x_i\)
  • Scatterplot of standardized residual versus \(x_i\) or predictions
check_heteroskedasticity(lm1)
Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.009).

(caution with small sample size)

Normality

  • If the sample size is large the estimators are normally distributed even if the observations are not normally distributed: central limit theorem
  • How many observations? \(\rightarrow\) depends on shape and magnitude of deviations
  • Assumption: Data are Normally distributed conditional on X: \[Y_i\vert X_i\sim \NN(\beta_0+\beta_1X_i,\sigma^2)\]
  • QQ-plot of the residuals \(\hat{e}_i\)

Example of non-Gaussian data and approximately Gaussian residuals:

QQ-plot for our model:

plot(lm1, which = 2)

check_normality(lm1)
OK: residuals appear as normally distributed (p = 0.212).

(caution with small sample size)

Model evaluation

You can use the default R plot() function to get a visual summary of your model residuals:

plot(lm1)

or you can use the check_model() function from the performance package:

check_model(lm1)

Invalid assumptions

If the assumptions regarding the model are not verified, what can we do?

Solution

We can apply a transformation to the data.

Transforming the predictor does not change the distribution of Y for a given X:

  • not useful to obtain homoscedasticity or Normal distribution
  • useful for linearity when normality and homoscedasticity are valid
  • often inclusion of higher order terms: \(X^2\), \(X^3\), … \[Y_i=\beta_0+\beta_1X_i+\beta_2X_i^2+ ... + \epsilon_i\]

Transforming the response Y can be useful to obtain normality and homoscedasticity:

  • \(\sqrt{Y}\)
  • \(\log(Y)\)
  • \(1/Y\)
  • etc.
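As an illustration of both strategies, here is a minimal sketch (the model names lm_poly and lm_log are hypothetical and not used later in this tutorial):

# transforming the predictor: add a quadratic term for non-linearity
lm_poly <- lm(S100A8 ~ ESR1 + I(ESR1^2), data = brcaSubset)
# transforming the response: log2 of the outcome
lm_log <- lm(log2(S100A8) ~ ESR1, data = brcaSubset)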

Breast cancer example

What are the issues with the linear regression model in the case of our breast cancer data?

Solution

Problems with

  • heteroscedasticity
  • possibly deviations from normality (skewed to the right)
  • negative concentration predictions are theoretically impossible
  • non-linearity

This is often the case for concentration and intensity measurements

  • These are often log-normal distributed (normal distribution upon log-transform)
  • We also observed a kind of exponential relation with the smoother
  • In gene expression literature often \(\log_2\) transformation is adopted
  • gene-expression on log scale: differences on log scale are fold changes on original scale!

We now go back to the original data, including the outliers that we removed earlier.

Here a scatterplot of the original data:

brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth() +
    theme_bw()

And here a scatterplot of the \(\log_2\) transformed data:

brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
    geom_point() +
    geom_smooth() +
    theme_bw()

Note: the blue curve represents a non-parametric smoother (the geom_smooth() default) with its confidence band.

What can you say?

Solution

The linear relationship between S100A8 and ESR1 gene expression is not clear at all when using the original measurements, but appears more clearly when using the \(\log_2\)-transformed measurements.


Here is the linear regression model for the transformed gene expression:

lm2 <- lm(log2(S100A8) ~ log2(ESR1), data = brca)
check_model(lm2)

check_normality(lm2)
OK: residuals appear as normally distributed (p = 0.888).
check_heteroskedasticity(lm2)
OK: Error variance appears to be homoscedastic (p = 0.430).
summary(lm2)

Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.94279 -0.66537  0.08124  0.68468  1.92714 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,    Adjusted R-squared:  0.7874 
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12
confint(lm2)
                2.5 %    97.5 %
(Intercept) 20.128645 26.674023
log2(ESR1)  -1.921047 -1.308185

What can you say about it?

Solution

The linearity, normality and homoscedasticity assumptions now seem to be satisfied (at least we cannot significantly reject them).


Exploration of the model with log2 transformed data

How do we interpret the slope in the model with \(\log_2\)-transformed data?

Solution

A patient with an ESR1 expression that is one unit on \(\log_2\) scale higher than that of another patient on average has a \(\log_2\) expression for S100A8 that is 1.61 units lower (95% CI [-1.92,-1.31]).

We note \((x_1,y_1)\) the respective gene expressions for ESR1 and S100A8 in patient 1, and similarly \((x_2,y_2)\) for patient 2.

And we assume that \(\log_2(x_2) = \log_2(x_1) + 1\), i.e. \(x_2 = 2 x_1\).

\[\log_2(\hat{y}_1) = 23.401 -1.615 \times \log_2(x_1)\]

\[\log_2(\hat{y}_2) = 23.401 -1.615 \times \log_2(x_2)\] \[\log_2\left(\frac{\hat{y}_2}{\hat{y}_1}\right) = \log_2(\hat{y}_2) - \log_2(\hat{y}_1) = -1.615 \left(\log_2(x_2) - \log_2(x_1)\right) = -1.615 \times 1 = -1.615\]
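In other words, on the original scale a doubling of the ESR1 expression corresponds to multiplying the predicted S100A8 expression by \(2^{-1.615}\approx 0.33\); a one-line sketch:

2^coef(lm2)["log2(ESR1)"]   # fold change in predicted S100A8 per doubling of ESR1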


Inference on the mean outcome

A regression model can also be used for prediction, and in particular for the inference of the average outcome (i.e. the mean response) for a given value \(X=x\): \[\hat{g}(x)= \hat{\beta}_0 + \hat{\beta}_1 x\]

  • \(\hat{g}(x)\) is an estimator of the conditional mean \(\EE[Y\vert X=x]\)
  • Parameter estimators are Normally distributed and unbiased \(\rightarrow\) estimator \(\hat{g}(x)\) is also Normally distributed and unbiased.

\[\text{SE}_{\hat{g}(x)} = \hat{\sigma} \sqrt{\frac{1}{n}+\frac{(x-\bar X)^2}{\sum_{i=1}^n (X_i-\bar X)^2}}\]

\[T=\frac{\hat{g}(x)-g(x)}{SE_{\hat{g}(x)}}\sim \TT_{n-2}\]

We can use the predict() function in R to compute the inferred average outcome for the response and the related confidence intervals.

  • newdata argument: predictor values (x-values) at which we want to calculate the mean response
  • interval="confidence" argument to obtain CI.
  • without newdata argument we perform predictions for all predictor values in the dataset used to fit the model
grid <- 140:4000
g <- predict(lm2, newdata = data.frame(ESR1 = grid), interval = "confidence")
head(g)
       fit      lwr      upr
1 11.89028 10.76082 13.01974
2 11.87370 10.74721 13.00019
3 11.85724 10.73370 12.98078
4 11.84089 10.72028 12.96151
5 11.82466 10.70696 12.94237
6 11.80854 10.69372 12.92336

Note: we do not have to transform the new data that we specified for the ESR1 expression because we fitted the model with a call to the lm() function and specified the transformation within the formula.

We represent the inferred mean outcome (in red) with the confidence interval upper and lower bounds (in grey) in the \(\log_2\)-transformed data space:

newdata <- data.frame(cbind(grid = log2(grid), g))
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata, col = "red") +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    theme_bw()

Back-transformation

We can back-transform the inferred mean outcome from the model on the \(\log_2\)-transformed data using the inverse transformation (\(z \mapsto 2^z\)), and represent the inferred mean outcome (in red) with the confidence interval upper and lower bounds (in grey) in the original data space:

newdata <- data.frame(cbind(grid, 2^g))
brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata, col = "red") +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    theme_bw()

Prediction-intervals

We can also make a prediction for the location of a new observation that would be collected in a new experiment for a patient with a particular value for their ESR1 expression.

It is important to notice that this experiment still has to be conducted. So we want to predict the non-observed individual expression value for a novel patient.

For a novel independent observation \(Y^*\): \[Y^* = g(x) + \epsilon^*\] with \(\epsilon^*\sim \NN(0,\sigma^2)\) and \(\epsilon^*\) independent of the observations in the sample \(Y_1,\ldots, Y_n\).

We predict a new log-S100A8 for a patient with a known log2-ESR1 expression level x: \[\hat{y}(x)=\hat{\beta}_0+\hat{\beta}_1 \times x\]

What is the relationship between the mean outcome and the predicted value for the response?

Solution

The estimated mean outcome and prediction for a new observation are equal.

But their sampling distributions are different!

  • Uncertainty on the estimated mean outcome \(\leftarrow\) uncertainty on the estimated model parameters \(\hat\beta_0\) and \(\hat\beta_1\).
  • Uncertainty on new observation \(\hat Y^*\) \(\leftarrow\) uncertainty on estimated mean and additional uncertainty because the new observation will deviate around the mean!

\[\text{SE}_{\hat{Y}(x)} = \sqrt{\hat\sigma^2+\hat\sigma^2_{\hat{g}(x)}} = \hat{\sigma} \sqrt{1+\frac{1}{n}+\frac{(x-\bar X)^2}{\sum_{i=1}^n (X_i-\bar X)^2}}\]

\[\frac{\hat{Y}(x)-Y}{\text{SE}_{\hat{Y}(x)}}\sim \TT _{n-2}\]

  • Note that a prediction interval (PI) is an improved version of a reference interval when the model parameters are unknown: it accounts for the uncertainty on the model parameters and uses the t-distribution.

pred <- predict(lm2, newdata = data.frame(ESR1 = grid), interval = "prediction")
head(pred)
       fit      lwr      upr
1 11.89028 9.510524 14.27004
2 11.87370 9.495354 14.25205
3 11.85724 9.480288 14.23419
4 11.84089 9.465324 14.21646
5 11.82466 9.450461 14.19886
6 11.80854 9.435698 14.18138

We represent the fitted values (black line) with the prediction interval upper and lower bounds (in blue) and the confidence interval upper and lower bounds for the mean outcome (in grey) in the \(\log_2\)-transformed data space:

newdata <- data.frame(cbind(grid = log2(grid), g))
preddata <- data.frame(cbind(grid = log2(grid), pred))
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata) +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = lwr), preddata, color = "blue") +
    geom_line(aes(x = grid, y = upr), preddata, color = "blue") +
    theme_bw()

We can back-transform the predicted values from the model on the \(\log_2\)-transformed data using the inverse transformation (\(z \mapsto 2^z\)), and represent the fitted values (black line) with the prediction interval upper and lower bounds (in blue) and the confidence interval upper and lower bounds for the mean outcome (in grey) in the original data space:

newdata <- data.frame(cbind(grid, 2^g))
preddata <- data.frame(cbind(grid, 2^pred))
brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata) +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = lwr), preddata, color = "blue") +
    geom_line(aes(x = grid, y = upr), preddata, color = "blue") +
    theme_bw()

Sum of squares and Anova-table

Total sum of squares

What is the total sum of squares?

Solution

\[\text{SSTot} = \sum_{i=1}^n (Y_i-\bar{Y})^2\]


  • SSTot can be used to estimate the variance of the marginal distribution of the response.

  • In this chapter we focused on the conditional distribution \(f(Y\vert X=x)\).

  • We know that MSE is a good estimate of the variance of the conditional distribution of \(Y\vert X=x\).

Sum of squares of the regression SSR

\[\text{SSR} = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2 = \sum_{i=1}^n (\hat{g}(x_i) - \bar{Y})^2.\]

  • It is a measure of the deviation between the predictions on the regression line and the marginal mean of the response.

  • Another interpretation: difference between two models

    • Estimated model \(\hat{g}(x)=\hat\beta_0+\hat\beta_1x\)
    • Estimated model without predictor (only intercept): \(g(x)=\beta_0\) \(\rightarrow\) \(\hat\beta_0\) will be equal to \(\bar{Y}\).
  • SSR measures the size of the effect of the predictor

Sum of Squares of the Error

\[\text{SSE} = \sum_{i=1}^n (Y_i-\hat{Y}_i )^2 = \sum_{i=1}^n \left\{Y_i-\hat{g}\left(x_i\right)\right\}^2\]

  • The smaller SSE the better the fit

  • Least squares method!

Sum of squares decomposition

We can show that SSTot can be decomposed as \[\begin{eqnarray*} \text{SSTot} &=& \sum_{i=1}^n (Y_i-\bar{Y})^2 \\ &=& \sum_{i=1}^n (Y_i-\hat{Y}_i+\hat{Y}_i-\bar{Y})^2 \\ &=& \sum_{i=1}^n (Y_i-\hat{Y}_i)^2+\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2 \\ &=& \text{SSE }+\text{SSR} \end{eqnarray*}\]

  • Total variability in the data (SSTot) is partially explained by the predictor (SSR).
  • Variability that we cannot explain with the regression model is the residual variability (SSE).

Determination coefficient

\[ R^2 = 1-\frac{\text{SSE}}{\text{SSTot}}=\frac{\text{SSR}}{\text{SSTot}}.\]

  • Fraction of total variability of the sample outcomes explained by the model.

  • Large \(R^2\) indicates that the model has the potential to make good predictions (small SSE).

  • Not very indicative for p-value of the test \(H_0:\beta_1=0\) vs \(H_1:\beta_1\neq0\).

    • p-value is largely determined by SSE and sample size \(n\), but not by SSTot.
    • \(R^2\) is determined by SSE and SSTot but not by sample size \(n\).
  • Model with low \(R^2\) is still useful to study associations as long as the association is modelled correctly!
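A small sketch (using lm2 and the brca data from above) verifying the decomposition and the \(R^2\) reported by summary(lm2):

# sum-of-squares decomposition and coefficient of determination, by hand
y <- log2(brca$S100A8)
sstot <- sum((y - mean(y))^2)
sse <- sum(residuals(lm2)^2)
ssr <- sum((fitted(lm2) - mean(y))^2)
c(SSTot = sstot, SSR = ssr, SSE = sse, R2 = ssr / sstot)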

Breast cancer example

summary(lm2)

Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.94279 -0.66537  0.08124  0.68468  1.92714 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,    Adjusted R-squared:  0.7874 
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12

F-Test in simple linear model

  • Sum of squares are the bases for \(F\)-tests \[ F = \frac{\text{MSR}}{\text{MSE}}\]

with \(\text{MSR} = \frac{\text{SSR}}{1}\) and \(\text{MSE} = \frac{\text{SSE}}{n-2} = \hat\sigma^2\) (here each sum of squares is divided by its degrees of freedom).

  • MSR is the mean sum of squares of the regression,

  • the denominators 1 and \(n-2\) are the degrees of freedom of SSR and SSE.

  • Under \(H_0: \beta_1=0\): \[F = \frac{\text{MSR}}{\text{MSE}} \sim F_{1,n-2},\]

  • The F-test is always two-sided! With \(H_1:\beta_1\neq 0\): \[ p = P_0\left[F\geq f\right]=1-F_F(f;1,n-2)\]
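A brief sketch computing the F-statistic and its p-value by hand for lm2; the results should match the summary and anova outputs below:

# F-test of H0: beta1 = 0, computed from the mean sums of squares
y <- log2(brca$S100A8)
msr <- sum((fitted(lm2) - mean(y))^2) / 1
mse <- sum(residuals(lm2)^2) / df.residual(lm2)
f <- msr / mse
c(F = f, p = pf(f, df1 = 1, df2 = df.residual(lm2), lower.tail = FALSE))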

summary(lm2)

Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.94279 -0.66537  0.08124  0.68468  1.92714 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,    Adjusted R-squared:  0.7874 
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12

Anova Table

             Df                       Sum Sq   Mean Sq   F value       Pr(>F)
Regression   degrees of freedom SSR   SSR      MSR       f-statistic   p-value
Error        degrees of freedom SSE   SSE      MSE
anova(lm2)
Analysis of Variance Table

Response: log2(S100A8)
           Df  Sum Sq Mean Sq F value   Pr(>F)    
log2(ESR1)  1 121.814 121.814   115.8 8.07e-12 ***
Residuals  30  31.559   1.052                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1