Tutorial 6: Linear Regression

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This tutorial is directly adapted from Lieven Clement and Milan Malfait's tutorial on linear regression in their “Practical Statistics for the Life Sciences” course.

Requirements

We first load the required packages:

# required packages
library(tidyverse)

library(Rmisc)  # functions useful for data analysis and utility operations
library(GGally) # ggplot2 extension
library(performance)  # evaluate models

Note: we will need the ggplot2, dplyr, readr, tidyr and tibble packages from the tidyverse meta-package. Thus, if you encounter issues when installing tidyverse, you can install only these five packages.

Breast cancer dataset

This practical uses a subset of data from a more general study [1].

[1] Christos Sotiriou, Pratyaksha Wirapati, Sherene Loi, Adrian Harris, Steve Fox, Johanna Smeds, Hans Nordgren, Pierre Farmer, Viviane Praz, Benjamin Haibe-Kains, Christine Desmedt, Denis Larsimont, Fatima Cardoso, Hans Peterse, Dimitry Nuyten, Marc Buyse, Marc J. Van de Vijver, Jonas Bergh, Martine Piccart, Mauro Delorenzi, Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis, JNCI: Journal of the National Cancer Institute, Volume 98, Issue 4, 15 February 2006, Pages 262–272, https://doi.org/10.1093/jnci/djj052

We consider 32 breast cancer patients with estrogen-receptor-positive tumors. They received tamoxifen chemotherapy.

Loading and introducing the data

The data are available in this repository and can be loaded with the following command:

brca <- read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/breastcancer.csv")
brca
# A tibble: 32 × 10
   sample_name filename     treatment    er grade  node  size   age  ESR1 S100A8
   <chr>       <chr>        <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 OXFT_209    gsm65344.ce… tamoxifen     1     3     1   2.5    66 1939.  207. 
 2 OXFT_1769   gsm65345.ce… tamoxifen     1     1     1   3.5    86 2752.   37.0
 3 OXFT_2093   gsm65347.ce… tamoxifen     1     1     1   2.2    74  379. 2364. 
 4 OXFT_1770   gsm65348.ce… tamoxifen     1     1     1   1.7    69 2532.   23.6
 5 OXFT_1342   gsm65350.ce… tamoxifen     1     3     0   2.5    62  141. 3219. 
 6 OXFT_2338   gsm65352.ce… tamoxifen     1     3     1   1.4    63 1495.  108. 
 7 OXFT_2341   gsm65353.ce… tamoxifen     1     1     1   3.3    76 3406.   14.0
 8 OXFT_1902   gsm65354.ce… tamoxifen     1     3     0   2.4    61 2813.   68.4
 9 OXFT_1982   gsm65355.ce… tamoxifen     1     1     0   1.7    62  950.   74.2
10 OXFT_5210   gsm65356.ce… tamoxifen     1     3     0   3.5    65 1053.  182. 
# ℹ 22 more rows

The recorded variables include:

  • grade: histological grade of tumor (grade 1 vs 3),
  • node: lymph node status (0: not affected, 1: lymph nodes affected and removed),
  • size: tumor size in cm,
  • ESR1 and S100A8: gene expression measurement for 2 genes in tumor biopsy (microarray technology)

Notes:

  • Gene ESR1 (estrogen receptor 1) is expressed in ± 75% of breast cancer tumors.
  • Patients who express ESR1 respond better to treatment by hormone therapy.
  • Tamoxifen interacts with ER and modulates its expression.
  • Proteins of the S100 family are low-molecular-weight proteins with calcium-binding sites. Their expression is often dysregulated in cancer. The expression of S100A8 represses the immune system in the tumor and creates an environment of inflammation that promotes tumor growth.

Here is a representation of the distribution of the S100A8 gene expression:

brca %>% ggplot(aes(x = "", y = S100A8)) +
    geom_boxplot() +
    xlab("") + theme_bw()

What can you say about the observed values for S100A8 gene expression?

Solution

The measurement of the expression of gene S100A8 apparently failed for 3 individuals. These individuals can be considered outliers.


We remove the outliers:

brcaSubset <- brca %>% filter(S100A8 < 2000)

Here is the new boxplot representation of S100A8 gene expression after removing the outliers:

brcaSubset %>% ggplot(aes(x = "", y = S100A8)) +
    geom_boxplot() +
    xlab("") + theme_bw()

Association between ESR1 and S100A8 expression

We will study the possible relationship between the expression of the ESR1 and S100A8 genes:

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    theme_bw()

What kind of method are we going to use?

Solution

We want to investigate the possible relationship between two quantitative variables; this is a regression problem.


Regression

How do we formalize the regression framework?

Solution

We start with a linear regression between two variables $(X_i, Y_i)$, measured on each subject $i = 1, \ldots, n$.

For instance, we may consider the following setting:

  • Response Y: S100A8 expression
  • Predictor X: ESR1 expression

The regression model

What is the principle of the regression model?

Solution

The principle of regression (not necessarily linear) is to model the variations of the output variable Y (observations) by a contribution of fixed covariates X in the standard additive framework: observation = signal + noise

With mathematical notation: $Y_i = g(x_i) + \varepsilon_i$

$g(x)$ is called the regression function; it models the expected outcome $Y_i$ depending on the input variable $x_i$.


Linear regression

What is the principle of the linear regression model?

Solution

To obtain an accurate and interpretable model, the simplest choice for $g(x)$ is a linear function with unknown parameters:

$g(x) = \beta_0 + \beta_1 x$ with unknown intercept $\beta_0$ and slope $\beta_1$.

The linear model is based on an assumption regarding the form of the relationship between Y and X, and on the i.i.d. assumption for the error terms.

We can also add a Gaussian assumption on the error terms.

These assumptions must be checked afterwards.

The linear model is generally written $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ where $\varepsilon_i$ is a noise/error term. We generally assume that all $\varepsilon_i$ are independent and identically distributed with mean 0 and constant variance $\sigma^2$ (homoscedasticity).
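To make this concrete, here is a minimal simulation sketch (the parameter values are arbitrary choices for illustration): we generate data from $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ and check that lm() approximately recovers the parameters.

# simulate from the linear model (illustrative values: beta0 = 1, beta1 = 0.5, sigma = 2)
set.seed(42)
n <- 50
x <- runif(n, min = 0, max = 10)
eps <- rnorm(n, mean = 0, sd = 2)  # i.i.d. Gaussian errors
y <- 1 + 0.5 * x + eps
coef(lm(y ~ x))  # estimates should be close to (1, 0.5)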


Use

What are the use cases of the linear regression model?

Solution

  • Prediction: when Y is unknown but $X = x$ is known, we can predict Y using $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

  • Association: biological relation between variable X and response Y


What is the interpretation of the coefficients in the linear regression model, in particular when we draw the following regression line?

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) + 
    theme_bw()

Solution

Intercept: $\beta_0$ is the value of Y predicted by the model when $x = 0$.

Slope: it quantifies how much Y will increase/decrease when x increases slightly according to the model, i.e. $g(x+\delta) - g(x) = \beta_0 + \beta_1(x+\delta) - \beta_0 - \beta_1 x = \beta_1 \delta$

$\beta_1$ quantifies the strength of the linear association between Y and X.


Parameter estimation

(optional) Least squares

Parameters $\beta_0$ and $\beta_1$ are unknown.

They are estimated using all the data points by the least-squares method.

How does the least-squares method work?

Solution

To determine the best fitting line:

  • Use a sample $(x_i, y_i)_{i=1,\ldots,n}$ containing pairs of observations for X and Y
  • Purpose: for each $x_i$, we want the predicted value $\hat{y}_i = \beta_0 + \beta_1 x_i$ to be as close as possible to the observed value $y_i$
  • Choose $\beta_0$ and $\beta_1$ so that the sum of squared distances between predicted and observed points becomes as small as possible, i.e. the residual sum of squares (RSS): $\text{RSS} = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^n e_i^2$ with residuals $e_i$ the vertical distances from the observations to the fitted regression line

On the following graph, in blue is the linear regression line estimated by the least-squares method, and in red are the corresponding observed residuals $\hat{e}_i$ in the fitted model.


Estimators that minimise RSS

What are the values of $\beta_0$ and $\beta_1$ that minimize the RSS?

Solution

$\hat{\beta}_1 = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\text{cov}(x,y)}{\text{var}(x)} = \text{cor}(x,y) \frac{s_y}{s_x}$

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where $\text{cov}(x,y)$ is the empirical covariance between X and Y, $\text{var}(x)$ the empirical variance of X, $\text{cor}(x,y)$ the empirical correlation between X and Y, $s_x$ and $s_y$ respectively the empirical standard deviations of X and Y, and $\bar{x}$ and $\bar{y}$ respectively the empirical means of X and Y.

Note: the slope of the least squares fit is proportional to the correlation between the response and the predictor.
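As a sanity check, we can compute these closed-form estimators by hand in R and compare them with the output of lm() (a sketch, assuming the brcaSubset data from above):

# least-squares estimates from the closed-form formulas
x <- brcaSubset$ESR1
y <- brcaSubset$S100A8
beta1_hat <- cov(x, y) / var(x)             # slope
beta0_hat <- mean(y) - beta1_hat * mean(x)  # intercept
c(beta0_hat, beta1_hat)                     # ~ 208.47 and -0.059
coef(lm(S100A8 ~ ESR1, data = brcaSubset))  # same values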


How can we use the fitted model (i.e. the estimates of $\beta_0$ and $\beta_1$)?

Solution

The fitted model allows us to:

  • predict the response for subjects with a given value x of the predictor: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

  • assess how the mean response differs between two groups of subjects that differ by $\delta$ units in the predictor: $\hat{g}(x+\delta) - \hat{g}(x) = \hat{\beta}_1 \delta$


Breast cancer example

Here is the linear regression model fitted in R:

lm1 <- lm(S100A8 ~ ESR1, data = brcaSubset)
summary(lm1)

Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Residuals:
   Min     1Q Median     3Q    Max 
-95.43 -34.81  -6.79  34.23 145.21 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 208.47145   28.57207   7.296 7.56e-08 ***
ESR1         -0.05926    0.01212  -4.891 4.08e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.91 on 27 degrees of freedom
Multiple R-squared:  0.4698,    Adjusted R-squared:  0.4502 
F-statistic: 23.93 on 1 and 27 DF,  p-value: 4.078e-05

What is the equation of the linear regression line?

Solution

$\hat{y} = 208.47 - 0.059 x$


How do we interpret the evolution of S100A8 gene expression as a function of ESR1 expression?

Solution

The expected expression of S100A8 is on average 59 units lower for patients with an ESR1 expression level that is 1000 units higher.


What would be the predicted S100A8 expression level for patients with an ESR1 expression level of 2000? and an ESR1 expression level of 4000?

Solution

Expected S100A8 expression level for patients with an ESR1 expression level of 2000: $208.47 - 0.059 \times 2000 = 89.94$

Expected S100A8 expression level for patients with an ESR1 expression level of 4000: $208.47 - 0.059 \times 4000 = -28.58$ (!!!)
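We can check these values with predict(), using the lm1 model fitted above:

# predictions at ESR1 = 2000 and ESR1 = 4000
predict(lm1, newdata = data.frame(ESR1 = c(2000, 4000)))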


Reminder: here is a graphical representation of the data and the fitted linear regression line:

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) + 
    theme_bw()

What is the problem with the predicted S100A8 expression level for patients with an ESR1 expression level of 4000?

Solution

WARNING when you extrapolate! The assumption of linearity can only be assessed within the range of the available and observed data, and the linear regression model may predict unrealistic values for your response (here a negative expression level).

Note:

  • interpolation: prediction within the range of the available and observed data
  • extrapolation: prediction outside the range of the available and observed data

Statistical inference

To draw conclusions based on the regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, we need to know

  • how the least squares parameter estimators vary from sample to sample,
  • how they deviate under the null hypothesis that there is no association between predictor and response.

How could we proceed?

Solution

This requires a statistical model.

We need to model the distribution of Y given X explicitly (conditional distribution).


Assumptions

For the moment, we have made the following assumption:

  1. There is a linear relationship between the response and the predictor.

Besides linearity we need additional assumptions!

What other assumptions are we going to make?

Solution

  2. Independence: observations $(X_1,Y_1), \ldots, (X_n,Y_n)$ are made on n independent subjects (which is required to estimate the variance)

  3. Homoscedasticity or equal variances: observations vary with equal variance around the regression line

  • Residuals $\varepsilon_i$ have equal variance for each individual i (and thus for any $X_i = x$)
  • $\text{var}(\varepsilon_i) = \sigma^2$ for any x value
  • $\sigma$ is referred to as the residual standard deviation

  4. Normality: the residuals $\varepsilon_i$ are normally distributed with mean 0

(optional) Modeling the distribution of Y

What can you derive from assumptions 2, 3 and 4?

Solution

$\varepsilon_i \text{ i.i.d. } N(0, \sigma^2)$


And from assumptions 1, 2, 3, 4?

Solution

  • The conditional distribution of the response is Gaussian: $Y_i | X_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)$

  • Using these assumptions we can derive the variance of the estimators: $\sigma^2_{\hat{\beta}_0} = \frac{\sum_{i=1}^n X_i^2}{\sum_{i=1}^n (X_i - \bar{X})^2} \times \frac{\sigma^2}{n}$ and $\sigma^2_{\hat{\beta}_1} = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}$

  • Using the Gaussian assumption we can show that the parameter estimators are normally distributed: $\hat{\beta}_0 \sim N(\beta_0, \sigma^2_{\hat{\beta}_0})$ and $\hat{\beta}_1 \sim N(\beta_1, \sigma^2_{\hat{\beta}_1})$


(optional) High spread of X improves the precision

We recall the variance of the estimator of $\beta_1$: $\sigma^2_{\hat{\beta}_1} = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}$

What can you say about the two regression examples in the next plot regarding the variance of the estimate $\hat{\beta}_1$?

Solution

In the first case, the observations for X are widely spread, so the sum of squared deviations from the empirical mean $\sum_{i=1}^n (X_i - \bar{X})^2$ will be larger and the variance of $\hat{\beta}_1$ will be lower, hence a better precision when estimating $\beta_1$.

In the second case, the observations for X are not widely spread, so the sum of squared deviations from the empirical mean $\sum_{i=1}^n (X_i - \bar{X})^2$ will be smaller and the variance of $\hat{\beta}_1$ will be higher, hence a lower precision.


The conditional variance $\sigma^2$ is unknown. How can we estimate it?

Solution

  • The estimate $\hat{\sigma}^2$ of $\sigma^2$ is based on the mean squared error (MSE)

  • The MSE is the average squared error: $\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = \frac{1}{n} \sum_{i=1}^n e_i^2$

  • The estimate of $\sigma^2$ is given by: $\hat{\sigma}^2 = \frac{n}{n-2} \text{MSE} = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = \frac{1}{n-2} \sum_{i=1}^n e_i^2$

  • This estimator is based on independence (assumption 2) and equality of the variance (assumption 3).

  • Note the division by $n - 2$.
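As an illustration, we can recompute $\hat{\sigma}$ from the residuals of the lm1 model fitted above and compare it with the "Residual standard error" reported by summary():

# estimate sigma from the residuals (n - 2 degrees of freedom)
e <- residuals(lm1)
n <- length(e)
sqrt(sum(e^2) / (n - 2))  # ~ 59.91
sigma(lm1)                # the same value, extracted directly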


Upon estimation of $\sigma^2$, what are the standard errors (i.e. estimates of the standard deviations) of $\hat{\beta}_0$ and $\hat{\beta}_1$?

Solution

Upon estimation of $\sigma^2$ we obtain the following standard errors:

$SE_{\hat{\beta}_0} = \hat{\sigma}_{\hat{\beta}_0} = \hat{\sigma} \sqrt{\frac{\sum_{i=1}^n X_i^2}{n \sum_{i=1}^n (X_i - \bar{X})^2}}$

$SE_{\hat{\beta}_1} = \hat{\sigma}_{\hat{\beta}_1} = \hat{\sigma} \sqrt{\frac{1}{\sum_{i=1}^n (X_i - \bar{X})^2}}$

Reminder: $\hat{\sigma}^2 = \frac{n}{n-2} \text{MSE}$
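A quick check of these formulas in R, assuming lm1 and brcaSubset from above:

# standard errors of the intercept and slope from the closed-form formulas
x <- brcaSubset$ESR1
n <- length(x)
sig <- sigma(lm1)
SE_b0 <- sig * sqrt(sum(x^2) / (n * sum((x - mean(x))^2)))
SE_b1 <- sig * sqrt(1 / sum((x - mean(x))^2))
c(SE_b0, SE_b1)  # ~ 28.57 and 0.0121, as in the summary() table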


How can we use the previous results on the distribution of the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ to build confidence intervals and tests?

Solution

  • Thanks to the distribution of the parameter estimators, we can construct confidence intervals and tests using $T = \frac{\hat{\beta}_k - \beta_k}{SE(\hat{\beta}_k)}$ with $k = 0, 1$

  • If all assumptions are valid, T follows a Student $t_{n-2}$ distribution with $n - 2$ degrees of freedom

  • If normality does not hold, but independence, linearity and equality of variance do and the dataset is large, we can rely on the Central Limit Theorem


Breast cancer example

Reminder:

lm1 <- lm(S100A8 ~ ESR1, data = brcaSubset)
lm1

Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Coefficients:
(Intercept)         ESR1  
  208.47145     -0.05926  

The association between S100A8 and ESR1 gene expression is negative.

How can we generalize the effect observed in the sample to the population using the confidence interval on $\hat{\beta}_1$?

Solution

$[\hat{\beta}_1 - t_{n-2,\alpha/2} SE_{\hat{\beta}_1},\ \hat{\beta}_1 + t_{n-2,\alpha/2} SE_{\hat{\beta}_1}]$

confint(lm1)
                   2.5 %       97.5 %
(Intercept) 149.84639096 267.09649989
ESR1         -0.08412397  -0.03440378
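We can reproduce this interval by hand from the formula above (a sketch, assuming lm1 from above):

# 95% confidence interval for the slope
est <- coef(summary(lm1))
tcrit <- qt(0.975, df = df.residual(lm1))
est["ESR1", "Estimate"] + c(-1, 1) * tcrit * est["ESR1", "Std. Error"]  # ~ [-0.0841, -0.0344]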

(optional) Hypothesis test

How could we statistically test the significance of the effect of ESR1 gene expression on S100A8 gene expression?

Solution

We can restate the hypothesis of association between the S100A8 and ESR1 gene expression in the form of a null hypothesis on the parameters of the model.

  • Under the null hypothesis of the absence of an association between the expression of both genes: $H_0: \beta_1 = 0$
  • Under the alternative hypothesis, there is an association between the expression of both genes: $H_1: \beta_1 \neq 0$

Test statistic: $T = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$

Under $H_0$, the statistic follows a t-distribution with $n - 2$ degrees of freedom.
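The t-statistic and the two-sided p-value can be recomputed by hand (a sketch, assuming lm1 from above):

# t-test for H0: beta1 = 0
est <- coef(summary(lm1))
tstat <- est["ESR1", "Estimate"] / est["ESR1", "Std. Error"]   # ~ -4.891
2 * pt(abs(tstat), df = df.residual(lm1), lower.tail = FALSE)  # ~ 4.08e-05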


Breast cancer example

summary(lm1)

Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Residuals:
   Min     1Q Median     3Q    Max 
-95.43 -34.81  -6.79  34.23 145.21 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 208.47145   28.57207   7.296 7.56e-08 ***
ESR1         -0.05926    0.01212  -4.891 4.08e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.91 on 27 degrees of freedom
Multiple R-squared:  0.4698,    Adjusted R-squared:  0.4502 
F-statistic: 23.93 on 1 and 27 DF,  p-value: 4.078e-05

What can you say about association between the S100A8 and ESR1 gene expression?

Solution

The association between the S100A8 and ESR1 expression is extremely significant ($p \ll 0.001$).


What did we forget to check?

Solution

  • But, we first have to check all assumptions!
  • Otherwise the conclusions based on the statistical test and the CI can be incorrect.

Assumption assessment

What assumptions should we check?

Solution

  • Independence: design
  • Linearity: inference is useless if the association is not linear
  • Homoscedasticity: inference/p-value is incorrect if data are heteroscedastic
  • Normality: inference/p-value is incorrect if data are not normally distributed (especially in small samples)

Linearity

brcaSubset %>%
    ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    theme_bw()

Residual analysis

The assumption of linearity is typically assessed using a residual plot:

  • predictions $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ on the X-axis
  • residuals $\hat{e}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$ on the Y-axis

plot(lm1, which = 1)

With the residual plot, we can also check the assumption that the average of the residuals is around 0.

Homoskedasticity (equal variances)

Residuals and squared residuals carry information on the residual variability:

  • An association with the predictors is an indication of heteroscedasticity
  • Scatterplot of $\hat{e}_i$ versus $x_i$ or the predictions $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
  • Scatterplot of the standardized residuals versus $x_i$ or the predictions

performance::check_heteroskedasticity(lm1)
Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.009).

(caution with small sample size)

Normality

  • If the sample size is large, the estimators are normally distributed even if the observations are not normally distributed: central limit theorem
  • How many observations are needed? It depends on the shape and magnitude of the deviations
  • Assumption: data are normally distributed conditional on X: $Y_i | X_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)$
  • QQ-plot of the residuals $\hat{e}_i$

Example of non-Gaussian data and approximately Gaussian residuals:

QQ-plot for our model:

plot(lm1, which = 2)

performance::check_normality(lm1)
OK: residuals appear as normally distributed (p = 0.212).

(caution with small sample size)

Model evaluation

You can use the default R plot() function to get a visual summary of your model residuals:

plot(lm1)

or you can use the check_model() function from the performance package:

performance::check_model(lm1)

Invalid assumptions?

If the assumptions regarding the model are not verified, what can we do?

Solution

We can either:

  • use the fitted model to do prediction but without statistical inference (so no test, nor uncertainty quantification); we would then need to rely on geometric model evaluation metrics like the $R^2$ coefficient and analysis of variance (cf. below), or on prediction error quantification;

  • apply a transformation to the data.

Transformation of the predictor X does not change the distribution of Y for given X:

  • it is not useful to obtain homoscedasticity or Normal distribution
  • it is useful to improve linearity when normality and homoscedasticity are valid
  • it is also possible to include higher-order terms $X^2, X^3, \ldots$: $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \ldots + \varepsilon_i$

Transformation of the response Y can be useful to obtain normality and homoscedasticity:

  • $\sqrt{Y}$
  • $\log(Y)$
  • $1/Y$
  • etc.

Breast cancer example

What are the issues with the linear regression model in the case of our breast cancer data?

Solution

Problems with

  • heteroscedasticity
  • possibly deviations from normality (skewed to the right)
  • negative concentration predictions are theoretically impossible
  • non-linearity

This is often the case for concentration and intensity measurements:

  • These are often log-normally distributed (normal distribution upon log-transformation)
  • We also observed a kind of exponential relation with the smoother
  • In the gene expression literature, the log2 transformation is often adopted
  • gene expression on log2 scale: differences on the log scale are fold changes on the original scale!

We now go back to the original data, including the outliers that we removed earlier.

Here is a scatterplot of the original data:

brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_smooth() +
    theme_bw()

And here is a scatterplot of the log2-transformed data:

brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
    geom_point() +
    geom_smooth() +
    theme_bw()

Note: the blue line represents a non-linear smoother with its confidence band.

What can you say?

Solution

The linear relationship between S100A8 and ESR1 gene expression is not clear at all when using the original measurements, but appears more clearly when using the log2-transformed measurements.


Here is the linear regression model for the transformed gene expression:

lm2 <- lm(log2(S100A8) ~ log2(ESR1), data = brca)
performance::check_model(lm2)

performance::check_normality(lm2)
OK: residuals appear as normally distributed (p = 0.888).
performance::check_heteroskedasticity(lm2)
OK: Error variance appears to be homoscedastic (p = 0.430).
summary(lm2)

Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.94279 -0.66537  0.08124  0.68468  1.92714 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,    Adjusted R-squared:  0.7874 
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12
confint(lm2)
                2.5 %    97.5 %
(Intercept) 20.128645 26.674023
log2(ESR1)  -1.921047 -1.308185

What can you say about it?

Solution

The linearity, normality and homoscedasticity assumptions now seem to be satisfied (at least we cannot significantly reject them).


Exploration of the model with log2 transformed data

How do we interpret the slope in the model with log2-transformed data?

Solution

A patient with an ESR1 expression that is one unit higher on the log2 scale than that of another patient has on average a log2 expression for S100A8 that is 1.61 units lower (95% CI [-1.92, -1.31]).

We denote by $(x_1, y_1)$ the respective gene expressions for ESR1 and S100A8 in patient 1, and similarly $(x_2, y_2)$ for patient 2.

And we assume that $\log_2(x_2) = \log_2(x_1) + 1$, i.e. $x_2 = 2 x_1$.

$\log_2(\hat{y}_1) = 23.401 - 1.615 \times \log_2(x_1)$

$\log_2(\hat{y}_2) = 23.401 - 1.615 \times \log_2(x_2)$

$\log_2\left(\frac{\hat{y}_2}{\hat{y}_1}\right) = \log_2(\hat{y}_2) - \log_2(\hat{y}_1) = -1.615 \times (\log_2(x_2) - \log_2(x_1)) = -1.615 \times 1 = -1.615$
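Equivalently, on the original scale the slope acts as a fold change (a sketch, assuming lm2 from above):

# a doubling of ESR1 multiplies the expected S100A8 by 2^beta1
2^coef(lm2)["log2(ESR1)"]  # ~ 0.33, i.e. roughly a 3-fold decrease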


Prediction

We can compute the predicted values $\hat{y}_i$ for each data point $x_i$ with the fitted model:

brca$pred_y <- predict(lm2)

We can compare the predicted and observed values:

ggplot(brca, aes(x = S100A8, y = pred_y)) +
    geom_point() +
    theme_bw()

What is happening? Are the predicted $\hat{y}_i$ close to the original $y_i$?

Solution

The predicted values are in the log-transformed space whereas the original Y (S100A8) values are not log-transformed.

If we log-transform the response S100A8, we can compare it with the predicted values in the same log-transformed space:

ggplot(brca, aes(x = log2(S100A8), y = pred_y)) +
    geom_point() +
    geom_abline(intercept = 0, slope = 1, linetype="dotted") + 
    coord_fixed() +
    theme_bw()

Note: the dotted line represents the line of equation $y = x$, and we see that the observed and predicted values are close.


We can also draw the regression line in the log-transformed space and represent the predicted values (blue crosses):

# extract the coefficients
hat_beta <- coef(lm2)
hat_beta_0 <- hat_beta[1]
hat_beta_1 <- hat_beta[2]

# plot
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
    geom_point() +
    geom_point(aes(y = pred_y), col = "blue", shape = 3) +
    geom_function(
        fun = function(x) return(hat_beta_0 + hat_beta_1*x), 
        col = "red") +
    theme_bw()

Back-transformation

We can transform back the predicted values from the model on the log2-transformed data with the inverse transformation ($z \mapsto 2^z$) and represent the regression line in the original data space:

brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_point(aes(y = 2^pred_y), col = "blue", shape = 3) +
    geom_function(
        fun = function(x) return(2^(hat_beta_0 + hat_beta_1*log2(x))), 
        col = "red") +
    theme_bw()

(optional) Inference on the mean outcome

A regression model can also be used for prediction: in particular, for the inference of the average outcome (i.e. the mean response) for a given value $X = x$, i.e. $\hat{g}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$

  • $\hat{g}(x)$ is an estimator of the conditional mean $E[Y | X = x]$
  • Since the parameter estimators are normally distributed and unbiased, the estimator $\hat{g}(x)$ is also normally distributed and unbiased.

$SE_{\hat{g}(x)} = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x - \bar{X})^2}{\sum_{i=1}^n (X_i - \bar{X})^2}}$

$T = \frac{\hat{g}(x) - g(x)}{SE_{\hat{g}(x)}} \sim t_{n-2}$

We can use the predict() function in R to compute the inferred average outcome for the response and the related confidence intervals.

  • newdata argument: predictor values (x-values) at which we want to calculate the mean response
  • interval="confidence" argument to obtain CI.
  • without newdata argument we perform predictions for all predictor values in the dataset used to fit the model
grid <- 140:4000
g <- predict(lm2, newdata = data.frame(ESR1 = grid), interval = "confidence")
head(g)
       fit      lwr      upr
1 11.89028 10.76082 13.01974
2 11.87370 10.74721 13.00019
3 11.85724 10.73370 12.98078
4 11.84089 10.72028 12.96151
5 11.82466 10.70696 12.94237
6 11.80854 10.69372 12.92336

Note: we do not have to transform the new data that we specified for the ESR1 expression because we fitted the model with a call to the lm() function and specified the transformation within the formula.

We represent the inferred mean outcome (in red) with the confidence interval upper and lower bounds (in grey) in the log2-transformed data space:

newdata <- data.frame(cbind(grid = log2(grid), g))
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata, col = "red") +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    theme_bw()

Back-transformation

We can transform back the inferred mean outcome from the model on the log2-transformed data with the inverse transformation ($z \mapsto 2^z$) and represent the inferred mean outcome (in red) with the confidence interval upper and lower bounds (in grey) in the original data space:

newdata <- data.frame(cbind(grid, 2^g))
brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata, col = "red") +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    theme_bw()

(optional) Prediction intervals

We can also make a prediction for the location of a new observation that would be collected in a new experiment for a patient with a particular value for their ESR1 expression.

It is important to notice that this experiment still has to be conducted. So we want to predict the non-observed individual expression value for a novel patient.

For a novel independent observation Y: $Y = g(x) + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$ and $\varepsilon$ independent of the observations in the sample $Y_1, \ldots, Y_n$.

We predict a new log2-S100A8 value for a patient with a known log2-ESR1 expression level x: $\hat{y}(x) = \hat{\beta}_0 + \hat{\beta}_1 \times x$

What is the relationship between the mean outcome and the predicted value for the response?

Solution

The estimated mean outcome and the prediction for a new observation are equal.

But their sampling distributions are different!

  • Uncertainty on the estimated mean outcome comes from the uncertainty on the estimated model parameters $\hat{\beta}_0$ and $\hat{\beta}_1$.
  • Uncertainty on a new observation $\hat{Y}$ comes from the uncertainty on the estimated mean plus additional uncertainty because the new observation will deviate around the mean!

$SE_{\hat{Y}(x)} = \sqrt{\hat{\sigma}^2 + \hat{\sigma}^2_{\hat{g}(x)}} = \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{X})^2}{\sum_{i=1}^n (X_i - \bar{X})^2}}$

$\frac{\hat{Y}(x) - Y}{SE_{\hat{Y}(x)}} \sim t_{n-2}$

  • Note that a prediction interval (PI) is an improved version of a reference interval when the model parameters are unknown: uncertainty on model parameters + t-distribution.

pred <- predict(lm2, newdata = data.frame(ESR1 = grid), interval = "prediction")
head(pred)
       fit      lwr      upr
1 11.89028 9.510524 14.27004
2 11.87370 9.495354 14.25205
3 11.85724 9.480288 14.23419
4 11.84089 9.465324 14.21646
5 11.82466 9.450461 14.19886
6 11.80854 9.435698 14.18138

We represent the predicted values with the prediction interval upper and lower bounds (in blue) and the confidence interval upper and lower bounds for the mean outcome (in grey) in the log2-transformed data space:

newdata <- data.frame(cbind(grid = log2(grid), g))
preddata <- data.frame(cbind(grid = log2(grid), pred))
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata) +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = lwr), preddata, color = "blue") +
    geom_line(aes(x = grid, y = upr), preddata, color = "blue") +
    theme_bw()

We can transform back the predicted values from the model on the log2-transformed data with the inverse transformation ($z \mapsto 2^z$) and represent the predicted values with the prediction interval upper and lower bounds (in blue) and the confidence interval upper and lower bounds for the mean outcome (in grey) in the original data space:

newdata <- data.frame(cbind(grid, 2^g))
preddata <- data.frame(cbind(grid, 2^pred))
brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
    geom_point() +
    geom_line(aes(x = grid, y = fit), newdata) +
    geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
    geom_line(aes(x = grid, y = lwr), preddata, color = "blue") +
    geom_line(aes(x = grid, y = upr), preddata, color = "blue") +
    theme_bw()

Sum of squares and ANOVA table

Total sum of squares

What is the total sum of squares?

Solution

$\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2$


  • TSS can be used to estimate the variance of the marginal distribution of the response.

Sum of squares of the model

$\text{MSS} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \sum_{i=1}^n (\hat{g}(x_i) - \bar{y})^2$

  • It is a measure of the deviation between the predictions on the regression line and the marginal mean of the response.

  • Another interpretation: difference between two models

    • Estimated model: $\hat{g}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$
    • Estimated model without predictor (only intercept): $g(x) = \beta_0$; the estimate of $\beta_0$ will then be equal to $\bar{y}$.
  • MSS measures the size of the effect of the predictor

Sum of squares of the residuals

$\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \{y_i - \hat{g}(x_i)\}^2$

  • The smaller the RSS, the better the fit

  • Least squares method!

Sum of squares decomposition

We can show that TSS can be decomposed as $\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \text{RSS} + \text{MSS}$ (the cross term vanishes for the least-squares fit).

  • Total variability in the data (TSS) is partially explained by the predictor (MSS).
  • Variability that we cannot explain with the regression model is the residual variability (RSS).
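We can verify this decomposition numerically (a sketch, assuming brca and lm2 from above):

# sums of squares for lm2
y <- log2(brca$S100A8)
y_hat <- fitted(lm2)
TSS <- sum((y - mean(y))^2)
MSS <- sum((y_hat - mean(y))^2)
RSS <- sum((y - y_hat)^2)
c(TSS = TSS, "MSS + RSS" = MSS + RSS)  # both ~ 153.37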

Determination coefficient

$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = \frac{\text{MSS}}{\text{TSS}}$

  • Fraction of the total variability of the sample outcomes explained by the model.

  • A large $R^2$ indicates that the model has the potential to make good predictions (small RSS).

  • It is not very informative about the p-value of the test $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$.

    • The p-value is largely determined by RSS and the sample size n, but not by TSS.
    • $R^2$ is determined by RSS and TSS, but not by the sample size n.
  • A model with a low $R^2$ is still useful to study associations, as long as the association is modeled correctly!
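Using the sums of squares computed in the previous sketch, we can recover the $R^2$ reported by summary():

# determination coefficient, two equivalent ways
1 - RSS / TSS           # ~ 0.7942
MSS / TSS               # same value
summary(lm2)$r.squared  # matches the "Multiple R-squared" below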

Breast cancer example

summary(lm2)

Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.94279 -0.66537  0.08124  0.68468  1.92714 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,    Adjusted R-squared:  0.7874 
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12

(optional) F-Test in simple linear model

  • Sums of squares are the basis for F-tests: $F = \frac{\text{MMS}}{\text{RMS}}$

with $\text{MMS} = \frac{\text{MSS}}{1}$ and $\text{RMS} = \frac{\text{RSS}}{n-2}$.

  • MMS = model mean of squares,

  • RMS = residual mean of squares,

  • the denominators 1 and $n - 2$ are the degrees of freedom of MSS and RSS.

  • Under $H_0: \beta_1 = 0$: $F = \frac{\text{MMS}}{\text{RMS}} \sim F_{1, n-2}$

  • The F-test is always two-sided! For $H_1: \beta_1 \neq 0$: $p = P_{H_0}[F \geq f] = 1 - F_F(f; 1, n-2)$
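We can recompute the F-statistic and its p-value from the ANOVA sums of squares (a sketch, assuming lm2 from above; the anova(lm2) table is shown further below):

# F-test by hand
a <- anova(lm2)
f <- a["log2(ESR1)", "Mean Sq"] / a["Residuals", "Mean Sq"]  # MMS / RMS, ~ 115.8
pf(f, df1 = 1, df2 = df.residual(lm2), lower.tail = FALSE)   # ~ 8.07e-12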

summary(lm2)

Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.94279 -0.66537  0.08124  0.68468  1.92714 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,    Adjusted R-squared:  0.7874 
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12

Anova Table

               Df             Sum Sq   Mean Sq   F value       Pr(>F)
Regression     regression df  MSS      MMS       f-statistic   p-value
Error          error df       RSS      RMS
anova(lm2)
Analysis of Variance Table

Response: log2(S100A8)
           Df  Sum Sq Mean Sq F value   Pr(>F)    
log2(ESR1)  1 121.814 121.814   115.8 8.07e-12 ***
Residuals  30  31.559   1.052                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercises

Exercise 1

Bivariate data relating the number of classroom absences and the number of points earned prior to the final exam in the course were gathered from a random sample of 15 students.

Here is the table gathering the collected data:

absences points
1 552
9 210
6 315
4 454
0 619
5 378
13 108
2 534
5 398
3 462
0 593
1 645
0 602
7 315
10 307

Here is the scatter plot representing the data:

We also give the following quantities (where x = the number of absences and y = the number of points):

  • $\bar{x} = 4.4$
  • $\bar{y} = 432.8$
  • $\text{cov}(x,y) = -625.63$
  • $\text{sd}(x) = 4.01$
  • $\text{sd}(y) = 160.95$
  1. What is the correlation between the number of absences and the number of points earned by the students?

  2. From your answer to the previous question and from the scatter plot figures, what could be the link between the number of absences and the number of points earned by the students?

  3. How would you model this relationship?

  4. Can you compute the estimation of the parameters in this model?

  5. What assumptions are generally done in this context?

  6. How many points would a student with 7 absences probably earn in the course? How many points would a student with 20 absences probably earn? Do you need any of the previous assumptions to make these predictions?
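To check your answers, you can enter the data in R (a sketch; the object name ex1 is just an illustrative choice):

# exercise 1 data
ex1 <- tibble::tibble(
    absences = c(1, 9, 6, 4, 0, 5, 13, 2, 5, 3, 0, 1, 0, 7, 10),
    points   = c(552, 210, 315, 454, 619, 378, 108, 534, 398, 462, 593, 645, 602, 315, 307)
)
cor(ex1$absences, ex1$points)      # question 1
lm(points ~ absences, data = ex1)  # questions 3 and 4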

Exercise 2

Physicians feel that there is a relationship between a person’s age and the number of days they are sick that year. A random sample of 20 patients was obtained to this end comparing age and days sick.

Here is the table gathering the collected data:

age sickDays
12 0
19 2
35 8
4 10
68 21
42 16
21 18
24 24
32 29
75 15
14 3
8 7
25 12
39 17
56 15
18 11
72 31
64 26
45 20
87 30

We also give the following quantities (where x = the age and y = the number of sick days):

  • $\bar{x} = 38$
  • $\bar{y} = 15.75$
  • $\text{cov}(x,y) = 157.89$
  • $\text{sd}(x) = 24.77$
  • $\text{sd}(y) = 9.3$

We fit a linear model and we get the following results:

  • $\sum_{i=1}^n (y_i - \bar{y})^2 = 1643.75$
  • $\sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = 771.87$
  • $\sum_{i=1}^n (\hat{y}_i - y_i)^2 = 871.88$

Here is the scatter plot representing the data and we also represent the fitted linear model:

  1. What can you say about this situation? How is the linear model suitable to investigate the physicians' hypothesis?

  2. How was the linear model fitted?

  3. What can you say about this model?

  4. How many days in a year would you predict that a 60 year old person will be sick?
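As for exercise 1, you can enter the data in R to check your answers (a sketch; ex2 is an illustrative name):

# exercise 2 data
ex2 <- tibble::tibble(
    age      = c(12, 19, 35, 4, 68, 42, 21, 24, 32, 75, 14, 8, 25, 39, 56, 18, 72, 64, 45, 87),
    sickDays = c(0, 2, 8, 10, 21, 16, 18, 24, 29, 15, 3, 7, 12, 17, 15, 11, 31, 26, 20, 30)
)
lm_ex2 <- lm(sickDays ~ age, data = ex2)
summary(lm_ex2)                                  # questions 2 and 3
predict(lm_ex2, newdata = data.frame(age = 60))  # question 4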