Tutorial 6: Linear Regression

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This tutorial is directly adapted from Lieven Clement and Milan Malfait's tutorial on linear regression in their "Practical Statistics for the Life Sciences" course.

Requirements

We first load the required packages:

```r
library(tidyverse)
library(Rmisc)        # functions useful for data analysis and utility operations
library(GGally)       # ggplot2 extension
library(performance)  # evaluate models
```

Note: we only need the ggplot2, dplyr, readr, tidyr and tibble packages from the tidyverse meta-package. Thus, if you encounter issues when installing tidyverse, you can install only these four packages.
Breast cancer dataset
This practical uses a subset of the data from a more general study [1].
[1] Christos Sotiriou, Pratyaksha Wirapati, Sherene Loi, Adrian Harris, Steve Fox, Johanna Smeds, Hans Nordgren, Pierre Farmer, Viviane Praz, Benjamin Haibe-Kains, Christine Desmedt, Denis Larsimont, Fatima Cardoso, Hans Peterse, Dimitry Nuyten, Marc Buyse, Marc J. Van de Vijver, Jonas Bergh, Martine Piccart, Mauro Delorenzi, Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis, JNCI: Journal of the National Cancer Institute, Volume 98, Issue 4, 15 February 2006, Pages 262–272, https://doi.org/10.1093/jnci/djj052
We consider 32 breast cancer patients with estrogen-receptor-positive tumors who received tamoxifen treatment.
Loading and introducing the data
The data are available in this repository and can be loaded with the following command:
```r
brca <- read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/breastcancer.csv")
brca
```

```
# A tibble: 32 × 10
   sample_name filename     treatment    er grade  node  size   age  ESR1 S100A8
   <chr>       <chr>        <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 OXFT_209    gsm65344.ce… tamoxifen     1     3     1   2.5    66 1939.  207.
 2 OXFT_1769   gsm65345.ce… tamoxifen     1     1     1   3.5    86 2752.   37.0
 3 OXFT_2093   gsm65347.ce… tamoxifen     1     1     1   2.2    74  379.  2364.
 4 OXFT_1770   gsm65348.ce… tamoxifen     1     1     1   1.7    69 2532.   23.6
 5 OXFT_1342   gsm65350.ce… tamoxifen     1     3     0   2.5    62  141.  3219.
 6 OXFT_2338   gsm65352.ce… tamoxifen     1     3     1   1.4    63 1495.   108.
 7 OXFT_2341   gsm65353.ce… tamoxifen     1     1     1   3.3    76 3406.   14.0
 8 OXFT_1902   gsm65354.ce… tamoxifen     1     3     0   2.4    61 2813.   68.4
 9 OXFT_1982   gsm65355.ce… tamoxifen     1     1     0   1.7    62  950.   74.2
10 OXFT_5210   gsm65356.ce… tamoxifen     1     3     0   3.5    65 1053.   182.
# ℹ 22 more rows
```
The recorded variables include:

- grade: histological grade of tumor (grade 1 vs 3),
- node: lymph node status (0: not affected, 1: lymph nodes affected and removed),
- size: tumor size in cm,
- ESR1 and S100A8: gene expression measurements for 2 genes in the tumor biopsy (microarray technology)
Notes:

- Gene ESR1 (estrogen receptor 1) is expressed in about 75% of breast cancer tumors.
- Patients who express ESR1 respond better to treatment by hormone therapy.
- Tamoxifen interacts with the ER and modulates its expression.
- Proteins of the S100 family are low-molecular-weight proteins with calcium-binding sites. Their expression is often dysregulated in cancer. The expression of S100A8 represses the immune system in the tumor and creates an environment of inflammation that promotes tumor growth.
Here is a representation of the distribution of the S100A8 gene expression:
```r
brca %>% ggplot(aes(x = "", y = S100A8)) +
  geom_boxplot() +
  xlab("") +
  theme_bw()
```
What can you say about the observed values for S100A8 gene expression?
Solution
Apparently, the measurement of S100A8 expression failed for 3 individuals; these observations can be considered outliers.
We remove the outliers:
```r
brcaSubset <- brca %>% filter(S100A8 < 2000)
```
Here is the new boxplot representation of S100A8 gene expression after removing the outliers:
```r
brcaSubset %>% ggplot(aes(x = "", y = S100A8)) +
  geom_boxplot() +
  xlab("") +
  theme_bw()
```
Association between ESR1 and S100A8 expression
We will study the possible relationship between the expression of the ESR1 and S100A8 genes:
```r
brcaSubset %>%
  ggplot(aes(x = ESR1, y = S100A8)) +
  geom_point() +
  theme_bw()
```
What kind of method are we going to use?
Solution
We want to investigate the possible relationship between two quantitative variables: this is a regression problem.
Regression
How can we formalize the regression framework?
Solution
We start with a regression between two variables: a response $Y$ and a predictor $X$.

For instance, we may consider the following setting:

- Response $Y$: S100A8 expression
- Predictor $X$: ESR1 expression
The regression model
Can you describe the principle of the regression model?

Solution

The principle of regression (not necessarily linear) is to model the variations of the response variable $Y$ as a function of the predictor $X$, up to an error term.

With mathematical notation: $Y = f(X) + \epsilon$, where $f$ is an unknown function and $\epsilon$ an error term with mean zero.
Linear regression
Can you describe the principle of the linear regression model?

Solution

To obtain an accurate and interpretable model, the simplest choice for $f$ is a linear function.

The linear model is based on an assumption regarding the form of the relationship between $X$ and $Y$: the mean of $Y$ is assumed to vary linearly with $x$, i.e. $E[Y \mid X = x] = \beta_0 + \beta_1 x$.

We can also add a Gaussian assumption on the error terms.

These assumptions must be checked afterwards.

The linear model is generally written: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, with $E[\epsilon_i] = 0$ (and, under the Gaussian assumption, $\epsilon_i \sim N(0, \sigma^2)$).
Use
What are the use cases of the linear regression model?
Solution
- Prediction: when $Y$ is unknown but $x$ is known, we can predict $Y$ using $\hat{Y} = \hat\beta_0 + \hat\beta_1 x$.
- Association: quantify the biological relation between variable $X$ and response $Y$.
What is the interpretation of the coefficients in the linear regression model, in particular when we draw the following regression line?
```r
brcaSubset %>%
  ggplot(aes(x = ESR1, y = S100A8)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()
```
Solution
- Intercept $\beta_0$: the expected value of the response when the predictor is $x = 0$.
- Slope $\beta_1$: it quantifies how much the mean response changes when the predictor increases by one unit.
Parameter estimation
(optional) Least squares
Parameters $\beta_0$ and $\beta_1$ are unknown. They are estimated using all the data points by the least-squares method.
How does the least-squares method work?
Solution
To determine the best fitting line:

- Use a sample $(x_1, Y_1), \ldots, (x_n, Y_n)$ containing pairs of observations for $X$ and $Y$.
- Purpose: for each $x_i$, we want the predicted value $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 x_i$ to be as close as possible to the observed value $Y_i$.
- Choose $\hat\beta_0$ and $\hat\beta_1$ so that the sum of squared distances between predicted and observed points becomes as small as possible, i.e. minimize the residual sum of squares (RSS): $\text{RSS} = \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 = \sum_{i=1}^{n} e_i^2$, with residuals $e_i = Y_i - \hat{Y}_i$, the vertical distances from the observations to the fitted regression line.
On the following graph, in blue is the linear regression line estimated by the least-squares method, and in red are the corresponding observed values.
Estimators that minimise RSS
What are the values of the estimators $\hat\beta_0$ and $\hat\beta_1$ that minimise the RSS?

Solution

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{x}$$

Note: the slope of the least-squares fit is proportional to the correlation between the response and the predictor.
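As a quick check, these closed-form estimators can be computed by hand in R and compared with the output of lm() (a minimal sketch reusing the brcaSubset data defined above):

```r
# Least-squares estimates computed from the closed-form formulas
x <- brcaSubset$ESR1
y <- brcaSubset$S100A8
beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0 <- mean(y) - beta_1 * mean(x)
c(beta_0, beta_1)
# Should match the coefficients estimated by lm()
coef(lm(S100A8 ~ ESR1, data = brcaSubset))
```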
How can we use the fitted model (i.e. the estimates of $\beta_0$ and $\beta_1$)?

Solution

The fitted model allows us to:

- predict the response for subjects with a given value $x$ for the predictor: $\hat{E}[Y \mid X = x] = \hat\beta_0 + \hat\beta_1 x$;
- assess how the mean response differs between two groups of subjects that differ by $\delta$ units in the predictor: $\hat{E}[Y \mid X = x + \delta] - \hat{E}[Y \mid X = x] = \hat\beta_1 \delta$.
Breast cancer example
Here is the linear regression model fitted in R:
```r
lm1 <- lm(S100A8 ~ ESR1, data = brcaSubset)
summary(lm1)
```

```
Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Residuals:
   Min     1Q Median     3Q    Max
-95.43 -34.81  -6.79  34.23 145.21

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 208.47145   28.57207   7.296 7.56e-08 ***
ESR1         -0.05926    0.01212  -4.891 4.08e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.91 on 27 degrees of freedom
Multiple R-squared:  0.4698,	Adjusted R-squared:  0.4502
F-statistic: 23.93 on 1 and 27 DF,  p-value: 4.078e-05
```
What is the equation of the linear regression line?
Solution

$\widehat{\text{S100A8}} = 208.47 - 0.0593 \times \text{ESR1}$
How does the S100A8 gene expression evolve depending on the ESR1 expression?
Solution
The expected expression of S100A8 is on average about 59 units lower for patients whose ESR1 expression level is 1000 units higher (since $-0.05926 \times 1000 \approx -59$).
What would be the predicted S100A8 expression level for patients with an ESR1 expression level of 2000? and an ESR1 expression level of 4000?
Solution
Expected S100A8 expression level for patients with an ESR1 expression level of 2000: $208.47 - 0.0593 \times 2000 \approx 90$.

Expected S100A8 expression level for patients with an ESR1 expression level of 4000: $208.47 - 0.0593 \times 4000 \approx -29$.
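These predictions can also be obtained in R with the predict() function applied to the fitted lm1 model:

```r
# Predicted S100A8 expression at ESR1 = 2000 and ESR1 = 4000
predict(lm1, newdata = data.frame(ESR1 = c(2000, 4000)))
```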
Reminder: here is a graphical representation of the data and the fitted linear regression line:
```r
brcaSubset %>%
  ggplot(aes(x = ESR1, y = S100A8)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()
```
What could be the problem with the predicted S100A8 expression level for patients with an ESR1 expression level of 4000?
Solution
Warning when you extrapolate! The linearity assumption can only be assessed within the range of the available and observed data, and outside this range the linear regression model may predict unrealistic values for your response (here a negative expression level).
Note:
- interpolation: prediction within the range of the available and observed data
- extrapolation: prediction outside the range of the available and observed data
Statistical inference
To draw conclusions based on the regression model, we need to know:

- how the least-squares parameter estimators vary from sample to sample;
- how they deviate under the null hypothesis that there is no association between predictor and response.
How could we proceed?
Solution
This requires a statistical model.

We need to model the distribution of the response $Y$ conditionally on the predictor $X$.
Assumptions
For the moment, we made the following assumptions:
- There is a linear relationship between the response and the predictor.
Besides linearity we need additional assumptions!
What other assumptions are we going to make?
Solution
- Independence: observations $(x_1, Y_1), \ldots, (x_n, Y_n)$ are made on $n$ independent subjects (which is required to estimate the variance).
- Homoscedasticity (equal variances): observations vary with equal variance around the regression line. The residuals $\epsilon_i$ have equal variance $\sigma^2$ for each individual (and thus for any value $x$ of the predictor); $\sigma$ is referred to as the residual standard deviation.
- Normality: the residuals $\epsilon_i$ are normally distributed with mean 0.
(optional) Modeling the distribution of Y

What can you derive from assumptions 2, 3 and 4?

Solution

The error terms are independent and identically distributed: $\epsilon_i \sim N(0, \sigma^2)$.

And from assumptions 1, 2, 3, 4?

Solution

The conditional distribution of the response is Gaussian: $Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, with the $Y_i$ independent.

Using these assumptions we can derive the variance of the estimators: $\text{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\text{Var}(\hat\beta_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)$.

Using the Gaussian assumption, we can show that the parameter estimators are normally distributed: $\hat\beta_j \sim N(\beta_j, \sigma^2_{\hat\beta_j})$.
(optional) High spread of X improves the precision

Recall the variance of the estimator of the slope: $\text{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$.

What can you say about the two regression examples in the next plot regarding the variance of the estimate $\hat\beta_1$?

Solution

In the first case, the observations for X are widely spread, so the sum of squared deviations from the empirical mean is large and the variance of $\hat\beta_1$ is small: the estimate is precise.

In the second case, the observations for X are not widely spread, so the sum of squared deviations from the empirical mean is small and the variance of $\hat\beta_1$ is large: the estimate is less precise.
How do we estimate the conditional variance ($\sigma^2$)?

Solution

- The estimate of $\sigma^2$ uses the mean squared error (MSE), the average squared error: $\hat\sigma^2 = \text{MSE} = \frac{\sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{n - 2} = \frac{\sum_{i=1}^{n} e_i^2}{n - 2}$.
- This estimator is based on independence (assumption 2) and equality of the variances (assumption 3).
- Note the division by $n - 2$, the number of observations minus the number of estimated mean-model parameters, which makes the estimator unbiased.
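As an illustration, the residual standard error reported by summary(lm1) above (59.91 on 27 degrees of freedom) can be recomputed by hand from this formula:

```r
# Residual standard error: sqrt(RSS / (n - 2))
rss <- sum(residuals(lm1)^2)
sqrt(rss / df.residual(lm1))  # df.residual(lm1) returns n - 2 = 27
```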
Upon the estimation of $\sigma^2$, how can we estimate the standard error of the parameter estimators?

Solution

Upon the estimation of $\sigma^2$ by the MSE, we can estimate the standard error of the slope estimator by plugging the MSE into the variance formula: $\text{SE}(\hat\beta_1) = \sqrt{\frac{\text{MSE}}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$.

Reminder: $\text{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$.
How can we use the previous results regarding the distribution of the estimates to draw conclusions about the population?

Solution

Thanks to the distribution of the parameter estimators, we can construct confidence intervals and tests using the statistic $T = \frac{\hat\beta_1 - \beta_1}{\text{SE}(\hat\beta_1)}$:

- If all assumptions are valid, $T$ follows a Student $t$ distribution with $n - 2$ degrees of freedom.
- Without normality, but with independence, linearity, equal variances and a large dataset, $T$ approximately follows a standard normal distribution (central limit theorem).
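As a sketch, the t value and 95% confidence interval for the slope of lm1 (fitted above) can be recomputed from the coefficient table:

```r
# t statistic and 95% CI for the slope, from the summary() coefficient table
est <- coef(summary(lm1))["ESR1", "Estimate"]
se  <- coef(summary(lm1))["ESR1", "Std. Error"]
est / se                                  # t value reported by summary(lm1)
est + c(-1, 1) * qt(0.975, df = 27) * se  # matches confint(lm1)
```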
Breast cancer example
Reminder:
```r
lm1 <- lm(S100A8 ~ ESR1, data = brcaSubset)
lm1
```

```
Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Coefficients:
(Intercept)         ESR1
  208.47145     -0.05926
```
Generalize the effect in the sample to the population using the confidence intervals on the model parameters.
Solution
```r
confint(lm1)
```

```
                   2.5 %       97.5 %
(Intercept) 149.84639096 267.09649989
ESR1         -0.08412397  -0.03440378
```
(optional) Hypothesis test
How could we statistically test the significance of the effect of ESR1 gene expression on S100A8 gene expression?
Solution
We can restate the hypothesis of an association between the S100A8 and ESR1 gene expression in the form of a null hypothesis on the parameters of the model.

- Under the null hypothesis of the absence of an association in the expression of both genes: $H_0: \beta_1 = 0$.
- Under the alternative hypothesis, there is an association between the expression of both genes: $H_1: \beta_1 \neq 0$.

Test statistic: $T = \frac{\hat\beta_1}{\text{SE}(\hat\beta_1)}$.

Under $H_0$ and all model assumptions, $T$ follows a Student $t$ distribution with $n - 2$ degrees of freedom.
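For illustration, the two-sided p-value reported for the slope can be recomputed from the t statistic and the Student distribution with $n - 2 = 27$ degrees of freedom:

```r
# Two-sided p-value for H0: beta_1 = 0
t_obs <- coef(summary(lm1))["ESR1", "t value"]
2 * pt(-abs(t_obs), df = 27)  # ~4.08e-05, matching the summary() output
```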
Breast cancer example
```r
summary(lm1)
```

```
Call:
lm(formula = S100A8 ~ ESR1, data = brcaSubset)

Residuals:
   Min     1Q Median     3Q    Max
-95.43 -34.81  -6.79  34.23 145.21

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 208.47145   28.57207   7.296 7.56e-08 ***
ESR1         -0.05926    0.01212  -4.891 4.08e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.91 on 27 degrees of freedom
Multiple R-squared:  0.4698,	Adjusted R-squared:  0.4502
F-statistic: 23.93 on 1 and 27 DF,  p-value: 4.078e-05
```
What can you say about the association between the S100A8 and ESR1 gene expression?
Solution
The association between the S100A8 and ESR1 expression is extremely significant ($p = 4.08 \times 10^{-5} < 0.001$).
What did we forget to check?
Solution
- We first have to check all assumptions!
- Otherwise the conclusions based on the statistical test and the CI can be incorrect.
Assumption assessment
What assumptions should we check?
Solution
- Independence: design
- Linearity: inference is useless if the association is not linear
- Homoscedasticity: inference/p-value is incorrect if data are heteroscedastic
- Normality: inference/p-value is incorrect if data are not normally distributed (especially in small samples)
Linearity
```r
brcaSubset %>%
  ggplot(aes(x = ESR1, y = S100A8)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()
```
Residual analysis
The assumption of linearity is typically assessed using a residual plot:

- predictions $\hat{y}_i$ on the $x$-axis
- residuals $e_i$ on the $y$-axis

```r
plot(lm1, which = 1)
```

With the residual plot, we can also check the assumption that the average of the residuals is around 0.
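The same diagnostic can also be drawn with ggplot2; here is a minimal sketch using the fitted lm1 model:

```r
# Residuals-vs-fitted plot, equivalent to plot(lm1, which = 1)
data.frame(fitted = fitted(lm1), resid = residuals(lm1)) %>%
  ggplot(aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw()
```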
Homoskedasticity (equal variances)
Residuals and squared residuals carry information on the residual variability:

- an association with the predictors is an indication of heteroskedasticity;
- scatterplot of $e_i$ vs $x_i$ or vs the predictions $\hat{y}_i$;
- scatterplot of the standardized residuals vs $x_i$ or vs the predictions.

```r
performance::check_heteroskedasticity(lm1)
```

```
Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.009).
```
(caution with small sample size)
Normality
- If the sample size is large, the estimators are normally distributed even if the observations are not normally distributed: central limit theorem.
- How many observations are needed? It depends on the shape and magnitude of the deviations from normality.
- Assumption: data are Normally distributed conditional on X: $Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$.
- QQ-plot of the residuals.
Example of non-Gaussian data and approximately Gaussian residuals:
QQ-plot for our model:
```r
plot(lm1, which = 2)
performance::check_normality(lm1)
```

```
OK: residuals appear as normally distributed (p = 0.212).
```
(caution with small sample size)
Model evaluation
You can use the default R function to get a visual summary about your model residuals:
```r
plot(lm1)
```

or you can use the check_model() function from the performance package:

```r
performance::check_model(lm1)
```
Invalid assumptions?
If the assumptions regarding the model are not verified, what can we do?
Solution
We can either:

- use the fitted model for prediction but without statistical inference (so no test, nor uncertainty quantification); we would then need to rely on geometric model evaluation metrics like the $R^2$ coefficient and the analysis of variance (c.f. below), or on prediction error quantification;
- apply a transformation to the data.
Transformation of the predictor
- It is not useful for obtaining homoscedasticity or a Normal distribution.
- It is useful to improve linearity when normality and homoscedasticity are valid.
- It is also possible to include higher-order terms: $x^2$, $x^3$, …
Transformation of the response
- Typical transformations include $\sqrt{Y}$, $\log(Y)$, $1/Y$, etc.
Breast cancer example
What are the issues with the linear regression model in the case of our breast cancer data?
Solution
Problems with the model on the original scale:

- heteroscedasticity
- possible deviations from normality (skewed to the right)
- negative concentration predictions are theoretically impossible
- non-linearity

This is often the case for concentration and intensity measurements:

- these are often log-normally distributed (normally distributed upon log-transformation);
- we also observed a kind of exponential relation with the smoother;
- in the gene expression literature, a $\log_2$ transformation is often adopted;
- gene expression on the log2 scale: differences on the log scale are fold changes on the original scale!
We go back to the original data, including the outliers that we removed earlier.

Here is a scatterplot of the original data:
```r
brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
  geom_point() +
  geom_smooth() +
  theme_bw()
```
And here is a scatterplot of the log2-transformed data:

```r
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
  geom_point() +
  geom_smooth() +
  theme_bw()
```
Note: the blue line represents a non-linear smoother (the geom_smooth() default) with its confidence band.
What can you say?
Solution
The linear relationship between the S100A8 and ESR1 gene expression is not clear at all on the original measurements, but appears much more clearly on the log2-transformed measurements.
Here is the linear regression model for the transformed gene expression:
```r
lm2 <- lm(log2(S100A8) ~ log2(ESR1), data = brca)
performance::check_model(lm2)
```

```r
performance::check_normality(lm2)
```

```
OK: residuals appear as normally distributed (p = 0.888).
```

```r
performance::check_heteroskedasticity(lm2)
```

```
OK: Error variance appears to be homoscedastic (p = 0.430).
```
```r
summary(lm2)
```

```
Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max
-1.94279 -0.66537  0.08124  0.68468  1.92714

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,	Adjusted R-squared:  0.7874
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12
```

```r
confint(lm2)
```

```
                 2.5 %    97.5 %
(Intercept)  20.128645 26.674023
log2(ESR1)   -1.921047 -1.308185
```
What can you say about it?
Solution
The linearity, normality and homoscedasticity assumptions now seem to be met (at least we cannot reject them).
Exploration of the model with log2 transformed data
How do we interpret the slope in the model with log2-transformed data?
Solution
A patient whose ESR1 expression is one unit higher on the $\log_2$ scale (i.e. whose ESR1 expression is twice as high on the original scale) has on average a $\log_2$ S100A8 expression that is 1.61 units lower.

We note that, for the fitted model, $\log_2 \hat{Y}(2x) - \log_2 \hat{Y}(x) = \hat\beta_1 \left( \log_2(2x) - \log_2(x) \right) = \hat\beta_1 = -1.61$.

And, assuming the model holds, $\frac{\hat{Y}(2x)}{\hat{Y}(x)} = 2^{\hat\beta_1} = 2^{-1.61} \approx 0.33$: a doubling of the ESR1 expression is associated with an S100A8 expression that is roughly 3 times lower (a fold change of about 1/3).
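This fold change can be computed directly from the fitted coefficients of lm2:

```r
# Fold change on the original scale implied by the slope on the log2 scale
2^coef(lm2)[["log2(ESR1)"]]  # ~0.33: doubling of ESR1 ~ 3x lower S100A8
```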
Prediction
We can compute the predicted values $\hat{y}_i$ with the predict() function:

```r
brca$pred_y <- predict(lm2)
```
We can compare the predicted and observed values:
```r
ggplot(brca, aes(x = S100A8, y = pred_y)) +
  geom_point() +
  theme_bw()
```
What is happening? Are the predicted values comparable to the observed S100A8 values?

Solution

The predicted values are in the log-transformed space, whereas the original S100A8 values are on the original scale.
If we log-transform the response S100A8, we can compare it with the predicted values in the same log-transformed space:
```r
ggplot(brca, aes(x = log2(S100A8), y = pred_y)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
  coord_fixed() +
  theme_bw()
```
Note: the dotted line represents the line of equation $y = x$.
We can also draw the regression line in the log-transformed space and represent the predicted values (blue cross):
```r
# extract the coefficients
hat_beta <- coef(lm2)
hat_beta_0 <- hat_beta[1]
hat_beta_1 <- hat_beta[2]

# plot
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
  geom_point() +
  geom_point(aes(y = pred_y), col = "blue", shape = 3) +
  geom_function(
    fun = function(x) return(hat_beta_0 + hat_beta_1 * x),
    col = "red") +
  theme_bw()
```
Back-transformation
We can back-transform the predicted values from the model on the $\log_2$ scale by applying $2^{\hat{y}}$:

```r
brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
  geom_point() +
  geom_point(aes(y = 2^pred_y), col = "blue", shape = 3) +
  geom_function(
    fun = function(x) return(2^(hat_beta_0 + hat_beta_1 * log2(x))),
    col = "red") +
  theme_bw()
```
(optional) Inference on the mean outcome
A regression model can also be used for prediction, and in particular for inference on the average outcome (i.e. the mean response) for a given value $x$ of the predictor:

- $\hat\beta_0 + \hat\beta_1 x$ is an estimator of the conditional mean $E[Y \mid X = x]$;
- the parameter estimators are Normally distributed and unbiased;
- so the estimator $\hat\beta_0 + \hat\beta_1 x$ is also Normally distributed and unbiased.
We can use the predict() function in R to compute the inferred average outcome for the response and the related confidence intervals:

- newdata argument: predictor values (x-values) at which we want to calculate the mean response;
- interval = "confidence" argument to obtain the CI;
- without the newdata argument, predictions are computed for all predictor values in the dataset used to fit the model.
```r
grid <- 140:4000
g <- predict(lm2, newdata = data.frame(ESR1 = grid), interval = "confidence")
head(g)
```

```
       fit      lwr      upr
1 11.89028 10.76082 13.01974
2 11.87370 10.74721 13.00019
3 11.85724 10.73370 12.98078
4 11.84089 10.72028 12.96151
5 11.82466 10.70696 12.94237
6 11.80854 10.69372 12.92336
```
Note: we do not have to transform the new data that we specified for the ESR1 expression, because we fitted the model with a call to the lm() function and specified the transformation within the formula.
We represent the inferred mean outcome (in red) with the confidence interval upper and lower bounds (in grey) in the $\log_2$ space:

```r
newdata <- data.frame(cbind(grid = log2(grid), g))
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
  geom_point() +
  geom_line(aes(x = grid, y = fit), newdata, col = "red") +
  geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
  geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
  theme_bw()
```
Back-transformation
We can back-transform the inferred mean outcome from the model on the $\log_2$ scale:

```r
newdata <- data.frame(cbind(grid, 2^g))
brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
  geom_point() +
  geom_line(aes(x = grid, y = fit), newdata, col = "red") +
  geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
  geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
  theme_bw()
```
(optional) Prediction intervals
We can also make a prediction for the location of a new observation that would be collected in a new experiment for a patient with a particular value of their ESR1 expression.

It is important to notice that this experiment still has to be conducted, so we want to predict the not-yet-observed individual expression value for a novel patient.

For a novel independent observation: $Y^* = \beta_0 + \beta_1 x + \epsilon^*$, with $\epsilon^* \sim N(0, \sigma^2)$ independent of the observations used to fit the model.

We predict a new log2-S100A8 value for a patient with a known log2-ESR1 expression level $x$: $\hat{y}(x) = \hat\beta_0 + \hat\beta_1 x$.
What is the relationship between the mean outcome and the predicted value for the response?
Solution
The estimated mean outcome and the prediction for a new observation are equal.

But their sampling distributions are different!

- Uncertainty on the estimated mean outcome: only the uncertainty on the estimated model parameters $\hat\beta_0$ and $\hat\beta_1$.
- Uncertainty on a new observation: the uncertainty on the estimated mean plus additional uncertainty because the new observation will deviate around the mean!
- Note that a prediction interval (PI) is an improved version of a reference interval when the model parameters are unknown: uncertainty on the model parameters + t-distribution.
```r
pred <- predict(lm2, newdata = data.frame(ESR1 = grid), interval = "prediction")
head(pred)
```

```
       fit      lwr      upr
1 11.89028 9.510524 14.27004
2 11.87370 9.495354 14.25205
3 11.85724 9.480288 14.23419
4 11.84089 9.465324 14.21646
5 11.82466 9.450461 14.19886
6 11.80854 9.435698 14.18138
```
We represent the predicted values (black line) with the prediction interval upper and lower bounds (in blue) and the confidence interval upper and lower bounds for the mean outcome (in grey) in the $\log_2$ space:
```r
newdata <- data.frame(cbind(grid = log2(grid), g))
preddata <- data.frame(cbind(grid = log2(grid), pred))
brca %>% ggplot(aes(x = log2(ESR1), y = log2(S100A8))) +
  geom_point() +
  geom_line(aes(x = grid, y = fit), newdata) +
  geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
  geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
  geom_line(aes(x = grid, y = lwr), preddata, color = "blue") +
  geom_line(aes(x = grid, y = upr), preddata, color = "blue") +
  theme_bw()
```
We can back-transform the predicted values from the model on the $\log_2$ scale:

```r
newdata <- data.frame(cbind(grid, 2^g))
preddata <- data.frame(cbind(grid, 2^pred))
brca %>% ggplot(aes(x = ESR1, y = S100A8)) +
  geom_point() +
  geom_line(aes(x = grid, y = fit), newdata) +
  geom_line(aes(x = grid, y = lwr), newdata, color = "grey") +
  geom_line(aes(x = grid, y = upr), newdata, color = "grey") +
  geom_line(aes(x = grid, y = lwr), preddata, color = "blue") +
  geom_line(aes(x = grid, y = upr), preddata, color = "blue") +
  theme_bw()
```
Sum of squares and ANOVA table
Total sum of squares
What is the total sum of squares?
Solution

$$\text{TSS} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

- TSS can be used to estimate the variance of the marginal distribution of the response.
Sum of squares of the model

$$\text{MSS} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$

It is a measure of the deviation between the predictions on the regression line and the marginal mean of the response.

Another interpretation: the difference between two models.

- Estimated full model: $\hat{g}(x) = \hat\beta_0 + \hat\beta_1 x$
- Estimated model without predictor (only an intercept): $\hat{g}(x) = \hat\beta_0$, where $\hat\beta_0$ will be equal to $\bar{Y}$.

MSS measures the size of the effect of the predictor.
Sum of squares of the residuals

$$\text{RSS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$$

The smaller the RSS, the better the fit: this is the least-squares method!
Sum of squares decomposition
We can show that the TSS can be decomposed as: $\text{TSS} = \text{MSS} + \text{RSS}$.
- Total variability in the data (TSS) is partially explained by the predictor (MSS).
- Variability that we cannot explain with the regression model is the residual variability (RSS).
Determination coefficient

$R^2 = \frac{\text{MSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$: the fraction of the total variability of the sample outcomes explained by the model.

- A large $R^2$ indicates that the model has the potential to make good predictions (small RSS).
- It is not very indicative of the p-value of the test $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$:
  - the p-value is largely determined by the RSS and the sample size $n$, but not by the TSS;
  - $R^2$ is determined by the RSS and TSS, but not by the sample size $n$.
- A model with a low $R^2$ is still useful to study associations, as long as the association is modeled correctly!
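These quantities can be computed by hand for the lm2 model (fitted above), to verify the decomposition and the $R^2$ reported by summary(lm2):

```r
# Sum-of-squares decomposition and R^2 for lm2
y   <- log2(brca$S100A8)
tss <- sum((y - mean(y))^2)
mss <- sum((fitted(lm2) - mean(y))^2)
rss <- sum(residuals(lm2)^2)
c(TSS = tss, MSS_plus_RSS = mss + rss)  # TSS = MSS + RSS
mss / tss                               # R^2, matches summary(lm2): 0.7942
```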
Breast cancer example
```r
summary(lm2)
```

```
Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max
-1.94279 -0.66537  0.08124  0.68468  1.92714

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,	Adjusted R-squared:  0.7874
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12
```
(optional) F-Test in simple linear model
- Sums of squares are the basis for $F$-tests:

$$F = \frac{\text{MSS}/1}{\text{RSS}/(n-2)} = \frac{\text{MMS}}{\text{RMS}}$$

with MMS the model mean of squares and RMS the residual mean of squares; the denominators 1 and $n - 2$ are the degrees of freedom of the MSS and the RSS.

- Under $H_0: \beta_1 = 0$ and all model assumptions, the statistic follows an F distribution with 1 and $n - 2$ degrees of freedom: $F \sim F_{1, n-2}$.
- The F-test is always two-sided!
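As a sketch, the F statistic and its p-value can be recomputed by hand for lm2 and compared with the summary output below:

```r
# F statistic by hand: (MSS / 1) / (RSS / (n - 2))
y   <- log2(brca$S100A8)
mss <- sum((fitted(lm2) - mean(y))^2)
rss <- sum(residuals(lm2)^2)
f   <- (mss / 1) / (rss / df.residual(lm2))
f                                               # ~115.8
pf(f, 1, df.residual(lm2), lower.tail = FALSE)  # p-value ~8.07e-12
```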
```r
summary(lm2)
```

```
Call:
lm(formula = log2(S100A8) ~ log2(ESR1), data = brca)

Residuals:
     Min       1Q   Median       3Q      Max
-1.94279 -0.66537  0.08124  0.68468  1.92714

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   23.401      1.603   14.60 3.57e-15 ***
log2(ESR1)    -1.615      0.150  -10.76 8.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.026 on 30 degrees of freedom
Multiple R-squared:  0.7942,	Adjusted R-squared:  0.7874
F-statistic: 115.8 on 1 and 30 DF,  p-value: 8.07e-12
```
Anova Table
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Regression | 1 (df of MSS) | MSS | MMS = MSS/1 | f = MMS/RMS | p-value |
| Error | n − 2 (df of RSS) | RSS | RMS = RSS/(n − 2) | | |
```r
anova(lm2)
```

```
Analysis of Variance Table

Response: log2(S100A8)
           Df  Sum Sq Mean Sq F value  Pr(>F)
log2(ESR1)  1 121.814 121.814   115.8 8.07e-12 ***
Residuals  30  31.559   1.052
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Exercises
Exercise 1
Bivariate data relating the number of classroom absences and the number of points earned prior to the final exam in the course were gathered from a random sample of 15 students.
Here is the table gathering the collected data:
| absences | points |
|---|---|
| 1 | 552 |
| 9 | 210 |
| 6 | 315 |
| 4 | 454 |
| 0 | 619 |
| 5 | 378 |
| 13 | 108 |
| 2 | 534 |
| 5 | 398 |
| 3 | 462 |
| 0 | 593 |
| 1 | 645 |
| 0 | 602 |
| 7 | 315 |
| 10 | 307 |
Here is the scatter plot representing the data:
We also give the following quantities (with $x_i$ the number of absences and $y_i$ the number of points earned by student $i$):
What is the correlation between the number of absences and the number of points earned by the students?
From your answer to the previous question and from the scatter plot figures, what could be the link between the number of absences and the number of points earned by the students?
How would you model this relationship?
Can you compute the estimation of the parameters in this model?
What assumptions are generally done in this context?
How many points would a student with 7 absences probably earn in the course? And a student with 20 absences? Do you need any of the previous assumptions to make these predictions?
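If you want to check your answers with R, here is one possible approach; ex1 is just an arbitrarily named data frame re-entering the table above:

```r
# Exercise 1 in R (ex1 re-enters the table above)
ex1 <- data.frame(
  absences = c(1, 9, 6, 4, 0, 5, 13, 2, 5, 3, 0, 1, 0, 7, 10),
  points   = c(552, 210, 315, 454, 619, 378, 108, 534, 398,
               462, 593, 645, 602, 315, 307)
)
cor(ex1$absences, ex1$points)             # correlation
fit <- lm(points ~ absences, data = ex1)  # linear model
coef(fit)                                 # parameter estimates
predict(fit, newdata = data.frame(absences = c(7, 20)))  # predictions
```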
Exercise 2
Physicians feel that there is a relationship between a person's age and the number of days they are sick each year. A random sample of 20 patients was obtained to this end, comparing age and number of sick days.
Here is the table gathering the collected data:
| age | sickDays |
|---|---|
| 12 | 0 |
| 19 | 2 |
| 35 | 8 |
| 4 | 10 |
| 68 | 21 |
| 42 | 16 |
| 21 | 18 |
| 24 | 24 |
| 32 | 29 |
| 75 | 15 |
| 14 | 3 |
| 8 | 7 |
| 25 | 12 |
| 39 | 17 |
| 56 | 15 |
| 18 | 11 |
| 72 | 31 |
| 64 | 26 |
| 45 | 20 |
| 87 | 30 |
We also give the following quantities (with $x_i$ the age and $y_i$ the number of sick days of patient $i$):
We fit a linear model and we get the following result:
Here is the scatter plot representing the data and we also represent the fitted linear model:
What can you say about this situation? How is the linear model suitable to investigate the physicians' hypothesis?
How was the linear model fitted?
What can you say about this model?
How many days in a year would you predict that a 60 year old person will be sick?
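As for Exercise 1, here is one possible way to explore these questions in R; ex2 is just an arbitrarily named data frame re-entering the table above:

```r
# Exercise 2 in R (ex2 re-enters the table above)
ex2 <- data.frame(
  age      = c(12, 19, 35, 4, 68, 42, 21, 24, 32, 75,
               14, 8, 25, 39, 56, 18, 72, 64, 45, 87),
  sickDays = c(0, 2, 8, 10, 21, 16, 18, 24, 29, 15,
               3, 7, 12, 17, 15, 11, 31, 26, 20, 30)
)
fit2 <- lm(sickDays ~ age, data = ex2)
summary(fit2)                                  # fitted model and inference
predict(fit2, newdata = data.frame(age = 60))  # prediction at age 60
```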