install.packages("tidyverse") # to manipulate data and make plot
install.packages("glmnet") # to train Lasso model
install.packages("pheatmap") # to draw heatmap
install.packages("rsample") # to do sub-sampling
8 Practice: Introduction to model selection and regularization
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
8.1 Introduction
Following last session, we will again focus on linear models, that are widely used in data analysis to relate variables and discover relationships between quantitative and/or qualitative measurements. Linear models can be used for modeling, prediction, data normalization, factor/variable effect on an outcome detection or quantification, etc.
We will now question the selection of relevant predictors among a large number of variables in order to predict or explain a response variable, i.e. implement model selection procedures. In particular, we will consider regularization approach for variable selection, but also model performance evaluation procedures.
Again, we will study the possible associations between the genotype (containing 1000s of SNPs) of yeast colonies and some morpholocigal traits directly measured on the yeast through image analysis. We will use data from:
[1] Nogami S, Ohya Y, Yvert G (2007) Genetic Complexity and Quantitative Trait Loci Mapping of Yeast Morphological Traits . PLOS Genetics 3(2): e31. https://doi.org/10.1371/journal.pgen.0030031
8.2 Requirements
We will need the following packages. If not already done, you need to install the following packages:
Now, we can load them:
library(tidyverse) # to manipulate data and make plot
library(glmnet) # to train Lasso model
library(pheatmap) # to draw heatmap
library(rsample) # to do sub-sampling
8.3 Loading the data
We will use data stored in the practical_c.RData
file.
<- load(url("https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/practical_c.Rdata"), verbose = T) data_list
We will focus on the morphological data averaged by strain and cell cycle, i.e. the data frame yeast_av_data
. For each strain, we will consider the morphological measure A101
and the corresponding genotype. Eventually, we will also restrain our study to cell phase "C"
.
We extract the corresponding data, i.e. only strain in "C"
phase, only A101
variable and all SNPs (given by gt_data$RQTL_name
), and we also remove SNPs with missing values (encoded NA
) in some strains:
<- yeast_av_data %>%
yeast_av_data_C_stage filter(cell_cycle == "C") %>% # keep C cell cycle phase
select(any_of(gt_data$RQTL_name), A101) %>% # select A101 and SNP variables
select_if(~ !any(is.na(.))) # remove SNP with missing values
How many observations and variables do we have in this data subset?
Solution
dim(yeast_av_data_C_stage)
[1] 64 1546
We have 64 observations, and 1546 variables, i.e. one morphological traits and 1545 SNPs.
What could be the issue?
Solution
We have far more variables than observations, which is unknown as “high dimensional data”, and which can lead to several challenges and cause several issues during the analysis.
8.4 Linear model
What kind of linear model could we consider to question the relationship between the morphological trait A101
and all the SNPs in our yeast genotypes?
Solution
We want to consider a linear model to relate the morphological trait as a response and the SNP variables as predictors.
- We could build one univariate linear model per SNP: \[Y_i = \beta_0 + X_{ij}\beta_j + \varepsilon_i\] for each SNP \(j\), where \(Y_i\) is the value observed for the response on individual \(i\) and \(X_{ij}\) is the value (0 or 1) observed for SNP \(j\) on individual \(i\) (reminder: here individuals are yeast strain), \(\beta_0\) is the intercept and \(\beta_j\) is the slope for SNP \(j\), and \(\varepsilon_i\) the error term for individual \(i\).
BUT it will we cumbersome to fit 1545 models.
- We will prefer to consider a single multivariate linear model accouting for all \(p\) SNPs: \[Y_i = \beta_0 + \left( \sum_{j=1}^p X_{ij}\beta_j \right) + \varepsilon_i\] The model can also be noted \[Y_i = \beta_0 + \mathbf x_{i}\boldsymbol\beta + \varepsilon_i\] where \(\mathbf x_i = [x_{i1},\dots,x_{ij},\dots,x_{ip}]\) is the vector of dimension \(p\) gathering observed values for all SNPs in individual \(i\), and \(\boldsymbol\beta = [\beta_{1},\dots,\beta_{j},\dots,\beta_{p}]^t\) is the vector of size \(p\) of linear coefficients, i.e. the parameters of the model that we want to estimate.
We will use the data frame yeast_av_data_C_stage
and the functionm lm()
to define the previous model.
<- lm(A101 ~ ., data = yeast_av_data_C_stage) lin_mod
What seems to be the problem? What can we do?
Solution
The lm
function returns an error and cannot fit the model. Since we know the closed-form formula to estimate the \(\boldsymbol\beta\) coefficient, we could try to compute them.
What is the formula to compute the estimation of the \(\boldsymbol\beta\) coefficients in the model \(Y_i = \beta_0 + \mathbf x_{i}\boldsymbol\beta + \varepsilon_i\) using a least square regression approach?
Solution
Reminder: The least square regression approach is based on the minimization of the squared error between the observed and fitted values to estimate the model coefficients:
\[\hat{\boldsymbol\beta} = \underset{\boldsymbol\beta}{\text{argmin}}\ \sum_{i=1}^n \left( Y_i - \mathbf x_i\boldsymbol\beta \right)^2\] Note: we consider \(\mathbf x_i = [1, x_{i1},\dots,x_{ij},\dots,x_{ip}]\) and \(\boldsymbol\beta = [\beta_{0}, \beta_{1},\dots,\beta_{j},\dots,\beta_{p}]^t\) to account for the intercept in the scalar product \(\mathbf x_i\boldsymbol\beta\).
The solution is given by: \[\hat{\boldsymbol\beta} = (\mathbf X^t \mathbf X)^{-1}\mathbf X^t\mathbf y\] where \(\mathbf X = [x_{ij}]_{i=1:n}^{j=1:p}\) is the \(n \times p\) matrix collecting the observations in rows and SNP variables in column, i.e. with the value \(x_{ij}\) for SNP \(j\) in individual \(i\) at row \(i\)/column \(j\) intersection, and \(\mathbf y = [y_i]_{i=1:n}\) the vector of size \(n\) collecting the observed values for response, i.e. the morphologial trait.
Compute the coefficients estimates in our linear model.
Solution
We extract the X matrix and add a column of 1:
<- yeast_av_data_C_stage %>% select(!A101) %>% as.matrix()
X <- cbind(1, X)
Xtilde str(Xtilde)
num [1:64, 1:1546] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:1546] "" "YAL069W_1" "YAL069W_2" "NAL013C_3" ...
1:10,1:5] Xtilde[
YAL069W_1 YAL069W_2 NAL013C_3 NAL013C_4
[1,] 1 0 0 0 0
[2,] 1 1 1 1 1
[3,] 1 0 0 0 0
[4,] 1 0 0 0 0
[5,] 1 0 0 0 0
[6,] 1 1 1 1 1
[7,] 1 1 1 1 1
[8,] 1 1 1 1 1
[9,] 1 1 1 1 1
[10,] 1 1 1 1 1
We compute \((\mathbf X^t \mathbf X)^{-1}\):
solve(t(Xtilde) %*% Xtilde)
Error in solve.default(t(Xtilde) %*% Xtilde): Lapack routine dgesv: system is exactly singular: U[4,4] = 0
What is happening?
Solution
We have some correlated variables which explains
We have a numerical issue preventing us from fitting the model (i.e. estimating the \(\boldsymbol\beta\) coefficients). It is related to the dimension of our data (and the fact that we have more variables than observations).
Here is the correlation matrix between SNPs, i.e. columns from \(\mathbf X\):
pheatmap(
cor(X),
scale = "none", cluster_rows = FALSE, cluster_cols = FALSE,
labels_row = rep("", ncol(X)), labels_col = rep("", ncol(X)),
legend_breaks = c(-1, 0, 1)
)
Could you guess what could be a problem?
Solution
In high dimensional settings, we generally have correlated variables which is a problem when trying to invert the matrix \(\mathbf X^t \mathbf X\).
What can we do to fix this issue?
Solution
We can try to directly fix the numerical issue with a regularization procedure for the optimization problem that we are solving, i.e. so that \(\mathbf X^t \mathbf X\) become invertible.
We can try to reduce the number of SNPs that we consider in our model, so that it can be fitted. One possible way to do the latter is to apply a variable selection procedure, some of which are also based on a dedicated regularization procedure for the optimization problem that we are solving.
8.5 Getting started with GLMnet
In the following we will use the glmnet
package to perform regularized regression.
Here is an introduction about to use it.
To do so, we need first to explicitly define the \(\mathbf X\) and \(\mathbf Y\) matrices, respectively containing the observed values for all the SNPs and for the morphological trait in all individuals:
<- yeast_av_data_C_stage %>% select(!A101) %>% as.matrix()
X dim(X)
[1] 64 1545
<- yeast_av_data_C_stage %>% select(A101) %>% as.matrix()
Y dim(Y)
[1] 64 1
To fit a regularized linear model with the glmnet
package, we can do:
<- glmnet(X, Y) fit
We will learn in the next sections how to modify the settings to fit the model and how to use the fitted model.
8.6 Ridge regularization
First, we will try to find a way to find an estimation of the \(\boldsymbol\beta\) coefficients in the linear model accounting for all SNPs, i.e. \(Y_i = \beta_0 + \mathbf x_{i}\boldsymbol\beta + \varepsilon_i\), despite the numerical issue during the least square regression optimization procedure when trying to invert the matrix \(\mathbf X^t \mathbf X\).
What can we do?
Solution
We could try to find a way to make the \(\mathbf X^t \mathbf X\) invertible.
What method can we use?
Solution
We will use a Ridge regression based on a \(\ell_2\) regularization of the optimization problem.
We add a \(\ell_2\) penalty in the least square regression optimization problem, so that the penalized problem becomes:
\[\hat{\boldsymbol\beta}_r = \underset{\boldsymbol\beta}{\text{argmin}}\ \left\{ \sum_{i=1}^n \left( Y_i - \mathbf x_i\boldsymbol\beta \right)^2 \right\} + \lambda\, \Vert \boldsymbol\beta \Vert_2^{\,2}\]
where \(\Vert \boldsymbol\beta \Vert_2^{\,2} = \sum_j \vert \beta_j \vert^2\) is the \(\ell_2\) norm, and \(\lambda\geq 0\) a penalty parameter.
The solution of the Ridge regression problem is given by:
\[\hat{\boldsymbol\beta}_r = (\mathbf X^t \mathbf X + \lambda\, \text{I}_p)^{-1}\mathbf X^t\mathbf y\] where \(\text{I}_p\) is the \(p \times p\) identity matrix (with 1 on the diagonal and - elsewhere).
What does the Ridge regularization do?
Solution
By adding a \(\lambda\) quantity on the diagonal of the \(\mathbf X^t \mathbf X\), it becomes invertible and we can compute the estimation for the coefficient in our model.
Compute \(\boldsymbol\beta_r\) for \(\lambda = 1\)?
Solution
<- 1
lambda
<- solve(t(Xtilde) %*% Xtilde +
hat_beta_ridge * diag(1, ncol(Xtilde))) %*% t(Xtilde) %*% Y lambda
We can also use the glmnet
package to fit the Ridge regression by setting the alpha
parameter to 0. We can use the lambda
parameter to set the value of \(\lambda\)1
<- glmnet(X, Y, alpha = 0, lambda = 1) ridge_mod
We represent the values of the coefficients in the Ridge regression model depending on the SNP index.
plot(coef(ridge_mod)[-1])
We are able to estimate the 1545 coefficients in the model. However, what could be the issue regarding the model interpretability?
Solution
We still have numerous predictors (i.e. SNPs here) which makes the model difficult to interpret, especially to assess which ones are relevant to explain the response and which are not.
What can we do?
Solution
We can try to select relevant predictors (here SNPs) that are related to the response and drop the non pertinent ones.
Based on a parsimony (a.k.a. sparsity) hypothesis, we can assume that among the thousand SNPs that we are considering, not all will be useful to explain the response and we should not consider all of them.
8.7 How to select variables
Waht could we do to select variables that we will consider in our model?
Solution
- We could build several models that integrate or not each variable and select the “best one”.
- how to compare models?
- \(2^p\) models to consider (here \(\gg2^{1000}\) models!!!)
- We will prefer to use, for the model considering all SNPs, a regularization method, based on a penalized optimization problem, that will directly select relevant predictors.
8.8 Lasso regularization
What is the Lasso regression?
Solution
The Lasso regression based on a \(\ell_1\) regularization of the optimization problem.
We add a \(\ell_1\) penalty in the least square regression optimization problem, so that the penalized problem becomes:
\[\hat{\boldsymbol\beta}_l = \underset{\boldsymbol\beta}{\text{argmin}}\ \left\{ \sum_{i=1}^n \left( Y_i - \mathbf x_i\boldsymbol\beta \right)^2 \right\} + \lambda\, \Vert \boldsymbol\beta \Vert_1\]
where \(\Vert \boldsymbol\beta \Vert_1 = \sum_j \vert \beta_j \vert\) is the \(\ell_1\) norm, and \(\lambda\geq 0\) a penalty parameter.
What is the effect on the estimation of the \(\boldsymbol\lambda\) coefficient when using the Lasso?
Solution
Some coefficients automatically set to 0 (c.f. below).
8.8.1 Training the model
To fit a Lasso linear model with the glmnet
package, we can do:
<- glmnet(X, Y) fit
We can use the lambda
parameter to specify a given value for \(\lambda\).
We can use the coef()
function to collect the estimated coefficients for the fitted model.
Train the Lasso model for \(\lambda = 1\) and for \(\lambda = 0.001\).
Solution
We train the Lasso model for a given \(\lambda = 1\):
<- glmnet(X, Y, lambda = 1) fit
Here are the values of the estimated coefficients2:
plot(coef(fit)[-1])
We use a smaller value of \(\lambda\) and train the Lasso model for a given \(\lambda = 0.001\):
<- glmnet(X, Y, lambda = 0.001) fit
Here are the value of the estimated coefficients
plot(coef(fit)[-1])
What can we say about this?
Solution
For \(\lambda = 1\), all coefficients are set to 0, the penalty is too strong.
For \(\lambda = 0.001\), many coefficients are set to 0, and some are not. The non null coefficients corresponds to selected SNPs.
In practice, we do not set a given value for \(\lambda\) when calling the glmnet()
function. We let it compute a sequence of candidate values for \(\lambda\) and train the model for this different values of \(\lambda\), hence we will get a sequence of several fitted models, one for each candidate value for \(\lambda\).
To do so, we do:
<- glmnet(X, Y) fit
The glmnet()
function trains different models with 100 different lambda values. Here are the lambda values that are used in this case:
$lambda fit
[1] 0.0103321581 0.0098625456 0.0094142776 0.0089863842 0.0085779392
[6] 0.0081880586 0.0078158987 0.0074606540 0.0071215558 0.0067978701
[11] 0.0064888964 0.0061939660 0.0059124407 0.0056437112 0.0053871958
[16] 0.0051423394 0.0049086122 0.0046855082 0.0044725446 0.0042692605
[21] 0.0040752161 0.0038899912 0.0037131851 0.0035444151 0.0033833160
[26] 0.0032295390 0.0030827515 0.0029426357 0.0028088883 0.0026812200
[31] 0.0025593545 0.0024430278 0.0023319885 0.0022259960 0.0021248210
[36] 0.0020282446 0.0019360578 0.0018480610 0.0017640637 0.0016838843
[41] 0.0016073492 0.0015342927 0.0014645567 0.0013979904 0.0013344495
[46] 0.0012737967 0.0012159007 0.0011606361 0.0011078834 0.0010575284
[51] 0.0010094621 0.0009635805 0.0009197843 0.0008779787 0.0008380732
[56] 0.0007999815 0.0007636211 0.0007289133 0.0006957831 0.0006641587
[61] 0.0006339716 0.0006051567 0.0005776513 0.0005513962 0.0005263344
[66] 0.0005024117 0.0004795763 0.0004577788 0.0004369721 0.0004171110
[71] 0.0003981527 0.0003800560 0.0003627819 0.0003462929 0.0003305533
[76] 0.0003155292 0.0003011879 0.0002874984 0.0002744312 0.0002619579
[81] 0.0002500515 0.0002386863 0.0002278376 0.0002174820 0.0002075971
[86] 0.0001981615 0.0001891548 0.0001805574 0.0001723508 0.0001645172
[91] 0.0001570396 0.0001499019 0.0001430886 0.0001365850 0.0001303770
[96] 0.0001244512 0.0001187947 0.0001133953 0.0001082413 0.0001033216
The coef()
function then returns a matrix with the value of the coefficients for each variable in row and each value of \(\lambda\) in columns:
coef(fit)[1:10,1:5]
10 x 5 sparse Matrix of class "dgCMatrix"
s0 s1 s2 s3 s4
(Intercept) 0.2236754 0.2240095 0.2239531 0.2238992 0.2238478
YAL069W_1 . . . . .
YAL069W_2 . . . . .
NAL013C_3 . . . . .
NAL013C_4 . . . . .
NAL013C_5 . . . . .
NAL013C_6 . . . . .
NAL013C_7 . . . . .
NAL013C_8 . . . . .
NAL013C_9 . . . . .
8.8.2 Regularatization path
Here is the regularization path, i.e. the plot of the estimated value for each coefficient (one line per SNP) depending on the value of \(\lambda\) across the different fitted models (the number at the top quantifies the number of non null coefficients for the corresponding value of \(\lambda\)):
plot(fit, xvar = "lambda")
What is the impact of \(\lambda\) on the coefficient estimations?
Solution
As \(\lambda\) increases, the penalty on each coefficient becomes stronger, and all coefficients are driven towards 0.
We can count the number of non-null coefficients in the fitted model for each value of \(\lambda\) and plot it depending on \(\lambda\):
<- apply(
non_null_count coef(fit)[-1,], # we remove the line corresponding to the intercept
2, # we apply the count on each column
function(col) sum(col!=0))
plot(log(fit$lambda), non_null_count, xlab = "log(Lambda)")
What can say about this plot?
Solution
It confirms the point made with the regularization path plot, as \(\lambda\) increases, more and more coefficients are set to 0, i.e. the number of non-null coefficients decreases.
How can we manage this sequence of candidate values? Can we choose one? Which one?
Solution
We have trained numerous different models, one for each candidate value for \(\lambda\). In practice we need to choose which value of \(\lambda\) we are going to use, because the null coefficients, i.e. corresponding to variables that are discarded from the model, are not the same depending on \(\lambda\).
8.9 Model selection
Choosing a \(\lambda\) value when fitting a Lasso regression model is equivalent to choosing a model among several candidate models, one for each value of \(\lambda\), which call for a model selection procedure.
What does it mean that a coefficient of a variable (i.e. a SNP in our case) is set to 0 or not in the fitted model? (depending on the value of \(\lambda\))
Solution
If the coefficient of a variable is set to 0 in the fitted model, then the corresponding variable does not contribute to the model, and can be considered as non pertinent to explain/predict the response.
Why the Lasso regression can be viewed as a variable selection procedure?
Solution
The variable with non null coefficients in the fitted model are the only ones that contribute to the model, to explain/predict the response. In that sense, the Lasso regression is a variable selection procedure.
In our context of Lasso regression, the different models differs by the coefficients that are set to 0 or not, i.e. by which variables are selected or not to contribute in the model, depending on the value of the penalty parameter \(\lambda\).
In that case, choosing the “best” value for \(\lambda\) corresponds to choosing the best model to fit our data. Hence, we need a model selection procedure.
How can we compare different models? On which criterion can we define the “best” one, i.e. the one that we will choose?
Solution
Since we have a model relating a response \(y\in\mathbb R\) (here the morphological trait) to a set of predictors \(\mathbf x\in\mathbb R^p\) (here the SNPs), we can use the fitted model to predict the value for the response corresponding to the observed values for the predictors.
Then, we can compare the average difference between predicted and observed values for each model, i.e. compute an error of prediction for each model.
Eventually, we could choose the model with the lowest prediction error.
There exist other criterion that can be used to compare model, e.g. based on geometric properties, such as the decomposition of variance, or based on probabilistic assumption (and using penalized (log-)likelihood), such the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). We will not use those here.
More below in the “Tuning lambda” section on this matter.
8.10 Generalization ability
We can use the predict()
function to computed the response values predicted by the fitted model for given observations of the predictors.
<- predict(fit, newx = X) y_pred
We get a matrix with predictions for each individual in newx
input (in rows) and for each candidate value for \(\lambda\) (in columns):
1:10,1:5] y_pred[
s0 s1 s2 s3 s4
[1,] 0.2236754 0.2231958 0.2231297 0.2230667 0.2230065
[2,] 0.2236754 0.2230775 0.2222374 0.2214355 0.2206700
[3,] 0.2236754 0.2230775 0.2222374 0.2214355 0.2206700
[4,] 0.2236754 0.2241277 0.2248454 0.2255305 0.2261844
[5,] 0.2236754 0.2230775 0.2222374 0.2214355 0.2206700
[6,] 0.2236754 0.2240095 0.2239531 0.2238992 0.2238478
[7,] 0.2236754 0.2230775 0.2222374 0.2214355 0.2206700
[8,] 0.2236754 0.2231958 0.2231297 0.2230667 0.2230065
[9,] 0.2236754 0.2241277 0.2248454 0.2255305 0.2261844
[10,] 0.2236754 0.2230775 0.2222374 0.2214355 0.2206700
In general, we are studying a phenomenon/characteristic/property of an entire population of individuals, not just of individuals that have been sampled and for which we have observations for the quantities of interest.
We rather aim at inferring general properties/characteristics in the entire population from the observed sample.
How can we process to be sure that the properties/characteristics that we derive from the model that we are using, built with the observed sample, will be consistent with what is happening in the entire population?
Solution
We need a way to quantify that the properties/characteristics of our model fitted on the observed sample will correspond to the properties/characteristics of the general population.
To do so, we can build statistical models based on probabilistic tools (like the Gaussian linear model and the ANOVA) to quantify the uncertainty of our model.
Or we can imagine a process that will evaluate the the ability of our fitted model to generalize to unknown individuals.
Here, we question the ability of our model to generalize from the observed sample to unobserved individual in the entire population.
Imagine a procedure to evaluate the the ability of our fitted model to generalize to unknown individuals?
Solution
We could split our data (= observed sample) in two separate set:
- the training set that we will be used to fit/train the model.
- the test set that will NOT be used in the training phase, but for which we will compute the predicted values for the response according to our fitted model and compute a prediction error.
In that way, the prediction error on the test set will quantify if our fitted model is able to generalize to observations that are unknown.
What could be the issue when computing the prediction with the same data used to train the model?
Solution
We could have a model which is very good to do prediction on the training set but not suited to generalize to the entire population, hence performs poorly in prediction on the test set (c.f. below).
Here is the procedure:
- We split our data in a training set and a test set:
# data partition
<- initial_split(yeast_av_data_C_stage, prop = 3/4)
rand_split # train set
<- training(rand_split)
train_set
<- train_set %>% select(!A101) %>% as.matrix()
X_train <- train_set %>% select(A101) %>% as.matrix()
Y_train
# test set
<- testing(rand_split)
test_set
<- test_set %>% select(!A101) %>% as.matrix()
X_test <- test_set %>% select(A101) %>% as.matrix() Y_test
- We fit/train the model using only the training set:
<- glmnet(X_train,Y_train) fit
- We compute the predicted values for the response, both on the training set and the test set:
<- predict(fit, newx = X_train)
Y_train_pred
<- predict(fit, newx = X_test) Y_test_pred
- Using the following function, we compute the mean squared error (MSE) of prediction, both on the train set and the test set:
<- function(pred, truth) {
mse return(mean((pred-truth)^2))
}
<- apply(Y_train_pred, 2, mse, truth = Y_train)
error_train_tab
<- apply(Y_test_pred, 2, mse, truth = Y_test) error_test_tab
We have a predicted values for each observations and each value of \(\lambda\) since the glmnet()
function trains several Lasso model for different value of \(\lambda\).
- We plot the prediction error on the training set and on the test set (here depending on \(\lambda\)).
<- tibble(
error_tab error = c(error_train_tab, error_test_tab),
lambda = rep(fit$lambda, times = 2),
status = rep(c("train", "test"), each = length(fit$lambda)))
ggplot(
error_tab, aes(x = log(lambda), y = error, shape = status, color = status)) +
geom_point() + geom_line() +
xlab("log(Lambda)") + ylab("Prediction error") +
theme_bw()
What can we say about the different fitted models (one per \(\lambda\) value in this case?
Solution
The model achieving the best performance (regarding to prediction error) is not the same on the training set and the test set.
Regarding the training set, the prediction error increases with \(\lambda\).
Regarding the test set, the prediction error decreases for increasing small values of \(\lambda\) then reach a minimum and increases for large \(\lambda\) values.
What can you about the difference between the prediction error on the training set and on the test set when \(\lambda\) is small (left part of the plot)?
Solution
The prediction error on the training set is very will while the prediction error on the test set is very high, meaning that models fitted with small \(\lambda\) values do not generalize well to observations that were not used for the training.
This phenomenon is called over-fitting. The model is too specific for the training set but not suitable outside of it.
Imagine another application where this training/test split and prediction error computation procedure might be handy?
Solution
Imagine we want to select yeast strain with the highest growth rate. Based on the model relating the morphological trait measure and the genotype in the 64 yeast strains that we considered, we could genotype several other yeast strains and predict the value for the morphological trait that we could expect based on the genotype (without doing the experiment of measuring it).
Then, we would focus on just a few strains with the highest prediction for the morphological trait and run the experiment of growing them and actually measuring the morphological trait, which would save some time and money (instead of doing it on all yeast strains that we might consider).
The prediction task is at the core of many application of what we call machine learning. In this particular cases, the train/test split and prediction error computation procedure is at the core of model training and selection.
8.11 Tuning lambda
We still need to choose which model we are going to use (i.e. choose the value for \(\lambda\)), in order to know which SNPs are selected when relating them to the morphological trait.
This is called hyper-parameter tuning or calibration.
Using the previous graph representing the prediction error on the training and test set depending on \(\lambda\), can we choose which model we consider the “best” one?
Solution
Considering the generalization ability question, we could choose the model for which the prediction error on the test set is the smallest.
Here is another repetion of the training/test split procedure:
# data partition
<- initial_split(yeast_av_data_C_stage, prop = 3/4)
rand_split # train set
<- training(rand_split)
train_set
<- train_set %>% select(!A101) %>% as.matrix()
X_train <- train_set %>% select(A101) %>% as.matrix()
Y_train
# test set
<- testing(rand_split)
test_set
<- test_set %>% select(!A101) %>% as.matrix()
X_test <- test_set %>% select(A101) %>% as.matrix()
Y_test
# fit the model
<- glmnet(X_train,Y_train)
fit
# prediction and error
<- predict(fit, newx = X_train)
Y_train_pred <- apply(Y_train_pred, 2, mse, truth = Y_train)
error_train_tab
<- predict(fit, newx = X_test)
Y_test_pred <- apply(Y_test_pred, 2, mse, truth = Y_test)
error_test_tab
# graph
<- tibble(
error_tab error = c(error_train_tab, error_test_tab),
lambda = rep(fit$lambda, times = 2),
status = rep(c("train", "test"), each = length(fit$lambda)))
ggplot(
error_tab, aes(x = log(lambda), y = error, shape = status, color = status)) +
geom_point() + geom_line() +
xlab("log(Lambda)") + ylab("Prediction error") +
theme_bw()
What can we say about this result?
Solution
The profile of the prediction error on the training set depending on \(\lambda\) seems to be consistent between the different sampling.
However, the profile of the prediction error on the test set depending on \(\lambda\) is quite different between the two sampling, and the best model in term of prediction is achevied for a \(\lambda\) value quite different.
What is the issue here?
Solution
The lambda calibration procedure, i.e. the model selection procedure, is very dependent on the initial train/test random split. Hence, the number of SNPs that will be selected is heavily dependent on the original sampling.
We need to implement a procedure to mitigate this dependence since the train/test split is done randomly. We do not know which splitting that we should choose to calibrate \(\lambda\), and we do not want to choose a particular one.
Which procedure could we use to reduce the dependence of the hyper-parameter calibration on the train/test random split?
Solution
We could repeat several times the splitting procedure to compute an average prediction error across several repetition depending on \(\lambda\).
This procedure is called cross-validation.
We are going to keep the test set and do several splittings on the training set itself, that will be split into an actual training set used for the training and a validation set used to compute the prediction error to compare the different models and choose the \(\lambda\) value.
Why are we considering a hierarchical split of the data?
- data used for training: repeat train set/validation set split
- test set
Solution
Since the hyper-parameter calibration is part of the training procedure, we are going to keep the test set for downstream generalization ability quantification for the tuned model3 (i.e. the one for which we have chosen the “best” value of \(\lambda\)).
Imagine different strategies to repeat the train set/validation set split during the cross-cvalidation procedures?
Solution
- We could repeat random sub-sampling (without any link between the different repetitions), this is called Monte-Carlo cross-validation.
- If we have \(n_{\text{train}}\) observations to do the training (not counting the test set), we could train \(n_{\text{train}}\) models, each time using \(n_{\text{train}} - 1\) observations and leave a single observation in the validation test (using once each observation as the validation set), this is called leave-one-out cross-validation (a variant consist in leaving \(p\)4 observations in the validation set in each repetition, called leave-p-out cross-validation).
- To reduce the number of train/validation split, hence to reduce the number of different models to train (which can take some times depending on the type of model that we consider and the dimension of our data), the \(K\)-fold cross-validation was developed: it consists in partitioning the training data into \(K\) groups, and repeat the train/validation split and training \(K\) times, each time using the observations in one group as the validation set and the rest as the training set.
Here is a schematic view explaining the \(K\)-fold cross-validation procedure:
We do the initial train/test split:
# data partition
<- initial_split(yeast_av_data_C_stage, prop = 3/4)
rand_split # train set
<- training(rand_split)
train_set
<- train_set %>% select(!A101) %>% as.matrix()
X_train <- train_set %>% select(A101) %>% as.matrix()
Y_train
# test set
<- testing(rand_split)
test_set
<- test_set %>% select(!A101) %>% as.matrix()
X_test <- test_set %>% select(A101) %>% as.matrix() Y_test
To calibrate \(\lambda\) for the glmnet()
function using cross-validation, we can use the cv.glmnet()
function, it will automatically do the \(K\)-fold splits into train sets and validation sets:
<- cv.glmnet(X_train, Y_train, type.measure = "mse", nfolds = 10) cvfit
We can plot the average prediction error computed across the different repetitions:
plot(cvfit)
Which value of lambda do we choose?
Solution
print(cvfit)
Call: cv.glmnet(x = X_train, y = Y_train, type.measure = "mse", nfolds = 10)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.005497 18 0.0006852 0.0001076 19
1se 0.012123 1 0.0007323 0.0001013 0
$lambda.min cvfit
[1] 0.005497436
We can compute the prediction for the response on the train set and test set:
<- predict(cvfit, newx = X_train, s = "lambda.min")
Y_train_pred
<- predict(cvfit, newx = X_test, s = "lambda.min") Y_test_pred
and the corresponding errors:
<- mse(Y_train_pred, Y_train)
error_train print(error_train)
[1] 0.0003410473
<- mse(Y_test_pred, Y_test)
error_test print(error_test)
[1] 0.0002332386
We can also extract the coefficients for the “best” model (regarding the value of \(\lambda\)):
head(coef(cvfit, s = "lambda.min"))
6 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 2.338709e-01
YAL069W_1 2.469708e-03
YAL069W_2 .
NAL013C_3 9.421748e-06
NAL013C_4 .
NAL013C_5 .
How many SNPs are selected? Which ones?
Solution
We extract the SNP coefficients (excluding the intercept):
<- coef(cvfit, s = "lambda.min")[-1]
SNP_coef names(SNP_coef) <- colnames(X)
Number of lon-null coefficients:
sum(SNP_coef != 0)
[1] 19
List of selected SNPs with their coefficients:
as.matrix(SNP_coef[which(SNP_coef != 0)])
[,1]
YAL069W_1 2.469708e-03
NAL013C_3 9.421748e-06
YCL039W_343 -3.577301e-03
YCL032W_344 -2.996711e-03
YCL018W_359 -2.387471e-03
YDL119C_464 -7.905564e-04
YDL102W_472 -1.278898e-03
YDL101C_473 -4.469054e-18
YDR251W_550 -2.403105e-03
YDR252W_551 -1.331357e-17
YFR053C_865 7.551417e-03
YHR095W_1279 -6.837983e-03
YKL110C_1705 3.263589e-03
NLL003W_1938 -1.219008e-03
NLL003W_1939 -5.832979e-07
YLR426W_2156 3.782808e-04
YPL238C_2804 -6.491660e-03
YPL235W_2805 -7.555793e-18
YPL158C_2821 -5.628979e-03
We could compare this list of selected SNPs with the results from SNP effect estimation based on the ANOVA during the previous practice.
8.12 Bias and re-estimation
What does it mean?
Solution
It mean that if the expectation of any estimated coefficients \(\hat\beta_j\) is different from the “true” value of the parameter \(\beta_j\):
\[\mathbb E[\hat\beta_j] \ne \beta_j\]
Imagine a procedure to illustrate this point?
Solution
We could simulate some data following a sparse linear model \(Y_i = \mathbf x_i\boldsymbol\beta^* + \varepsilon_i\), i.e. with null entries in \(\boldsymbol\beta^*\).
Then we could repeat several times the estimation procedure for the coefficients \(\boldsymbol\beta\) with a Lasso regression and many different train/test sub-sampling. Then, we could check the empirical distribution of the estimated coefficients \(\hat\beta_j\) for each variable \(j\), in particular the corresponding empirical mean.
We would expect that the empirical mean computed on the several estimations \(\hat\beta_j\) is different from the “true” value \(\beta_j^*\) used to generate the data.
It is recommended to re-estimate the coefficients of the selected variables using a non penalized/non regularized model (and dropping the non selected ones).
Now we can fit a linear model and use its statistical properties to run tests, etc.:
<- X_train[,which(SNP_coef != 0)]
select_X_train
<- cbind(select_X_train, Y_train) %>% as.data.frame()
selected_data
<- lm(A101 ~ ., data = selected_data)
simple_model
summary(simple_model)
Call:
lm(formula = A101 ~ ., data = selected_data)
Residuals:
Min 1Q Median 3Q Max
-0.0267566 -0.0069933 -0.0007951 0.0052838 0.0250843
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.251683 0.008583 29.325 < 2e-16 ***
YAL069W_1 0.011153 0.004711 2.367 0.02394 *
NAL013C_3 NA NA NA NA
YCL039W_343 -0.015157 0.007148 -2.120 0.04158 *
YCL032W_344 0.004006 0.008313 0.482 0.63301
YCL018W_359 -0.004674 0.006957 -0.672 0.50639
YDL119C_464 -0.008598 0.005963 -1.442 0.15877
YDL102W_472 -0.001799 0.006367 -0.283 0.77927
YDL101C_473 NA NA NA NA
YDR251W_550 -0.012570 0.004738 -2.653 0.01217 *
YDR252W_551 NA NA NA NA
YFR053C_865 0.007613 0.004676 1.628 0.11300
YHR095W_1279 -0.010259 0.005274 -1.945 0.06030 .
YKL110C_1705 0.010367 0.004734 2.190 0.03571 *
NLL003W_1938 -0.012060 0.005243 -2.300 0.02788 *
NLL003W_1939 NA NA NA NA
YLR426W_2156 0.003698 0.004800 0.770 0.44657
YPL238C_2804 -0.014885 0.004870 -3.056 0.00442 **
YPL235W_2805 NA NA NA NA
YPL158C_2821 -0.013241 0.004736 -2.796 0.00856 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01414 on 33 degrees of freedom
Multiple R-squared: 0.806, Adjusted R-squared: 0.7236
F-statistic: 9.791 on 14 and 33 DF, p-value: 4.613e-08
In practice, we should use post-selection inference procedure because we are using the same data twice (for selection and then for estimation).
8.13 References
We will see later that it is recommended to let
glmnet
define a sequence of candidate values for \(\lambda\), c.f.glmnet()
function documentation.↩︎We remove the first coefficient corresponding to the intercept in the plot.↩︎
or “calibrated model”↩︎
no relationship with the \(p\) quantity used earlier to quantify the total number of predictors that we consider in the linear model↩︎