8  Practice: Introduction to model selection and regularization

Author

Ghislain Durif, Laurent Modolo, Franck Picard

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

8.1 Introduction

Following the last session, we will again focus on linear models, which are widely used in data analysis to relate variables and discover relationships between quantitative and/or qualitative measurements. Linear models can be used for modeling, prediction, data normalization, detection or quantification of factor/variable effects on an outcome, etc.

We will now consider the selection of relevant predictors among a large number of variables in order to predict or explain a response variable, i.e. we will implement model selection procedures. In particular, we will consider regularization approaches for variable selection, as well as model performance evaluation procedures.

Again, we will study the possible associations between the genotype (containing thousands of SNPs) of yeast colonies and some morphological traits directly measured on the yeast through image analysis. We will use data from:

[1] Nogami S, Ohya Y, Yvert G (2007) Genetic Complexity and Quantitative Trait Loci Mapping of Yeast Morphological Traits. PLOS Genetics 3(2): e31. https://doi.org/10.1371/journal.pgen.0030031

8.2 Requirements

We will need the following packages. If not already done, install them first:

install.packages("tidyverse")   # to manipulate data and make plot
install.packages("glmnet")      # to train Lasso model
install.packages("pheatmap")    # to draw heatmap
install.packages("rsample")     # to do sub-sampling

Now, we can load them:

library(tidyverse)   # to manipulate data and make plot
library(glmnet)      # to train Lasso model
library(pheatmap)    # to draw heatmap
library(rsample)     # to do sub-sampling

8.3 Loading the data

We will use data stored in the practical_c.RData file.

data_list <- load(url("https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/practical_c.Rdata"), verbose = T)

We will focus on the morphological data averaged by strain and cell cycle, i.e. the data frame yeast_av_data. For each strain, we will consider the morphological measure A101 and the corresponding genotype. We will also restrict our study to the cell cycle phase "C".

We extract the corresponding data, i.e. only strains in the "C" phase, only the A101 variable and all SNPs (given by gt_data$RQTL_name), and we also remove SNPs with missing values (encoded as NA) in some strains:

yeast_av_data_C_stage <- yeast_av_data %>% 
    filter(cell_cycle == "C") %>%               # keep C cell cycle phase
    select(any_of(gt_data$RQTL_name), A101) %>% # select A101 and SNP variables
    select_if(~ !any(is.na(.)))                 # remove SNP with missing values

How many observations and variables do we have in this data subset?

Solution

dim(yeast_av_data_C_stage)
[1]   64 1546

We have 64 observations and 1546 variables, i.e. one morphological trait and 1545 SNPs.

What could be the issue?

Solution

We have far more variables than observations, a setting known as “high-dimensional data”, which can lead to several challenges during the analysis.

8.4 Linear model

What kind of linear model could we consider to question the relationship between the morphological trait A101 and all the SNPs in our yeast genotypes?

Solution

We want to consider a linear model to relate the morphological trait as a response and the SNP variables as predictors.

  1. We could build one univariate linear model per SNP: \[Y_i = \beta_0 + X_{ij}\beta_j + \varepsilon_i\] for each SNP \(j\), where \(Y_i\) is the value observed for the response on individual \(i\) and \(X_{ij}\) is the value (0 or 1) observed for SNP \(j\) on individual \(i\) (reminder: here individuals are yeast strains), \(\beta_0\) is the intercept and \(\beta_j\) is the slope for SNP \(j\), and \(\varepsilon_i\) the error term for individual \(i\).

BUT it would be cumbersome to fit 1545 separate models (see the sketch after this list).

  2. We will prefer to consider a single multivariate linear model accounting for all \(p\) SNPs: \[Y_i = \beta_0 + \left( \sum_{j=1}^p X_{ij}\beta_j \right) + \varepsilon_i\] The model can also be noted \[Y_i = \beta_0 + \mathbf x_{i}\boldsymbol\beta + \varepsilon_i\] where \(\mathbf x_i = [x_{i1},\dots,x_{ij},\dots,x_{ip}]\) is the vector of dimension \(p\) gathering observed values for all SNPs in individual \(i\), and \(\boldsymbol\beta = [\beta_{1},\dots,\beta_{j},\dots,\beta_{p}]^t\) is the vector of size \(p\) of linear coefficients, i.e. the parameters of the model that we want to estimate.
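For illustration, here is a minimal sketch of what option 1 could look like: a loop fitting one univariate lm() per SNP column of yeast_av_data_C_stage and collecting the estimated slopes (the snp_names and univ_slopes objects are just illustrative helpers).

snp_names <- setdiff(colnames(yeast_av_data_C_stage), "A101")
univ_slopes <- sapply(snp_names, function(snp) {
    # fit A101 ~ <snp> on the strain-averaged data and keep the slope estimate
    single_fit <- lm(reformulate(snp, response = "A101"),
                     data = yeast_av_data_C_stage)
    coef(single_fit)[2]
})
head(univ_slopes)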

We will use the data frame yeast_av_data_C_stage and the function lm() to define the previous model.

lin_mod <- lm(A101 ~ ., data = yeast_av_data_C_stage)

What seems to be the problem? What can we do?

Solution

The lm function returns an error and cannot fit the model. Since we know the closed-form formula to estimate the \(\boldsymbol\beta\) coefficients, we could try to compute them ourselves.

What is the formula to compute the estimation of the \(\boldsymbol\beta\) coefficients in the model \(Y_i = \beta_0 + \mathbf x_{i}\boldsymbol\beta + \varepsilon_i\) using a least square regression approach?

Solution

Reminder: The least square regression approach is based on the minimization of the squared error between the observed and fitted values to estimate the model coefficients:

\[\hat{\boldsymbol\beta} = \underset{\boldsymbol\beta}{\text{argmin}}\ \sum_{i=1}^n \left( Y_i - \mathbf x_i\boldsymbol\beta \right)^2\] Note: we consider \(\mathbf x_i = [1, x_{i1},\dots,x_{ij},\dots,x_{ip}]\) and \(\boldsymbol\beta = [\beta_{0}, \beta_{1},\dots,\beta_{j},\dots,\beta_{p}]^t\) to account for the intercept in the scalar product \(\mathbf x_i\boldsymbol\beta\).

The solution is given by: \[\hat{\boldsymbol\beta} = (\mathbf X^t \mathbf X)^{-1}\mathbf X^t\mathbf y\] where \(\mathbf X\) is the \(n \times (p+1)\) matrix collecting the observations in rows and the variables in columns (a first column of 1s for the intercept, then the value \(x_{ij}\) for SNP \(j\) in individual \(i\) at the row \(i\)/column \(j\) intersection), and \(\mathbf y = [y_i]_{i=1:n}\) the vector of size \(n\) collecting the observed values for the response, i.e. the morphological trait.

Compute the coefficients estimates in our linear model.

Solution

We extract the \(\mathbf X\) matrix and add a column of 1s:

X <- yeast_av_data_C_stage %>% select(!A101) %>% as.matrix()
Xtilde <- cbind(1, X)
str(Xtilde)
 num [1:64, 1:1546] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:1546] "" "YAL069W_1" "YAL069W_2" "NAL013C_3" ...
Xtilde[1:10,1:5]
        YAL069W_1 YAL069W_2 NAL013C_3 NAL013C_4
 [1,] 1         0         0         0         0
 [2,] 1         1         1         1         1
 [3,] 1         0         0         0         0
 [4,] 1         0         0         0         0
 [5,] 1         0         0         0         0
 [6,] 1         1         1         1         1
 [7,] 1         1         1         1         1
 [8,] 1         1         1         1         1
 [9,] 1         1         1         1         1
[10,] 1         1         1         1         1

We compute \((\mathbf X^t \mathbf X)^{-1}\):

solve(t(Xtilde) %*% Xtilde)
Error in solve.default(t(Xtilde) %*% Xtilde): Lapack routine dgesv: system is exactly singular: U[4,4] = 0

What is happening?

Solution

We have some correlated variables, which explains why the matrix \(\mathbf X^t \mathbf X\) is singular and cannot be inverted.

Warning

We have a numerical issue preventing us from fitting the model (i.e. estimating the \(\boldsymbol\beta\) coefficients). It is related to the dimension of our data (and the fact that we have more variables than observations).

Here is the correlation matrix between SNPs, i.e. columns from \(\mathbf X\):

pheatmap(
    cor(X),
    scale = "none", cluster_rows = FALSE, cluster_cols = FALSE,
    labels_row = rep("", ncol(X)), labels_col = rep("", ncol(X)),
    legend_breaks = c(-1, 0, 1)
)

Can you guess what the problem could be?

Solution

In high-dimensional settings, we generally have correlated variables, which is a problem when trying to invert the matrix \(\mathbf X^t \mathbf X\).
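As a quick sanity check (assuming the Xtilde matrix built above), we can look at the numerical rank of the design matrix: it is bounded by the number of rows (64), far below the number of columns, so \(\mathbf X^t \mathbf X\) cannot be of full rank, hence is not invertible.

# the rank of Xtilde is at most its number of rows (64), while
# t(Xtilde) %*% Xtilde is a 1546 x 1546 matrix, hence it is singular
qr(Xtilde)$rank
ncol(Xtilde)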

What can we do to fix this issue?

Solution

Note
  • We can try to directly fix the numerical issue with a regularization procedure for the optimization problem that we are solving, i.e. so that \(\mathbf X^t \mathbf X\) becomes invertible.

  • We can try to reduce the number of SNPs that we consider in our model, so that it can be fitted. One possible way to do the latter is to apply a variable selection procedure; some such procedures are also based on a dedicated regularization of the optimization problem that we are solving.

8.5 Getting started with GLMnet

Note

In the following we will use the glmnet package to perform regularized regression.

Here is an introduction on how to use it.

To do so, we need first to explicitly define the \(\mathbf X\) and \(\mathbf Y\) matrices, respectively containing the observed values for all the SNPs and for the morphological trait in all individuals:

X <- yeast_av_data_C_stage %>% select(!A101) %>% as.matrix()
dim(X)
[1]   64 1545
Y <- yeast_av_data_C_stage %>% select(A101) %>% as.matrix()
dim(Y)
[1] 64  1

To fit a regularized linear model with the glmnet package, we can do:

fit <- glmnet(X, Y)

We will learn in the next sections how to modify the settings to fit the model and how to use the fitted model.

8.6 Ridge regularization

First, we will try to find a way to find an estimation of the \(\boldsymbol\beta\) coefficients in the linear model accounting for all SNPs, i.e. \(Y_i = \beta_0 + \mathbf x_{i}\boldsymbol\beta + \varepsilon_i\), despite the numerical issue during the least square regression optimization procedure when trying to invert the matrix \(\mathbf X^t \mathbf X\).

What can we do?

Solution

We could try to find a way to make the matrix \(\mathbf X^t \mathbf X\) invertible.

What method can we use?

Solution

We will use a Ridge regression based on a \(\ell_2\) regularization of the optimization problem.

We add a \(\ell_2\) penalty in the least square regression optimization problem, so that the penalized problem becomes:

\[\hat{\boldsymbol\beta}_r = \underset{\boldsymbol\beta}{\text{argmin}}\ \left\{ \sum_{i=1}^n \left( Y_i - \mathbf x_i\boldsymbol\beta \right)^2 \right\} + \lambda\, \Vert \boldsymbol\beta \Vert_2^{\,2}\]

where \(\Vert \boldsymbol\beta \Vert_2^{\,2} = \sum_j \vert \beta_j \vert^2\) is the \(\ell_2\) norm, and \(\lambda\geq 0\) a penalty parameter.

Note

The solution of the Ridge regression problem is given by:

\[\hat{\boldsymbol\beta}_r = (\mathbf X^t \mathbf X + \lambda\, \text{I}_p)^{-1}\mathbf X^t\mathbf y\] where \(\text{I}_p\) is the \(p \times p\) identity matrix (with 1 on the diagonal and 0 elsewhere).

What does the Ridge regularization do?

Solution

By adding a quantity \(\lambda\) on the diagonal of \(\mathbf X^t \mathbf X\), the matrix becomes invertible and we can compute the estimates of the coefficients in our model.

Compute \(\hat{\boldsymbol\beta}_r\) for \(\lambda = 1\).

Solution

lambda <- 1

hat_beta_ridge <- solve(t(Xtilde) %*% Xtilde + 
                            lambda * diag(1, ncol(Xtilde))) %*% t(Xtilde) %*% Y

Tip

We can also use the glmnet package to fit the Ridge regression by setting the alpha parameter to 0. We can use the lambda parameter to set the value of \(\lambda\)1.

ridge_mod <- glmnet(X, Y, alpha = 0, lambda = 1)

We represent the values of the coefficients in the Ridge regression model depending on the SNP index.

plot(coef(ridge_mod)[-1])

We are able to estimate the 1545 coefficients in the model. However, what could be the issue regarding the model interpretability?

Solution

We still have numerous predictors (i.e. SNPs here) which makes the model difficult to interpret, especially to assess which ones are relevant to explain the response and which are not.
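As a quick check (assuming the ridge_mod object fitted above), we can count the SNPs with a non-null coefficient in the Ridge fit: the Ridge penalty shrinks coefficients towards 0 but does not set them exactly to 0, so essentially all SNPs remain in the model.

# number of SNPs with a non-null coefficient in the Ridge fit (intercept excluded)
sum(coef(ridge_mod)[-1] != 0)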

What can we do?

Solution

We can try to select the relevant predictors (here SNPs) that are related to the response and drop the irrelevant ones.

Note

Based on a parsimony (a.k.a. sparsity) hypothesis, we can assume that, among the thousands of SNPs that we are considering, not all of them will be useful to explain the response, and we should not consider all of them.

8.7 How to select variables

What could we do to select the variables that we will consider in our model?

Solution

  1. We could build several models, each including or excluding each variable, and select the “best” one. But:
  • how to compare models?
  • there are \(2^p\) models to consider (here \(2^{1545} \gg 2^{1000}\) models!!!)
  2. We will prefer to use, for the model considering all SNPs, a regularization method, based on a penalized optimization problem, that will directly select the relevant predictors.

8.8 Lasso regularization

What is the Lasso regression?

Solution

The Lasso regression is based on a \(\ell_1\) regularization of the optimization problem.

We add a \(\ell_1\) penalty in the least square regression optimization problem, so that the penalized problem becomes:

\[\hat{\boldsymbol\beta}_l = \underset{\boldsymbol\beta}{\text{argmin}}\ \left\{ \sum_{i=1}^n \left( Y_i - \mathbf x_i\boldsymbol\beta \right)^2 \right\} + \lambda\, \Vert \boldsymbol\beta \Vert_1\]

where \(\Vert \boldsymbol\beta \Vert_1 = \sum_j \vert \beta_j \vert\) is the \(\ell_1\) norm, and \(\lambda\geq 0\) a penalty parameter.

What is the effect on the estimation of the \(\boldsymbol\beta\) coefficients when using the Lasso?

Solution

Some coefficients are automatically set exactly to 0 (c.f. below).
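To give an intuition of why exact zeros appear, here is a classical result, sketched under the simplifying assumption of an orthonormal design, i.e. \(\mathbf X^t \mathbf X = \text{I}_p\) (which does not hold for our data): in that case, the Lasso solution is obtained by soft-thresholding the least square estimates \(\hat\beta_j\): \[\hat\beta_{l,j} = \text{sign}(\hat\beta_j)\left( \vert \hat\beta_j \vert - \frac{\lambda}{2} \right)_+\] where \((u)_+ = \max(u, 0)\). Any coefficient whose least square estimate is smaller than \(\lambda/2\) in absolute value is thus set exactly to 0, whereas the Ridge penalty only shrinks coefficients towards 0 without cancelling them.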

8.8.1 Training the model

To fit a Lasso linear model with the glmnet package, we can do:

fit <- glmnet(X, Y)
Tip

We can use the lambda parameter to specify a given value for \(\lambda\).

We can use the coef() function to collect the estimated coefficients for the fitted model.

Train the Lasso model for \(\lambda = 1\) and for \(\lambda = 0.001\).

Solution

We train the Lasso model for a given \(\lambda = 1\):

fit <- glmnet(X, Y, lambda = 1)

Here are the values of the estimated coefficients2:

plot(coef(fit)[-1])

We use a smaller value of \(\lambda\) and train the Lasso model for a given \(\lambda = 0.001\):

fit <- glmnet(X, Y, lambda = 0.001)

Here are the values of the estimated coefficients:

plot(coef(fit)[-1])

What can we say about this?

Solution

For \(\lambda = 1\), all coefficients are set to 0: the penalty is too strong.

For \(\lambda = 0.001\), many coefficients are set to 0, and some are not. The non-null coefficients correspond to selected SNPs.
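A quick way to check this (assuming the fit object trained just above with lambda = 0.001):

# number of selected SNPs, i.e. non-null coefficients (intercept excluded)
sum(coef(fit)[-1] != 0)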

Tip

In practice, we do not set a given value for \(\lambda\) when calling the glmnet() function. We let it compute a sequence of candidate values for \(\lambda\) and train the model for these different values, hence we get a sequence of fitted models, one for each candidate value of \(\lambda\).

To do so, we do:

fit <- glmnet(X, Y)

The glmnet() function trains different models with 100 different lambda values. Here are the lambda values that are used in this case:

fit$lambda
  [1] 0.0103321581 0.0098625456 0.0094142776 0.0089863842 0.0085779392
  [6] 0.0081880586 0.0078158987 0.0074606540 0.0071215558 0.0067978701
 [11] 0.0064888964 0.0061939660 0.0059124407 0.0056437112 0.0053871958
 [16] 0.0051423394 0.0049086122 0.0046855082 0.0044725446 0.0042692605
 [21] 0.0040752161 0.0038899912 0.0037131851 0.0035444151 0.0033833160
 [26] 0.0032295390 0.0030827515 0.0029426357 0.0028088883 0.0026812200
 [31] 0.0025593545 0.0024430278 0.0023319885 0.0022259960 0.0021248210
 [36] 0.0020282446 0.0019360578 0.0018480610 0.0017640637 0.0016838843
 [41] 0.0016073492 0.0015342927 0.0014645567 0.0013979904 0.0013344495
 [46] 0.0012737967 0.0012159007 0.0011606361 0.0011078834 0.0010575284
 [51] 0.0010094621 0.0009635805 0.0009197843 0.0008779787 0.0008380732
 [56] 0.0007999815 0.0007636211 0.0007289133 0.0006957831 0.0006641587
 [61] 0.0006339716 0.0006051567 0.0005776513 0.0005513962 0.0005263344
 [66] 0.0005024117 0.0004795763 0.0004577788 0.0004369721 0.0004171110
 [71] 0.0003981527 0.0003800560 0.0003627819 0.0003462929 0.0003305533
 [76] 0.0003155292 0.0003011879 0.0002874984 0.0002744312 0.0002619579
 [81] 0.0002500515 0.0002386863 0.0002278376 0.0002174820 0.0002075971
 [86] 0.0001981615 0.0001891548 0.0001805574 0.0001723508 0.0001645172
 [91] 0.0001570396 0.0001499019 0.0001430886 0.0001365850 0.0001303770
 [96] 0.0001244512 0.0001187947 0.0001133953 0.0001082413 0.0001033216

The coef() function then returns a matrix with the values of the coefficients, with one row per variable and one column per value of \(\lambda\):

coef(fit)[1:10,1:5]
10 x 5 sparse Matrix of class "dgCMatrix"
                   s0        s1        s2        s3        s4
(Intercept) 0.2236754 0.2240095 0.2239531 0.2238992 0.2238478
YAL069W_1   .         .         .         .         .        
YAL069W_2   .         .         .         .         .        
NAL013C_3   .         .         .         .         .        
NAL013C_4   .         .         .         .         .        
NAL013C_5   .         .         .         .         .        
NAL013C_6   .         .         .         .         .        
NAL013C_7   .         .         .         .         .        
NAL013C_8   .         .         .         .         .        
NAL013C_9   .         .         .         .         .        

8.8.2 Regularization path

Here is the regularization path, i.e. the plot of the estimated value of each coefficient (one line per SNP) depending on the value of \(\lambda\) across the different fitted models (the number at the top quantifies the number of non-null coefficients for the corresponding value of \(\lambda\)):

plot(fit, xvar = "lambda")

What is the impact of \(\lambda\) on the coefficient estimations?

Solution

As \(\lambda\) increases, the penalty on each coefficient becomes stronger, and all coefficients are driven towards 0.

We can count the number of non-null coefficients in the fitted model for each value of \(\lambda\) and plot it depending on \(\lambda\):

non_null_count <- apply(
    coef(fit)[-1,],         # we remove the line corresponding to the intercept
    2,                      # we apply the count on each column
    function(col) sum(col!=0))
plot(log(fit$lambda), non_null_count, xlab = "log(Lambda)")

What can we say about this plot?

Solution

It confirms the point made with the regularization path plot: as \(\lambda\) increases, more and more coefficients are set to 0, i.e. the number of non-null coefficients decreases.

How can we manage this sequence of candidate values? Can we choose one? Which one?

Solution

We have trained numerous different models, one for each candidate value for \(\lambda\). In practice we need to choose which value of \(\lambda\) we are going to use, because the null coefficients, i.e. corresponding to variables that are discarded from the model, are not the same depending on \(\lambda\).

8.9 Model selection

Choosing a \(\lambda\) value when fitting a Lasso regression model is equivalent to choosing a model among several candidate models, one for each value of \(\lambda\), which calls for a model selection procedure.

What does it mean that a coefficient of a variable (i.e. a SNP in our case) is set to 0 or not in the fitted model? (depending on the value of \(\lambda\))

Solution

If the coefficient of a variable is set to 0 in the fitted model, then the corresponding variable does not contribute to the model, and can be considered as not relevant to explain/predict the response.

Why can the Lasso regression be viewed as a variable selection procedure?

Solution

The variables with non-null coefficients in the fitted model are the only ones that contribute to the model to explain/predict the response. In that sense, the Lasso regression is a variable selection procedure.

Note

In our context of Lasso regression, the different models differ by the coefficients that are set to 0 or not, i.e. by which variables are selected to contribute to the model, depending on the value of the penalty parameter \(\lambda\).

In that case, choosing the “best” value for \(\lambda\) corresponds to choosing the best model to fit our data. Hence, we need a model selection procedure.

How can we compare different models? On which criterion can we define the “best” one, i.e. the one that we will choose?

Solution

Since we have a model relating a response \(y\in\mathbb R\) (here the morphological trait) to a set of predictors \(\mathbf x\in\mathbb R^p\) (here the SNPs), we can use the fitted model to predict the value for the response corresponding to the observed values for the predictors.

Then, we can compare the average difference between predicted and observed values for each model, i.e. compute an error of prediction for each model.

Eventually, we could choose the model with the lowest prediction error.

Note

There exist other criteria that can be used to compare models, e.g. based on geometric properties, such as the decomposition of variance, or based on probabilistic assumptions (using a penalized (log-)likelihood), such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). We will not use those here.

More on this matter in the “Tuning lambda” section below.

8.10 Generalization ability

We can use the predict() function to compute the response values predicted by the fitted model for given observations of the predictors.

y_pred <- predict(fit, newx = X)

We get a matrix with predictions for each individual in newx input (in rows) and for each candidate value for \(\lambda\) (in columns):

y_pred[1:10,1:5]
             s0        s1        s2        s3        s4
 [1,] 0.2236754 0.2231958 0.2231297 0.2230667 0.2230065
 [2,] 0.2236754 0.2230775 0.2222374 0.2214355 0.2206700
 [3,] 0.2236754 0.2230775 0.2222374 0.2214355 0.2206700
 [4,] 0.2236754 0.2241277 0.2248454 0.2255305 0.2261844
 [5,] 0.2236754 0.2230775 0.2222374 0.2214355 0.2206700
 [6,] 0.2236754 0.2240095 0.2239531 0.2238992 0.2238478
 [7,] 0.2236754 0.2230775 0.2222374 0.2214355 0.2206700
 [8,] 0.2236754 0.2231958 0.2231297 0.2230667 0.2230065
 [9,] 0.2236754 0.2241277 0.2248454 0.2255305 0.2261844
[10,] 0.2236754 0.2230775 0.2222374 0.2214355 0.2206700

In general, we are studying a phenomenon/characteristic/property of an entire population of individuals, not just of individuals that have been sampled and for which we have observations for the quantities of interest.

We rather aim at inferring general properties/characteristics in the entire population from the observed sample.

How can we proceed to be sure that the properties/characteristics that we derive from the model that we are using, built with the observed sample, will be consistent with what is happening in the entire population?

Solution

We need a way to quantify that the properties/characteristics of our model fitted on the observed sample will correspond to the properties/characteristics of the general population.

To do so, we can build statistical models based on probabilistic tools (like the Gaussian linear model and the ANOVA) to quantify the uncertainty of our model.

Or we can imagine a procedure that will evaluate the ability of our fitted model to generalize to unknown individuals.

Note

Here, we question the ability of our model to generalize from the observed sample to unobserved individuals in the entire population.

Imagine a procedure to evaluate the ability of our fitted model to generalize to unknown individuals?

Solution

We could split our data (= observed sample) into two separate sets:

  • the training set that will be used to fit/train the model.
  • the test set that will NOT be used in the training phase, but for which we will compute the predicted values for the response according to our fitted model and compute a prediction error.

In that way, the prediction error on the test set quantifies whether our fitted model is able to generalize to previously unseen observations.

What could be the issue when computing the prediction with the same data used to train the model?

Solution

We could have a model which is very good at prediction on the training set but not suited to generalize to the entire population, hence performs poorly in prediction on the test set (c.f. below).

Here is the procedure:

  1. We split our data into a training set and a test set:
# data partition
rand_split <- initial_split(yeast_av_data_C_stage, prop = 3/4)
# train set
train_set <- training(rand_split)

X_train <- train_set %>% select(!A101) %>% as.matrix()
Y_train <- train_set %>% select(A101) %>% as.matrix()

# test set
test_set <- testing(rand_split)

X_test <- test_set %>% select(!A101) %>% as.matrix()
Y_test <- test_set %>% select(A101) %>% as.matrix()
  2. We fit/train the model using only the training set:
fit <- glmnet(X_train,Y_train)
  3. We compute the predicted values for the response, both on the training set and the test set:
Y_train_pred <- predict(fit, newx = X_train)

Y_test_pred <- predict(fit, newx = X_test)
  4. Using the following function, we compute the mean squared error (MSE) of prediction, both on the training set and the test set:
mse <- function(pred, truth) {
    return(mean((pred-truth)^2))
}
error_train_tab <- apply(Y_train_pred, 2, mse, truth = Y_train)

error_test_tab <- apply(Y_test_pred, 2, mse, truth = Y_test)
Note

We have predicted values for each observation and each value of \(\lambda\), since the glmnet() function trains several Lasso models for different values of \(\lambda\).

  5. We plot the prediction error on the training set and on the test set (here depending on \(\lambda\)):
error_tab <- tibble(
    error = c(error_train_tab, error_test_tab),
    lambda = rep(fit$lambda, times = 2),
    status = rep(c("train", "test"), each = length(fit$lambda)))

ggplot(
    error_tab, 
    aes(x = log(lambda), y = error, shape = status, color = status)) +
    geom_point() + geom_line() +
    xlab("log(Lambda)") + ylab("Prediction error") +
    theme_bw()

What can we say about the different fitted models (one per \(\lambda\) value in this case)?

Solution

The model achieving the best performance (regarding the prediction error) is not the same on the training set and on the test set.

  • Regarding the training set, the prediction error increases with \(\lambda\).

  • Regarding the test set, the prediction error first decreases for small values of \(\lambda\), then reaches a minimum, and increases again for large \(\lambda\) values.

What can you say about the difference between the prediction error on the training set and on the test set when \(\lambda\) is small (left part of the plot)?

Solution

The prediction error on the training set is very low while the prediction error on the test set is very high, meaning that models fitted with small \(\lambda\) values do not generalize well to observations that were not used for training.

Note

This phenomenon is called over-fitting. The model is too specific for the training set but not suitable outside of it.

Imagine another application where this training/test split and prediction error computation procedure might be handy?

Solution

Imagine we want to select yeast strains with the highest growth rate. Based on the model relating the morphological trait measure and the genotype in the 64 yeast strains that we considered, we could genotype several other yeast strains and predict the value for the morphological trait that we could expect based on the genotype (without doing the experiment of measuring it).

Then, we would focus on just a few strains with the highest prediction for the morphological trait and run the experiment of growing them and actually measuring the morphological trait, which would save some time and money (instead of doing it on all yeast strains that we might consider).

Note

The prediction task is at the core of many applications of what we call machine learning. In such cases, the train/test split and prediction error computation procedure is at the core of model training and selection.

8.11 Tuning lambda

We still need to choose which model we are going to use (i.e. choose the value for \(\lambda\)), in order to know which SNPs are selected when relating them to the morphological trait.

Note

This is called hyper-parameter tuning or calibration.

Using the previous graph representing the prediction error on the training and test set depending on \(\lambda\), can we choose which model we consider the “best” one?

Solution

Considering the generalization ability question, we could choose the model for which the prediction error on the test set is the smallest.
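In code, assuming the fit object and the error_test_tab vector computed in the previous section, this choice would simply be:

# candidate lambda value minimizing the prediction error on the test set
best_lambda <- fit$lambda[which.min(error_test_tab)]
best_lambda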

Here is another repetition of the training/test split procedure:

# data partition
rand_split <- initial_split(yeast_av_data_C_stage, prop = 3/4)
# train set
train_set <- training(rand_split)

X_train <- train_set %>% select(!A101) %>% as.matrix()
Y_train <- train_set %>% select(A101) %>% as.matrix()

# test set
test_set <- testing(rand_split)

X_test <- test_set %>% select(!A101) %>% as.matrix()
Y_test <- test_set %>% select(A101) %>% as.matrix()

# fit the model
fit <- glmnet(X_train,Y_train)

# prediction and error
Y_train_pred <- predict(fit, newx = X_train)
error_train_tab <- apply(Y_train_pred, 2, mse, truth = Y_train)

Y_test_pred <- predict(fit, newx = X_test)
error_test_tab <- apply(Y_test_pred, 2, mse, truth = Y_test)

# graph
error_tab <- tibble(
    error = c(error_train_tab, error_test_tab),
    lambda = rep(fit$lambda, times = 2),
    status = rep(c("train", "test"), each = length(fit$lambda)))

ggplot(
    error_tab, 
    aes(x = log(lambda), y = error, shape = status, color = status)) +
    geom_point() + geom_line() +
    xlab("log(Lambda)") + ylab("Prediction error") +
    theme_bw()

What can we say about this result?

Solution

The profile of the prediction error on the training set depending on \(\lambda\) seems to be consistent between the different samplings.

However, the profile of the prediction error on the test set depending on \(\lambda\) is quite different between the two samplings, and the best model in terms of prediction error is achieved for quite different \(\lambda\) values.

What is the issue here?

Solution

The lambda calibration procedure, i.e. the model selection procedure, is very dependent on the initial train/test random split. Hence, the number of SNPs that will be selected is heavily dependent on the original sampling.

We need to implement a procedure to mitigate this dependence, since the train/test split is done randomly: we do not know which split we should choose to calibrate \(\lambda\), and we do not want to favor a particular one.

Which procedure could we use to reduce the dependence of the hyper-parameter calibration on the train/test random split?

Solution

We could repeat the splitting procedure several times and compute an average prediction error across the repetitions, depending on \(\lambda\).

Tip

This procedure is called cross-validation.

Warning

We are going to keep the test set aside and do several splittings of the training set itself, which will be split into an actual training set used for the training and a validation set used to compute the prediction error, in order to compare the different models and choose the \(\lambda\) value.

Why are we considering a hierarchical split of the data?

  • data used for training: repeat train set/validation set split
  • test set
Solution

Note

Since the hyper-parameter calibration is part of the training procedure, we are going to keep the test set for downstream generalization ability quantification for the tuned model3 (i.e. the one for which we have chosen the “best” value of \(\lambda\)).

Imagine different strategies to repeat the train set/validation set split during the cross-validation procedure?

Solution

  • We could repeat a random sub-sampling several times (without any link between the different repetitions); this is called Monte-Carlo cross-validation.
  • If we have \(n_{\text{train}}\) observations available for training (not counting the test set), we could train \(n_{\text{train}}\) models, each time using \(n_{\text{train}} - 1\) observations and leaving a single observation in the validation set (using each observation once as the validation set); this is called leave-one-out cross-validation (a variant consists in leaving \(p\)4 observations in the validation set in each repetition, called leave-p-out cross-validation).
  • To reduce the number of train/validation splits, hence the number of different models to train (which can take some time depending on the type of model that we consider and the dimension of our data), \(K\)-fold cross-validation was developed: it consists in partitioning the training data into \(K\) groups, and repeating the train/validation split and the training \(K\) times, each time using the observations in one group as the validation set and the rest as the training set (a minimal hand-made sketch is given below).
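For illustration, here is a minimal hand-made sketch of the \(K\)-fold idea for a single candidate \(\lambda\) value, using the vfold_cv(), analysis() and assessment() functions from the rsample package loaded above (the lambda_candidate value and the choice of 5 folds are arbitrary here; the dedicated cv.glmnet() function used below does all of this over a whole sequence of \(\lambda\) values):

# hand-made K-fold cross-validation for a single candidate lambda value
lambda_candidate <- 0.001
folds <- vfold_cv(train_set, v = 5)            # 5 folds on the training data
cv_errors <- sapply(folds$splits, function(split) {
    fold_train <- analysis(split)              # K-1 folds used for training
    fold_valid <- assessment(split)            # remaining fold used for validation
    fold_fit <- glmnet(
        x = fold_train %>% select(!A101) %>% as.matrix(),
        y = fold_train %>% pull(A101),
        lambda = lambda_candidate)
    pred <- predict(fold_fit, newx = fold_valid %>% select(!A101) %>% as.matrix())
    mse(pred, fold_valid %>% pull(A101))       # prediction error on the validation fold
})
mean(cv_errors)                                # average CV error for this candidate lambda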

Here is a schematic view explaining the \(K\)-fold cross-validation procedure:

Figure 8.1: Reference: Lever, Krzywinski, and Altman (2016) (a) A strategy with K = 5 without model selection. Training sets and test sets are used to derive prediction statistics. (b) Nested K-fold cross-validation with model selection. This strategy uses a validation set for model selection using the strategy of a. The best model is then tested on the separate test set. Gray bars indicate samples not used at the represented stage.

We do the initial train/test split:

# data partition
rand_split <- initial_split(yeast_av_data_C_stage, prop = 3/4)
# train set
train_set <- training(rand_split)

X_train <- train_set %>% select(!A101) %>% as.matrix()
Y_train <- train_set %>% select(A101) %>% as.matrix()

# test set
test_set <- testing(rand_split)

X_test <- test_set %>% select(!A101) %>% as.matrix()
Y_test <- test_set %>% select(A101) %>% as.matrix()

To calibrate \(\lambda\) for the glmnet() function using cross-validation, we can use the cv.glmnet() function, it will automatically do the \(K\)-fold splits into train sets and validation sets:

cvfit <- cv.glmnet(X_train, Y_train, type.measure = "mse", nfolds = 10)

We can plot the average prediction error computed across the different repetitions:

plot(cvfit)

Which value of lambda do we choose?

Solution

print(cvfit)

Call:  cv.glmnet(x = X_train, y = Y_train, type.measure = "mse", nfolds = 10) 

Measure: Mean-Squared Error 

      Lambda Index   Measure        SE Nonzero
min 0.005497    18 0.0006852 0.0001076      19
1se 0.012123     1 0.0007323 0.0001013       0
cvfit$lambda.min
[1] 0.005497436
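Note that the summary printed above also reports a “1se” \(\lambda\) value, i.e. the largest \(\lambda\) whose cross-validated error is within one standard error of the minimum; it can be retrieved as follows, but here it would select no SNP at all (the “Nonzero” column is 0), which is why we use lambda.min in the following.

# lambda value following the "one-standard-error" rule
cvfit$lambda.1se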

We can compute the predictions for the response on the training set and the test set:

Y_train_pred <- predict(cvfit, newx = X_train, s = "lambda.min")

Y_test_pred <- predict(cvfit, newx = X_test, s = "lambda.min")

and the corresponding errors:

error_train <- mse(Y_train_pred, Y_train)
print(error_train)
[1] 0.0003410473
error_test <- mse(Y_test_pred, Y_test)
print(error_test)
[1] 0.0002332386

We can also extract the coefficients for the “best” model (regarding the value of \(\lambda\)):

head(coef(cvfit, s = "lambda.min"))
6 x 1 sparse Matrix of class "dgCMatrix"
                      s1
(Intercept) 2.338709e-01
YAL069W_1   2.469708e-03
YAL069W_2   .           
NAL013C_3   9.421748e-06
NAL013C_4   .           
NAL013C_5   .           

How many SNPs are selected? Which ones?

Solution

We extract the SNP coefficients (excluding the intercept):

SNP_coef <- coef(cvfit, s = "lambda.min")[-1]
names(SNP_coef) <- colnames(X)

Number of non-null coefficients:

sum(SNP_coef != 0)
[1] 19

List of selected SNPs with their coefficients:

as.matrix(SNP_coef[which(SNP_coef != 0)])
                      [,1]
YAL069W_1     2.469708e-03
NAL013C_3     9.421748e-06
YCL039W_343  -3.577301e-03
YCL032W_344  -2.996711e-03
YCL018W_359  -2.387471e-03
YDL119C_464  -7.905564e-04
YDL102W_472  -1.278898e-03
YDL101C_473  -4.469054e-18
YDR251W_550  -2.403105e-03
YDR252W_551  -1.331357e-17
YFR053C_865   7.551417e-03
YHR095W_1279 -6.837983e-03
YKL110C_1705  3.263589e-03
NLL003W_1938 -1.219008e-03
NLL003W_1939 -5.832979e-07
YLR426W_2156  3.782808e-04
YPL238C_2804 -6.491660e-03
YPL235W_2805 -7.555793e-18
YPL158C_2821 -5.628979e-03

We could compare this list of selected SNPs with the results from SNP effect estimation based on the ANOVA during the previous practice.

8.12 Bias and re-estimation

The estimates of the coefficients \(\beta_j\) (for each variable \(j\)) in the Lasso regression model are known to be biased.

What does it mean?

Solution

It means that the expectation of an estimated coefficient \(\hat\beta_j\) is different from the “true” value of the parameter \(\beta_j\):

\[\mathbb E[\hat\beta_j] \ne \beta_j\]

Imagine a procedure to illustrate this point?

Solution

We could simulate some data following a sparse linear model \(Y_i = \mathbf x_i\boldsymbol\beta^* + \varepsilon_i\), i.e. with null entries in \(\boldsymbol\beta^*\).

Then, we could repeat the estimation procedure for the coefficients \(\boldsymbol\beta\) several times, with a Lasso regression and many different train/test sub-samplings. Then, we could check the empirical distribution of the estimated coefficients \(\hat\beta_j\) for each variable \(j\), in particular the corresponding empirical mean.

We would expect the empirical mean computed over the several estimates \(\hat\beta_j\) to be different from the “true” value \(\beta_j^*\) used to generate the data.
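Here is a minimal simulation sketch of this idea (the sample size, number of variables, \(\lambda\) value and coefficient values below are arbitrary choices): we generate data from a sparse linear model with known coefficients, re-estimate them many times with the Lasso on freshly simulated data, and compare the empirical mean of the estimates to the “true” values; the non-null coefficients are shrunk towards 0 on average.

# minimal simulation sketch: Lasso estimates are biased (shrunk towards 0)
set.seed(42)
n_sim <- 100; n_obs <- 60; n_var <- 20
true_beta <- c(2, -1.5, 1, rep(0, n_var - 3))   # sparse "true" coefficients
beta_estimates <- replicate(n_sim, {
    X_sim <- matrix(rnorm(n_obs * n_var), nrow = n_obs)
    y_sim <- X_sim %*% true_beta + rnorm(n_obs)
    as.numeric(coef(glmnet(X_sim, y_sim, lambda = 0.2))[-1])
})
# empirical mean of the estimates vs the "true" values (first coefficients)
cbind(true = true_beta[1:5], mean_estimate = rowMeans(beta_estimates)[1:5])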

Tip

It is recommended to re-estimate the coefficients of the selected variables using a non-penalized/non-regularized model (dropping the non-selected variables).

Now we can fit a linear model and use its statistical properties to run tests, etc.:

select_X_train <- X_train[,which(SNP_coef != 0)]

selected_data <- cbind(select_X_train, Y_train) %>% as.data.frame()

simple_model <- lm(A101 ~ ., data = selected_data)

summary(simple_model)

Call:
lm(formula = A101 ~ ., data = selected_data)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0267566 -0.0069933 -0.0007951  0.0052838  0.0250843 

Coefficients: (5 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.251683   0.008583  29.325  < 2e-16 ***
YAL069W_1     0.011153   0.004711   2.367  0.02394 *  
NAL013C_3           NA         NA      NA       NA    
YCL039W_343  -0.015157   0.007148  -2.120  0.04158 *  
YCL032W_344   0.004006   0.008313   0.482  0.63301    
YCL018W_359  -0.004674   0.006957  -0.672  0.50639    
YDL119C_464  -0.008598   0.005963  -1.442  0.15877    
YDL102W_472  -0.001799   0.006367  -0.283  0.77927    
YDL101C_473         NA         NA      NA       NA    
YDR251W_550  -0.012570   0.004738  -2.653  0.01217 *  
YDR252W_551         NA         NA      NA       NA    
YFR053C_865   0.007613   0.004676   1.628  0.11300    
YHR095W_1279 -0.010259   0.005274  -1.945  0.06030 .  
YKL110C_1705  0.010367   0.004734   2.190  0.03571 *  
NLL003W_1938 -0.012060   0.005243  -2.300  0.02788 *  
NLL003W_1939        NA         NA      NA       NA    
YLR426W_2156  0.003698   0.004800   0.770  0.44657    
YPL238C_2804 -0.014885   0.004870  -3.056  0.00442 ** 
YPL235W_2805        NA         NA      NA       NA    
YPL158C_2821 -0.013241   0.004736  -2.796  0.00856 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01414 on 33 degrees of freedom
Multiple R-squared:  0.806, Adjusted R-squared:  0.7236 
F-statistic: 9.791 on 14 and 33 DF,  p-value: 4.613e-08
Warning

In practice, we should use post-selection inference procedures because we are using the same data twice (for selection and then for estimation).

8.13 References

Lever, Jake, Martin Krzywinski, and Naomi Altman. 2016. “Model Selection and Overfitting.” Nature Methods 13 (9): 703–4. https://doi.org/10.1038/nmeth.3968.

  1. We will see later that it is recommended to let glmnet define a sequence of candidate values for \(\lambda\), c.f. glmnet() function documentation.↩︎

  2. We remove the first coefficient corresponding to the intercept in the plot.↩︎

  3. or “calibrated model”↩︎

  4. no relationship with the \(p\) quantity used earlier to quantify the total number of predictors that we consider in the linear model↩︎