```
library(tidyverse)
library(factoextra)
library(igraph)
library(cccd)
library(mclust)
library(umap)
load(url("https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/practical_b.Rdata"), verbose = T)
```

# 4 Practice: Introduction to clustering

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

## 4.1 Introduction

The goal of single-cell transcriptomics is to measure the transcriptional states of large numbers of cells simultaneously. The input to a single-cell RNA sequencing (scRNAseq) method is a collection of cells. Formally, the desired output is a transcript or genes (\(M\)) x cells (\(N\)) matrix \(X^{N \times M}\) that describes, for each cell, the abundance of its constituent transcripts or genes. More generally, single-cell genomics methods seek to measure not just transcriptional state, but other modalities in cells, e.g., protein abundances, epigenetic states, cellular morphology, etc.

We will be analyzing the a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here.

## 4.2 Loading the data

You loaded 3 variables

`data`

a single-cell RNA-Sequencing count matrix`cell_annotation`

a vector containing the cell-type of the cells`var_gene_2000`

an ordered vector of the 2000 most variable genes

Check the dimension of your data

## Solution

The scRNASeq data

`dim(data)`

`[1] 2000 2638`

The number of cell types

`length(cell_annotation)`

`[1] 2638`

`table(cell_annotation)`

```
cell_annotation
Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T FCGR3A+ Mono
697 483 480 344 271 162
NK DC Platelet
155 32 14
```

`length(var_gene_2000)`

`[1] 2000`

`head(var_gene_2000)`

`[1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" `

Why do you think that we need a list of the 2000 most variable genes ?

## 4.3 Distances

The clustering algorithms seen this morning rely on Gram matrices. You can compute the Euclidean distance matrices of `data`

with the `dist()`

function (but don’t try to run it on the 2000 genes)

The following code computes the cell-to-cell Euclidean distances for the 10 most variable genes and the first 100 cells

```
<- data[var_gene_2000[1:10], 1:100] %>%
c2c_dist_10 t() %>%
dist()
```

Use the following code to study the impact of the number of genes on the distances

```
%>%
c2c_dist_10 as.vector() %>%
as_tibble() %>%
ggplot() +
geom_histogram(aes(x = value))
```

What happens when the number of dimensions increases ?

## Solution

```
<- tibble(
c2c_dist_n n_var = c(
seq(from = 10, to = 200, by = 50),
seq(from = 200, to = 2000, by = 500), 2000
)%>%
) mutate(
cell_dist = purrr::map(n_var, function(n_var, data, var_gene_2000) {
1:n_var], 1:100] %>%
data[var_gene_2000[t() %>%
dist() %>%
as.vector() %>%
as_tibble()
data = data, var_gene_2000)
},
)%>%
c2c_dist_n unnest(c(cell_dist)) %>%
ggplot() +
geom_histogram(aes(x = value)) +
facet_wrap(~n_var)
```

``stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

To circumvent this problem we are going to use the PCA as a dimension reduction technique.

Use the `prcomp()`

function to compute `data_pca`

from the 600 most variable genes. You can check the results with the following code, the `cell_annotation`

variable is the cell type label of each cell in the dataset:

```
fviz_pca_ind(
data_pca,geom = "point",
col.ind = cell_annotation
)
```

## Solution

```
<- data[var_gene_2000[1:600], ] %>%
data_pca t() %>%
prcomp()
```

You can use the `fviz_eig()`

function to choose the number of dimensions that you are going to use to compute your distances matrix. Save this matrix in the `data_dist`

variable

## Solution

Check the variability explained by the axes of the PCA

`fviz_eig(data_pca)`

Compute the distance matrix on the 2 first PCs

`<- dist(data_pca$x[, 1:2]) data_dist `

## 4.4 Hierarchical Clustering

The `hclust()`

function performs a hierarchical clustering analysis from a distance matrix.

`<- hclust(data_dist) data_hclust `

You can use the `plot()`

function to plot the resulting dendrogram.

## Solution

`%>% plot() data_hclust `

Too much information can drawn the information, the function `cutree()`

can help you solve this problem.

Which choice of `k`

would you take ?

Modify the following code to display your clustering.

```
%>%
data_pca fviz_pca_ind(
geom = "point",
col.ind = cell_annotation # we want a as.factor()
)
```

## Solution

```
%>%
data_pca fviz_pca_ind(
geom = "point",
col.ind = as.factor(cutree(data_hclust, k = 9))
)
```

The adjusted Rand index can be computed to compare two classifications. This index has and expected value of zero in the case of random partitions, and it is bounded above by 1 in the case of perfect agreement between two partitions.

Use the `adjustedRandIndex()`

function to compare the `cell_annotation`

to your hierarchical clustering

## Solution

```
adjustedRandIndex(
cutree(data_hclust, k = 9), cell_annotation
)
```

`[1] 0.5204846`

Modify the following code to study the relation between the adjusted Rand index and the number of PCs used to compute the distance

```
$x[, 1:3] %>%
data_pcadist() %>%
hclust() %>%
cutree(k = 9) %>%
adjustedRandIndex(cell_annotation)
```

What can you conclude ?

## Solution

```
tibble(
n_pcs = seq(from = 3, to = 100, by = 10)
%>%
) mutate(
ari = purrr::map(n_pcs, function(n_pcs, data_pca, cell_annotation) {
$x[, 1:n_pcs] %>%
data_pcadist() %>%
hclust() %>%
cutree(k = 9) %>%
adjustedRandIndex(cell_annotation)
data_pca = data_pca, cell_annotation = cell_annotation)
}, %>%
) unnest(ari) %>%
ggplot() +
geom_line(aes(x = n_pcs, y = ari))
```

Is it a PCA problem ?

## Solution

```
tibble(
n_gene = seq(from = 3, to = 600, by = 20)
%>%
) mutate(
ari = purrr::map(n_gene, function(n_gene, data, var_gene_2000, cell_annotation) {
1:n_gene], ] %>%
data[var_gene_2000[t() %>%
dist() %>%
hclust() %>%
cutree(k = 9) %>%
adjustedRandIndex(cell_annotation)
data = data, var_gene_2000 = var_gene_2000, cell_annotation = cell_annotation)
}, %>%
) unnest(ari) %>%
ggplot() +
geom_line(aes(x = n_gene, y = ari))
```

## 4.5 k-means clustering

The `kmeans()`

function performs a k-means clustering analysis from a distance matrix.

`<- kmeans(data_dist, centers = 9) data_kmeans `

Why is the `centers`

parameter required for `kmeans()`

and not for the `hclust()`

function ?

We want to compare the cells annotation to our clustering.

Using the `str()`

function to explore the `data_kmeans`

result, make the following plot from your k-means results.

## Solution

```
%>%
data_pca fviz_pca_ind(
geom = "point",
col.ind = as.factor(data_kmeans$cluster)
)
```

Use the `adjustedRandIndex()`

function to compare the `cell_annotation`

to your k-means clustering.

## Solution

```
adjustedRandIndex(
$cluster, cell_annotation
data_kmeans )
```

`[1] 0.5206136`

Maybe the real number of clusters in the PCs data is not \(k=9\). We can use different metrics to evaluate the effect of the number of clusters on the quality of the clustering.

- WSS: the Within-Cluster-Sum of Squared Errors (each point vs the centroid) for different values of \(k\), \(\sum_{i=1}^n (x_i - c_i)^2\)
- The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation). \(s(i) = \frac{b(i) - a(i)}{\max{a(i),b(i)}}\) with \(a(i)\) the mean distance between \(i\) and cells in the same cluster and \(b(i)\) the mean distance between \(i\) and cells in different clusters. We plot \(\frac{1}{n}\sum_{i=1}^n s(i)\)

Use the `fviz_nbclust()`

function to plot these two metrics as a function of the number of clusters.

## Solution

For the total within-cluster-sum of squared errors

`fviz_nbclust(data_pca$x[, 1:3], kmeans, method = "wss")`

For the average silhouette width

`fviz_nbclust(data_pca$x[, 1:3], kmeans, method = "silhouette")`

With `fviz_nbclust()`

you can make the same analysis for your hierarchical clustering. The function `hcut()`

allow you to perform `hclust()`

and `cutree()`

in one step.

## Solution

`fviz_nbclust(data_pca$x[, 1:3], hcut, method = "wss")`

`fviz_nbclust(data_pca$x[, 1:3], hcut, method = "silhouette")`

Explain the discrepancy between these results and \(k=9\)

## 4.6 Graph-based clustering

We are going to use the `cluster_louvain()`

function to perform a graph-based clustering. This function takes into input an undirected graph instead of a distance matrix.

The `nng()`

function computes a k-nearest neighbor graph. With the `mutual = T`

option, this graph is undirected.

Check the effect of the `mutual = T`

on the `data_knn`

with the following code

## Solution

```
<- data_dist %>%
data_knn as.matrix() %>%
nng(k = 30, mutual = T)
<- data_dist %>%
data_knn_F as.matrix() %>%
nng(k = 30, mutual = F)
str(data_knn)
str(data_knn_F)
```

Why do we need a knn ?

The `cluster_louvain()`

function implements the multi-level modularity optimization algorithm for finding community structure in a graph. Use this function on `data_knn`

to create a `data_louvain`

variable.

You can check the clustering results with `membership(data_louvain)`

.

For which `resolution`

value do you get 9 clusters ?

## Solution

```
<- data_knn %>%
data_louvain cluster_louvain(resolution = 0.41)
```

```
%>%
data_pca fviz_pca_ind(
geom = "point",
col.ind = as.factor(membership(data_louvain))
)
```

Use the `adjustedRandIndex()`

function to compare the `cell_annotation`

to your graph-based clustering.

## Solution

```
adjustedRandIndex(
membership(data_louvain), cell_annotation
)
```

`[1] 0.4506219`

## 4.7 Graph-based dimension reduction

Uniform Manifold Approximation and Projection (UMAP) is an algorithm for dimensional reduction. Its details are described by McInnes, Healy, and Melville and its official implementation is available through a python package umap-learn.

```
library(umap)
<- umap(data_pca$x[, 1:10])
data_umap $layout %>%
data_umapas_tibble(.name_repair = "universal") %>%
mutate(cell_type = cell_annotation) %>%
ggplot() +
geom_point(aes(x = ...1, y = ...2, color = cell_type))
```

```
New names:
• `` -> `...1`
• `` -> `...2`
```

What can you say about the axes of this plot ?

The .Rmd file corresponding to this page is available here under the AGPL3 Licence

## 4.8 Implementing your own \(k\)-means clustering algorithm

You are going to implement your own \(k\)-means algorithm in this section. The \(k\)-means algorithm follow the following steps:

- Assign point to the cluster with the nearest centroid
- Compute the new cluster centroids

Justify each of your function.

Think about the starting state of your algorithm and the stopping condition

Start by implementing a `kmeans_initiation(x, k)`

function, returning \(k\) centroids as a starting point.

Implement a `compute_distance(x, centroid)`

function that compute the distance of each point (row of x) to each centroid

Implement a `cluster_assignment(distance)`

function returning the assignment of each point (row of x), based on the results of your `compute_distance(x, centroid)`

function.

Implement a `centroid_update(x, cluster, k)`

function returning the updated centroid for your clusters.

Implement a `metric_example(x, k)`

function to compute a criteria of the goodness st of your clustering.

Implement a `kmeans_example(x, k)`

function, wrapping everything and test it.

```
%>%
data_pca fviz_pca_ind(
geom = "point",
col.ind = as.factor(kmeans_example(data_pca$x[, 1:2], k = 9))
)
```