library(tidyverse) # to manipule data and make plot
library(factoextra) # manipulate pca results
library(palmerpenguins) # we load the data
2 Practice: Introduction to Principal Component Analysis
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
2.1 Introduction
One of the most widely used tools in big data analysis is the principal component analysis or PCA method. PCA applications are multiples, it can be used for data visualization, data exploration or as a preprocessing step to reduce the dimension of your data before applying other methods.
2.2 Loading the libraries
2.3 loading the data
We are going to work on the famous Palmer penguins dataset. This dataset is an integrative study of the breeding ecology and population structure of Pygoscelis penguins along the western Antarctic Peninsula. These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program (part of the US Long Term Ecological Research Network).
The palmerpenguins
data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
The palmerpenguins
library load the penguins
dataset into your R environment. If you are not familiar with tibble
, you just have to know that they are equivalent to data.frame
(but easier to work with).
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
We have 8 variables for 344 individuals:
dim(penguins)
[1] 344 8
The data is tidy:
- Each variable has its own column.
- Each observation has its own row.
- Each value must have its own cell.
Meeting these 3 criteria for your data will simplify your data processing and analysis as most of the algorithm expect this format.
summary(penguins)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
What are the continuous and categorial variables in this table ? What is going to be the problem with the raw penguins
table ?
Solution
We first are going to drop the missing values (NA
)
<- penguins %>% drop_na()
data dim(data)
[1] 333 8
%>%
operator or pipe in R: It takes the output of the function on the left and pass it as the first argument of the function on the right.
For the sake of this practical, we are going to focus on the continuous variables in the data bill_length_mm
, bill_depth_mm
, flipper_length_mm
and body_mass_g
.
The function pairs
renders scatter plots of each possible pairs of variables
%>%
data select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
pairs()
What can you say about the covariation structure of the data ?
To explore the PCA algorithm we are first going to focus on 2 two continuous variable in this data set: the bill length and depth (bill_length_mm
, bill_depth_mm
) for the female penguins (sex
).
%>%
data filter(sex == "female") %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species))
Using the filter
and select
functions, create a data_f
data set that meet the sex == "female"
condition and contains only the c(bill_length_mm, bill_depth_mm)
variables
Solution
<- data %>%
data_f filter(sex == "female") %>% # we select the row where sexe is female
select(c(bill_length_mm, bill_depth_mm)) # we select the two columns
2.4 Performing your first PCA
The prcomp
and princomp
functions are implementations of the PCA methods
<- prcomp(data_f, scale = T) data_f_pca
You can use the str()
function to explore the data_f_pca
object.
Compare the center
and scale
slot with the moments of the data_f
table
Solution
We display the detail of the data_f_pca
object with str()
str(data_f_pca)
List of 5
$ sdev : num [1:2] 1.194 0.757
$ rotation: num [1:2, 1:2] 0.707 -0.707 -0.707 -0.707
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "bill_length_mm" "bill_depth_mm"
.. ..$ : chr [1:2] "PC1" "PC2"
$ center : Named num [1:2] 42.1 16.4
..- attr(*, "names")= chr [1:2] "bill_length_mm" "bill_depth_mm"
$ scale : Named num [1:2] 4.9 1.8
..- attr(*, "names")= chr [1:2] "bill_length_mm" "bill_depth_mm"
$ x : num [1:165, 1:2] -0.758 -0.879 -1.91 -1.002 -0.606 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "PC1" "PC2"
- attr(*, "class")= chr "prcomp"
$center data_f_pca
bill_length_mm bill_depth_mm
42.09697 16.42545
$scale data_f_pca
bill_length_mm bill_depth_mm
4.903476 1.795681
We can compare these values with the summary()
of the data_f
table
map(data_f, mean)
$bill_length_mm
[1] 42.09697
$bill_depth_mm
[1] 16.42545
map(data_f, sd)
$bill_length_mm
[1] 4.903476
$bill_depth_mm
[1] 1.795681
map
apply a function to each element of a list or vector.
The package factoextra
provides us with functions to manipulate and plot the output of the prcomp
function. The most common usage of the PCA results is to display the individuals on the first factorial plan.
You can do this with the fviz_pca_ind
function.
Compare the fviz_pca_ind
output with the bill_length_mm
and bill_depth_mm
scatter plot (you need the species variable)
<- data %>% filter(sex == "female") %>% pull(species) species_f
Solution
fviz_pca_ind(data_f_pca,
geom = "point",
col.ind = species_f
)
We have a rotation of the point coordinates
What are the percentages in the Dim1 and Dim2 axes ?
Solution
- 71.3 % of total variation is explained by the first principal component
- 28.7 % of total variation is explained by the second principal component
You can compute these percentages with the following formula
$sdev^2 / sum(data_f_pca$sdev^2) data_f_pca
[1] 0.7131902 0.2868098
You can reproduce the same plot by using the new individual’s coordinates in the data_f_pca
x
slot.
as_tibble(data_f_pca$x) %>%
bind_cols(data %>% filter(sex == "female")) %>% # we add all the other variables to color by species
ggplot() +
geom_point(aes(x = PC1, y = PC2, color = species))
2.5 Computing your own PCA
Now that we have access to the data_f_pca
results for our two variables, you are going to try to mirror the computation of prcomp
step by step as described this morning.
2.5.1 Scaling
The first step is to scale the data. With the mutate()
function create a diy_pca
tibble with the scaled version of the bill_length_mm
and bill_depth_mm
variables.
Solution
<- data_f %>%
diy_data_f mutate(
bill_length_mm = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm),
bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
)
Explain the importance of the centering and scaling steps of the data
Solution
- centering set the empirical mean to 0, if the data are not center the distance matrix is not the covariance matrix
- scaling put every variable on the same scale
2.5.2 Data projection
Then we are going to try to manually find the best line on which to project our point (the first axis or first principal component of our PCA).
In the following code, we define a slope (line_slope
) to draw a line passing through the origin. We then compute the orthogonal projection of the points on the resulting line.
Play with the line_slope
to see how this projection works
<- 2 # to change
line_slope
<- function(line_slope, x, y){
point_projection <- first_pc_projection_code(line_slope, x, y) # to replace with your code
results return(list(x = results[1], y = results[2]))
}
<- diy_data_f %>%
diy_pca rowwise() %>% # perform the subsequent opperation row by row
mutate(
projection_x = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$x,
projection_y = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$y
)
%>%
diy_pca ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = line_slope, color = "red") +
geom_point(aes(x = projection_x, y = projection_y), color = "red") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = projection_x,
yend = projection_y), color = "red", linewidth = 0.1) +
coord_equal()
Now you are going to replace the
<- first_pc_projection_code(line_slope, x, y) results
with your own code.
For this you need to:
-
create a vector passing through the origin with the angle
line_slope
- scale this vector by its length
-
use the
%*%
product between the scaled vector and your point coordinate to compute the point projection - create the vector with the right angle and right length
Solution
<- function(line_slope, x, y){
first_pc_projection_code <- c(x, y)
a <- c(1, line_slope)
b <- b / c(sqrt(sum(b^2)))
scaled_b c(a %*% scaled_b) * scaled_b
}
2.5.3 Evaluation of the projection
We can minimize the distance from each point to the red line. Or we can maximize the distance from each point to the point of origin.
Explain why these two approaches are equivalent.
The following code is the same as before but with the
SS
: Sum-square of distances of the projected point to the originSR
: Sum of residuals, the distances of the point to their projection
Write the formula to compute the S_dist
and Residuals
variables in the mutate()
function
<- 0.2
line_slope <- diy_data_f %>%
diy_pca rowwise() %>% # perform the subsequent opperation row by row
mutate(
projection_x = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$x,
projection_y = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$y,
S_dist = ,# right formula
Residuals = # right formula
)
%>%
diy_pca ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = line_slope, color = "red") +
geom_point(aes(x = projection_x, y = projection_y), color = "red") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = projection_x,
yend = projection_y), color = "red", linewidth = 0.1) +
labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
", SR = ", round(sum(diy_pca$Residuals), 2))) +
coord_equal()
Solution
= projection_x^2 + projection_y^2,
S_dist = sqrt((projection_x - bill_length_mm)^2 + (projection_y - bill_depth_mm)^2) Residuals
2.5.4 Optimal PCA projection
You have seen this morning that the best line on which to project the point can be found with the covariance matrix of your data.
You can use the cov()
function to perform this computation
Solution
<- diy_data_f %>% as.matrix() %>% cov()
diy_cov diy_cov
bill_length_mm bill_depth_mm
bill_length_mm 1.0000000 -0.4263804
bill_depth_mm -0.4263804 1.0000000
You can use the eigen()
function to get the eigenvalues and vectors of the covariance matrix.
Solution
eigen(diy_cov)
eigen() decomposition
$values
[1] 1.4263804 0.5736196
$vectors
[,1] [,2]
[1,] -0.7071068 -0.7071068
[2,] 0.7071068 -0.7071068
To obtain the following figure you will need to write the body of the point_projection
function, taking into parameters the diy_cov
results instead of the previous line_slope
.
Then you will need to compute the slope
value for the geom_abline
function from the diy_cov
results.
<- function(diy_cov, x, y){
point_projection # your code
list(x = results[1], y = results[2])
}
<- diy_data_f %>%
diy_pca rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$x,
pc1_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$y,
S_dist = , # right formula
Residuals = # right formula
)
%>%
diy_pca ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = , color = "red") + # missing slope value here
geom_point(aes(x = pc1_x, y = pc1_y), color = "red") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = pc1_x,
yend = pc1_y), color = "red", linewidth = 0.1) +
labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
", SR = ", round(sum(diy_pca$Residuals), 2))) +
coord_equal()
Solution
For the projection function:
<- function(diy_cov, x, y){
point_projection <- c(x, y)
a <- eigen(diy_cov)$vector[, 1]
b <- c(a %*% b) * b
results list(x = results[1], y = results[2])
}
For the slope value:
geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
Do you have the same results as your neighbors ?
We are now going to plot the second principal component and the projection of the data point on it
Adapt your previous code to perform the computation on the PC2
<- function(diy_cov, x, y){
point_projection # you corde
return(list(x = results[1], y = results[2]))
}
<- diy_pca %>%
diy_pca rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc2_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$x,
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$y,
S_dist = , # your code
Residuals = # your code
)
%>%
diy_pca ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = , color = "red") + # slope of the PC1
geom_abline(slope = , color = "blue") + # slope of the PC2
geom_point(aes(x = pc2_x, y = pc2_y), color = "blue") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = pc2_x,
yend = pc2_y), color = "red", linewidth = 0.1) +
labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
", SR = ", round(sum(diy_pca$Residuals), 2))) +
coord_equal()
Solution
For the projection function:
<- function(diy_cov, x, y){
point_projection <- c(x, y)
a <- eigen(diy_cov)$vector[, 2]
b <- c(a %*% b) * b
results return(list(x = results[1], y = results[2]))
}
For the slope value:
geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
geom_abline(slope = eigen(diy_cov)$vector[2, 2] / eigen(diy_cov)$vector[1, 2], color = "blue") +
For the PCA construction we want, a PC2 orthogonal to the PC1.
How many lines are compatible with the orthogonality constraint for 2 variables ? For 3 variables ?
You can merge your previous computation to plot the projection on the 2 first PCs
2.5.5 PCA base
Now that we know how to compute our first 2 PCs, we want to represent the data point in the PCs coordinates.
Modify the following code to plot your data in the PCs space.
<- function(diy_cov, x, y, PC){
point_projection # your code
}
<- diy_data_f %>%
diy_pca rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 1),
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 2),
)
%>%
diy_pca bind_cols(
%>% select(-colnames(diy_pca)[1:2]) %>% filter(sex == "female")
data %>%
) ggplot() +
geom_point(aes(x = pc1_x, y = pc2_y, color = species))
Solution
<- function(diy_cov, x, y, PC){
point_projection <- c(x, y)
a <- eigen(diy_cov)$vector[, PC]
b %*% b
a }
2.5.6 Comparison with prcomp
In the prcomp
output you can directly get the coordinates in PCs space from the $x
slot.
%>%
diy_data_f rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 1),
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 2),
%>%
) ungroup() %>%
mutate(
pc1_x_ref = data_f_pca$x[,1],
pc2_y_ref = data_f_pca$x[,2]
%>%
) bind_cols(
%>% select(-colnames(diy_pca)[1:2]) %>% filter(sex == "female")
data %>%
) ggplot() +
geom_point(aes(x = pc1_x, y = pc2_y, color = species), alpha = 0.5) +
geom_point(aes(x = pc1_x_ref, y = pc2_y_ref, color = species), size = 0.5)
What could be the problem when comparing these two PCA results ?
2.6 PCA evaluation
Compute the PCA (data_f_pca
) of the bill_length_mm
, bill_depth_mm
, flipper_length_mm
and body_mass_g
for the female
penguins
Solution
<- data %>%
data_f_pca filter(sex == "female") %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
prcomp(scale = TRUE)
<- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
species_f fviz_pca_ind(data_f_pca,
geom = "point",
col.ind = species_f
)
What are the main differences with the bill_length_mm
and bill_depth_mm
PCA ?
2.6.1 Explained inertia
You have seen this morning that there are different criteria to evaluate the PCA projection.
SS(PC1)
: Eigenvalue for PC1sqrt(SS(PC1))
: Singular value for PC1SS(PC1)/(n-1)
: Variation for PC1Var(PC1)/sum(Var(PCx))
: PC1 accounts for x Variation around the PCs
From the data_f_pca
try to compute these 4 criteria
Solution
Using the get_eigenvalue
function
get_eigenvalue(data_f_pca)
eigenvalue variance.percent cumulative.variance.percent
Dim.1 3.0313927 75.784817 75.78482
Dim.2 0.6138918 15.347295 91.13211
Dim.3 0.2444716 6.111790 97.24390
Dim.4 0.1102439 2.756099 100.00000
using the data_f_pca
slot
$sdev^2 data_f_pca
[1] 3.0313927 0.6138918 0.2444716 0.1102439
%>%
data filter(sex == "female") %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
scale() %>%
cov() %>%
eigen()
eigen() decomposition
$values
[1] 3.0313927 0.6138918 0.2444716 0.1102439
$vectors
[,1] [,2] [,3] [,4]
[1,] 0.4109840 0.8780384 0.2450936 0.008359691
[2,] -0.4966517 0.4387233 -0.7337963 -0.149672643
[3,] 0.5438179 -0.1611183 -0.3086541 -0.763567728
[4,] 0.5373001 -0.1030113 -0.5533577 0.628086412
<- data_f_pca$sdev^2 / (ncol(data_f_pca$x) - 1)) (pc_var
[1] 1.01046422 0.20463059 0.08149053 0.03674798
/ sum(pc_var) pc_var
[1] 0.75784817 0.15347295 0.06111790 0.02756099
Why are these metric important when analyzing PCA outputs ?
2.6.2 Scree plot
The fviz_eig
function create a scree plot of your PCA.
From your previous code, create your own scree plot
Solution
tibble(
pc = 1:4,
var = pc_var / sum(pc_var)
%>%
) ggplot() +
geom_point(aes(x = pc, y = var))
2.6.3 Variable contribution
The axis of the PCA are linear combinations of the variable axis.
What does it mean to find a slope of 0.5
for PC1 in the bill_length_mm
, bill_depth_mm
plot ?
Solution
It means that if bill_depth_mm
contribute for 1 to PC1, bill_length_mm
contribute for 0.5
As the number of variables increases, so is the complexity of the linear combinations for each PC. We can represent the variable axis in the new PCA axis, this representation is called the correlation circle.
With the fviz_pca_var()
function, which are the variables that contribute the most to PC1 ? To PC2 ? To both ?
Use the str()
function to find this information in the data_f_pca
object
Finally, we can use the fviz_pca_biplot
function to display the individuals and variable information on the same plot.
Remember that we scaled all the variables before computing the PCA. What are the results of the scaling in terms of variable unit contribution ?
2.6.4 Quality of the projection
You have seen this morning different metrics to check the quality of the representation for individuals and for variables.
The get_pca_var()
and get_pca_ind()
are two symmetrical functions to get relevant information for the individuals and the variables.
Explore the results of these two functions.
What is the variable contributing the most to PC4 ?
How do you interpret the cos2
slot
Solution
# results per variables
<- get_pca_var(data_f_pca)
res_var $coord # coordinates res_var
Dim.1 Dim.2 Dim.3 Dim.4
bill_length_mm 0.7155599 0.68795402 -0.1211843 -0.002775669
bill_depth_mm -0.8647150 0.34374518 0.3628188 0.049695814
flipper_length_mm 0.9468356 -0.12623822 0.1526111 0.253527423
body_mass_g 0.9354876 -0.08071066 0.2736026 -0.208543556
$contrib # contribution to the axes res_var
Dim.1 Dim.2 Dim.3 Dim.4
bill_length_mm 16.89078 77.095140 6.007089 0.006988444
bill_depth_mm 24.66629 19.247815 53.845705 2.240189999
flipper_length_mm 29.57379 2.595912 9.526734 58.303567516
body_mass_g 28.86914 1.061133 30.620472 39.449254041
$cos2 # quality of the representation res_var
Dim.1 Dim.2 Dim.3 Dim.4
bill_length_mm 0.5120259 0.47328073 0.01468563 7.704336e-06
bill_depth_mm 0.7477321 0.11816075 0.13163746 2.469674e-03
flipper_length_mm 0.8964976 0.01593609 0.02329016 6.427615e-02
body_mass_g 0.8751370 0.00651421 0.07485836 4.349041e-02
Modify the following code to display the individuals that contribute the most to PC1 and PC2. What can you say about these individuals ?
# results per individuals
<- get_pca_ind(data_f_pca)
res_ind <- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
species_f fviz_pca_ind(data_f_pca,
geom = "point",
col.ind = (res_ind$contrib[, 3])
)
With the same modifications for the following code, which species are better represented by PC1 and which by PC2 ?
# results per individuals
<- get_pca_ind(data_f_pca)
res_ind <- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
species_f fviz_pca_ind(data_f_pca,
geom = "point",
col.ind = (res_ind$cos2[, 3])
)
2.7 Adding new individuals
We can compute the coordinates of new individuals as follows:
- scale the new individual variables using the scaling information from the PCA
- Compute the coordinates by multipling by the eigenvalues of the principal components.
<- data %>%
data_m_scale filter(sex == "male") %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
scale(
center = data_f_pca$center,
scale = data_f_pca$scale
)
<- function(ind, loadings){
coord_func <- loadings * ind
r apply(r, 2, sum)
}<- t(apply(data_m_scale, 1, coord_func, data_f_pca$rotation ))
data_m_pca
as_tibble(data_f_pca$x) %>%
bind_cols(
%>% filter(sex == "female")
data %>%
) bind_rows(
as_tibble(data_m_pca) %>%
bind_cols(
%>% filter(sex == "male")
data
)%>%
) ggplot() +
geom_point(aes(x = PC1, y = PC2, color = species, shape = sex))
2.8 Comparison with SVD decomposition
Compare the results of your data_f_pca
to the one obtained with the following code.
<- data %>%
data_f_svd filter(sex == "female") %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
mutate(
bill_length_mm = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm),
bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
flipper_length_mm = (flipper_length_mm - mean(flipper_length_mm)) / sd(flipper_length_mm),
body_mass_g = (body_mass_g - mean(body_mass_g)) / sd(body_mass_g),
%>%
) svd()
The .Rmd file corresponding to this page is available here under the AGPL3 Licence