library(tidyverse) # to manipule data and make plot
library(factoextra) # manipulate pca results
library(palmerpenguins) # we load the data2 Practice: Introduction to Principal Component Analysis

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
2.1 Introduction
One of the most widely used tools in big data analysis is the principal component analysis or PCA method. PCA applications are multiples, it can be used for data visualization, data exploration or as a preprocessing step to reduce the dimension of your data before applying other methods.
2.2 Loading the libraries
2.3 loading the data
We are going to work on the famous Palmer penguins dataset. This dataset is an integrative study of the breeding ecology and population structure of Pygoscelis penguins along the western Antarctic Peninsula. These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program (part of the US Long Term Ecological Research Network).
The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

The palmerpenguins library load the penguins dataset into your R environment. If you are not familiar with tibble, you just have to know that they are equivalent to data.frame (but easier to work with).
penguins# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
We have 8 variables for 344 individuals:
dim(penguins)[1] 344 8
The data is tidy:
- Each variable has its own column.
- Each observation has its own row.
- Each value must have its own cell.
Meeting these 3 criteria for your data will simplify your data processing and analysis as most of the algorithm expect this format.
summary(penguins) species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
What are the continuous and categorial variables in this table ? What is going to be the problem with the raw penguins table ?
Solution
We first are going to drop the missing values (NA)
data <- penguins %>% drop_na()
dim(data)[1] 333 8
%>% operator or pipe in R: It takes the output of the function on the left and pass it as the first argument of the function on the right.
For the sake of this practical, we are going to focus on the continuous variables in the data bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g.

The function pairs renders scatter plots of each possible pairs of variables
data %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
pairs()
What can you say about the covariation structure of the data ?
To explore the PCA algorithm we are first going to focus on 2 two continuous variable in this data set: the bill length and depth (bill_length_mm, bill_depth_mm) for the female penguins (sex).
data %>%
filter(sex == "female") %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species))
Using the filter and select functions, create a data_f data set that meet the sex == "female" condition and contains only the c(bill_length_mm, bill_depth_mm) variables
Solution
data_f <- data %>%
filter(sex == "female") %>% # we select the row where sexe is female
select(c(bill_length_mm, bill_depth_mm)) # we select the two columns2.4 Performing your first PCA
The prcomp and princomp functions are implementations of the PCA methods
data_f_pca <- prcomp(data_f, scale = T)You can use the str() function to explore the data_f_pca object.
Compare the center and scale slot with the moments of the data_f table
Solution
We display the detail of the data_f_pca object with str()
str(data_f_pca)List of 5
$ sdev : num [1:2] 1.194 0.757
$ rotation: num [1:2, 1:2] 0.707 -0.707 -0.707 -0.707
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "bill_length_mm" "bill_depth_mm"
.. ..$ : chr [1:2] "PC1" "PC2"
$ center : Named num [1:2] 42.1 16.4
..- attr(*, "names")= chr [1:2] "bill_length_mm" "bill_depth_mm"
$ scale : Named num [1:2] 4.9 1.8
..- attr(*, "names")= chr [1:2] "bill_length_mm" "bill_depth_mm"
$ x : num [1:165, 1:2] -0.758 -0.879 -1.91 -1.002 -0.606 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "PC1" "PC2"
- attr(*, "class")= chr "prcomp"
data_f_pca$centerbill_length_mm bill_depth_mm
42.09697 16.42545
data_f_pca$scalebill_length_mm bill_depth_mm
4.903476 1.795681
We can compare these values with the summary() of the data_f table
map(data_f, mean)$bill_length_mm
[1] 42.09697
$bill_depth_mm
[1] 16.42545
map(data_f, sd)$bill_length_mm
[1] 4.903476
$bill_depth_mm
[1] 1.795681
map apply a function to each element of a list or vector.
The package factoextra provides us with functions to manipulate and plot the output of the prcomp function. The most common usage of the PCA results is to display the individuals on the first factorial plan.
You can do this with the fviz_pca_ind function.
Compare the fviz_pca_ind output with the bill_length_mm and bill_depth_mm scatter plot (you need the species variable)
species_f <- data %>% filter(sex == "female") %>% pull(species)Solution
fviz_pca_ind(data_f_pca,
geom = "point",
col.ind = species_f
)
We have a rotation of the point coordinates
What are the percentages in the Dim1 and Dim2 axes ?
Solution
- 71.3 % of total variation is explained by the first principal component
- 28.7 % of total variation is explained by the second principal component
You can compute these percentages with the following formula
data_f_pca$sdev^2 / sum(data_f_pca$sdev^2)[1] 0.7131902 0.2868098
You can reproduce the same plot by using the new individual’s coordinates in the data_f_pca x slot.
as_tibble(data_f_pca$x) %>%
bind_cols(data %>% filter(sex == "female")) %>% # we add all the other variables to color by species
ggplot() +
geom_point(aes(x = PC1, y = PC2, color = species))
2.5 Computing your own PCA
Now that we have access to the data_f_pca results for our two variables, you are going to try to mirror the computation of prcomp step by step as described this morning.
2.5.1 Scaling
The first step is to scale the data. With the mutate() function create a diy_pca tibble with the scaled version of the bill_length_mm and bill_depth_mm variables.
Solution
diy_data_f <- data_f %>%
mutate(
bill_length_mm = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm),
bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
)Explain the importance of the centering and scaling steps of the data
Solution
- centering set the empirical mean to 0, if the data are not center the distance matrix is not the covariance matrix
- scaling put every variable on the same scale
2.5.2 Data projection
Then we are going to try to manually find the best line on which to project our point (the first axis or first principal component of our PCA).
In the following code, we define a slope (line_slope) to draw a line passing through the origin. We then compute the orthogonal projection of the points on the resulting line.
Play with the line_slope to see how this projection works, but first you will have to compute the projection
Now you are going to replace the
results <- first_pc_projection_code(line_slope, x, y)with your own code.
For this you need to:
-
create a vector passing through the origin with the angle
line_slope - scale this vector by its length
-
use the
%*%product between the scaled vector and your point coordinate to compute the point projection - create the vector with the right angle and right length
line_slope <- 2 # to change
point_projection <- function(line_slope, x, y){
results <- first_pc_projection_code(line_slope, x, y) # to replace with your code
return(list(x = results[1], y = results[2]))
}
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
projection_x = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$x,
projection_y = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$y
)
diy_pca %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = line_slope, color = "red") +
geom_point(aes(x = projection_x, y = projection_y), color = "red") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = projection_x,
yend = projection_y), color = "red", linewidth = 0.1) +
coord_equal()
Solution
first_pc_projection_code <- function(line_slope, x, y){
a <- c(x, y)
b <- c(1, line_slope)
scaled_b <- b / c(sqrt(sum(b^2)))
c(a %*% scaled_b) * scaled_b
}2.5.3 Evaluation of the projection
We can minimize the distance from each point to the red line. Or we can maximize the distance from each point to the point of origin.
Explain why these two approaches are equivalent.
The following code is the same as before but with the
SS: Sum-square of distances of the projected point to the originSR: Sum of residuals, the distances of the point to their projection
Write the formula to compute the S_dist and Residuals variables in the mutate() function
line_slope <- 0.2
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
projection_x = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$x,
projection_y = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$y,
S_dist = ,# right formula
Residuals = # right formula
)
diy_pca %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = line_slope, color = "red") +
geom_point(aes(x = projection_x, y = projection_y), color = "red") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = projection_x,
yend = projection_y), color = "red", linewidth = 0.1) +
labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
", SR = ", round(sum(diy_pca$Residuals), 2))) +
coord_equal()Solution
S_dist = projection_x^2 + projection_y^2,
Residuals = sqrt((projection_x - bill_length_mm)^2 + (projection_y - bill_depth_mm)^2)2.5.4 Optimal PCA projection
You have seen this morning that the best line on which to project the point can be found with the covariance matrix of your data.
You can use the cov() function to perform this computation
Solution
diy_cov <- diy_data_f %>% as.matrix() %>% cov()
diy_cov bill_length_mm bill_depth_mm
bill_length_mm 1.0000000 -0.4263804
bill_depth_mm -0.4263804 1.0000000
You can use the eigen() function to get the eigenvalues and vectors of the covariance matrix.
Solution
eigen(diy_cov)eigen() decomposition
$values
[1] 1.4263804 0.5736196
$vectors
[,1] [,2]
[1,] -0.7071068 -0.7071068
[2,] 0.7071068 -0.7071068
To obtain the following figure you will need to write the body of the point_projection function, taking into parameters the diy_cov results instead of the previous line_slope.
Then you will need to compute the slope value for the geom_abline function from the diy_cov results.

point_projection <- function(diy_cov, x, y){
# your code
list(x = results[1], y = results[2])
}
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$x,
pc1_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$y,
S_dist = , # right formula
Residuals = # right formula
)
diy_pca %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = , color = "red") + # missing slope value here
geom_point(aes(x = pc1_x, y = pc1_y), color = "red") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = pc1_x,
yend = pc1_y), color = "red", linewidth = 0.1) +
labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
", SR = ", round(sum(diy_pca$Residuals), 2))) +
coord_equal()Solution
For the projection function:
point_projection <- function(diy_cov, x, y){
a <- c(x, y)
b <- eigen(diy_cov)$vector[, 1]
results <- c(a %*% b) * b
list(x = results[1], y = results[2])
}For the slope value:
geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +Do you have the same results as your neighbors ?
We are now going to plot the second principal component and the projection of the data point on it
Adapt your previous code to perform the computation on the PC2

point_projection <- function(diy_cov, x, y){
# your code
return(list(x = results[1], y = results[2]))
}
diy_pca <- diy_pca %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc2_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$x,
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$y,
S_dist = , # your code
Residuals = # your code
)
diy_pca %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = , color = "red") + # slope of the PC1
geom_abline(slope = , color = "blue") + # slope of the PC2
geom_point(aes(x = pc2_x, y = pc2_y), color = "blue") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = pc2_x,
yend = pc2_y), color = "red", linewidth = 0.1) +
labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
", SR = ", round(sum(diy_pca$Residuals), 2))) +
coord_equal()Solution
For the projection function:
point_projection <- function(diy_cov, x, y){
a <- c(x, y)
b <- eigen(diy_cov)$vector[, 2]
results <- c(a %*% b) * b
return(list(x = results[1], y = results[2]))
}For the slope value:
geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
geom_abline(slope = eigen(diy_cov)$vector[2, 2] / eigen(diy_cov)$vector[1, 2], color = "blue") +For the PCA construction we want, a PC2 orthogonal to the PC1.
How many lines are compatible with the orthogonality constraint for 2 variables ? For 3 variables ?
You can merge your previous computation to plot the projection on the 2 first PCs

2.5.5 PCA base
Now that we know how to compute our first 2 PCs, we want to represent the data point in the PCs coordinates.
Modify the following code to plot your data in the PCs space.
point_projection <- function(diy_cov, x, y, PC) {
# your code
}
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 1
),
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 2
),
)
diy_pca %>%
bind_cols(
data %>% select(-colnames(diy_pca)[1:2]) %>% filter(sex == "female")
) %>%
ggplot() +
geom_point(aes(x = pc1_x, y = pc2_y, color = species))
Solution
point_projection <- function(diy_cov, x, y, PC) {
a <- c(x, y)
b <- eigen(diy_cov)$vector[, PC]
a %*% b
}2.5.6 Comparison with prcomp
In the prcomp output you can directly get the coordinates in PCs space from the $x slot.
diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 1
),
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 2
),
) %>%
ungroup() %>%
mutate(
pc1_x_ref = data_f_pca$x[, 1],
pc2_y_ref = data_f_pca$x[, 2]
) %>%
bind_cols(
data %>% select(-colnames(diy_pca)[1:2]) %>% filter(sex == "female")
) %>%
ggplot() +
geom_point(aes(x = pc1_x, y = pc2_y, color = species), alpha = 0.5) +
geom_point(aes(x = pc1_x_ref, y = pc2_y_ref, color = species), size = 0.5)
What could be the problem when comparing these two PCA results ?
2.6 PCA evaluation
Compute the PCA (data_f_pca) of the bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g for the female penguins
Solution
data_f_pca <- data %>%
filter(sex == "female") %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
prcomp(scale = TRUE)species_f <- data %>%
filter(sex == "female") %>%
pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
geom = "point",
col.ind = species_f
)
What are the main differences with the bill_length_mm and bill_depth_mm PCA ?
2.6.1 Explained inertia
You have seen this morning that there are different criteria to evaluate the PCA projection.
SS(PC1): Eigenvalue for PC1sqrt(SS(PC1)): Singular value for PC1SS(PC1)/(n-1): Variation for PC1Var(PC1)/sum(Var(PCx)): PC1 accounts for x Variation around the PCs
From the data_f_pca try to compute these 4 criteria
Solution
Using the get_eigenvalue function
get_eigenvalue(data_f_pca) eigenvalue variance.percent cumulative.variance.percent
Dim.1 3.0313927 75.784817 75.78482
Dim.2 0.6138918 15.347295 91.13211
Dim.3 0.2444716 6.111790 97.24390
Dim.4 0.1102439 2.756099 100.00000
using the data_f_pca slot
data_f_pca$sdev^2[1] 3.0313927 0.6138918 0.2444716 0.1102439
data %>%
filter(sex == "female") %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
scale() %>%
cov() %>%
eigen()eigen() decomposition
$values
[1] 3.0313927 0.6138918 0.2444716 0.1102439
$vectors
[,1] [,2] [,3] [,4]
[1,] 0.4109840 -0.8780384 0.2450936 0.008359691
[2,] -0.4966517 -0.4387233 -0.7337963 -0.149672643
[3,] 0.5438179 0.1611183 -0.3086541 -0.763567728
[4,] 0.5373001 0.1030113 -0.5533577 0.628086412
(pc_var <- data_f_pca$sdev^2 / (ncol(data_f_pca$x) - 1))[1] 1.01046422 0.20463059 0.08149053 0.03674798
pc_var / sum(pc_var)[1] 0.75784817 0.15347295 0.06111790 0.02756099
Why are these metric important when analyzing PCA outputs ?
2.6.2 Scree plot
The fviz_eig function create a scree plot of your PCA.
Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
Ignoring empty aesthetic: `width`.

From your previous code, create your own scree plot
Solution
tibble(
pc = 1:4,
var = pc_var / sum(pc_var)
) %>%
ggplot() +
geom_point(aes(x = pc, y = var))
2.6.3 Variable contribution
The axis of the PCA are linear combinations of the variable axis.
What does it mean to find a slope of 0.5 for PC1 in the bill_length_mm, bill_depth_mm plot ?
Solution
It means that if bill_depth_mm contribute for 1 to PC1, bill_length_mm contribute for 0.5
As the number of variables increases, so is the complexity of the linear combinations for each PC. We can represent the variable axis in the new PCA axis, this representation is called the correlation circle.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
Please report the issue at <https://github.com/kassambara/factoextra/issues>.

With the fviz_pca_var() function, which are the variables that contribute the most to PC1 ? To PC2 ? To both ?
Use the str() function to find this information in the data_f_pca object
Finally, we can use the fviz_pca_biplot function to display the individuals and variable information on the same plot.

Remember that we scaled all the variables before computing the PCA. What are the results of the scaling in terms of variable unit contribution ?
2.6.4 Quality of the projection
You have seen this morning different metrics to check the quality of the representation for individuals and for variables.
The get_pca_var() and get_pca_ind() are two symmetrical functions to get relevant information for the individuals and the variables.
Explore the results of these two functions.
What is the variable contributing the most to PC4 ?
How do you interpret the cos2 slot
Solution
# results per variables
res_var <- get_pca_var(data_f_pca)
res_var$coord # coordinates Dim.1 Dim.2 Dim.3 Dim.4
bill_length_mm 0.7155599 0.68795402 -0.1211843 -0.002775669
bill_depth_mm -0.8647150 0.34374518 0.3628188 0.049695814
flipper_length_mm 0.9468356 -0.12623822 0.1526111 0.253527423
body_mass_g 0.9354876 -0.08071066 0.2736026 -0.208543556
res_var$contrib # contribution to the axes Dim.1 Dim.2 Dim.3 Dim.4
bill_length_mm 16.89078 77.095140 6.007089 0.006988444
bill_depth_mm 24.66629 19.247815 53.845705 2.240189999
flipper_length_mm 29.57379 2.595912 9.526734 58.303567516
body_mass_g 28.86914 1.061133 30.620472 39.449254041
res_var$cos2 # quality of the representation Dim.1 Dim.2 Dim.3 Dim.4
bill_length_mm 0.5120259 0.47328073 0.01468563 7.704336e-06
bill_depth_mm 0.7477321 0.11816075 0.13163746 2.469674e-03
flipper_length_mm 0.8964976 0.01593609 0.02329016 6.427615e-02
body_mass_g 0.8751370 0.00651421 0.07485836 4.349041e-02
Modify the following code to display the individuals that contribute the most to PC1 and PC2. What can you say about these individuals ?
# results per individuals
res_ind <- get_pca_ind(data_f_pca)
species_f <- data %>%
filter(sex == "female") %>%
pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
geom = "point",
col.ind = (res_ind$contrib[, 3])
)
With the same modifications for the following code, which species are better represented by PC1 and which by PC2 ?
# results per individuals
res_ind <- get_pca_ind(data_f_pca)
species_f <- data %>%
filter(sex == "female") %>%
pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
geom = "point",
col.ind = (res_ind$cos2[, 3])
)
2.7 Adding new individuals
We can compute the coordinates of new individuals as follows:
- scale the new individual variables using the scaling information from the PCA
- Compute the coordinates by multipling by the eigenvalues of the principal components.
data_m_scale <- data %>%
filter(sex == "male") %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
scale(
center = data_f_pca$center,
scale = data_f_pca$scale
)coord_func <- function(ind, loadings) {
r <- loadings * ind
apply(r, 2, sum)
}
data_m_pca <- t(apply(data_m_scale, 1, coord_func, data_f_pca$rotation))
as_tibble(data_f_pca$x) %>%
bind_cols(
data %>% filter(sex == "female")
) %>%
bind_rows(
as_tibble(data_m_pca) %>%
bind_cols(
data %>% filter(sex == "male")
)
) %>%
ggplot() +
geom_point(aes(x = PC1, y = PC2, color = species, shape = sex))
2.8 Comparison with SVD decomposition
Compare the results of your data_f_pca to the one obtained with the following code.
data_f_svd <- data %>%
filter(sex == "female") %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
mutate(
bill_length_mm = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm),
bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
flipper_length_mm = (flipper_length_mm - mean(flipper_length_mm)) / sd(flipper_length_mm),
body_mass_g = (body_mass_g - mean(body_mass_g)) / sd(body_mass_g),
) %>%
svd()The .Rmd file corresponding to this page is available here under the AGPL3 Licence