2 Practice: Introduction to Principal Component Analysis

Author

Ghislain Durif, Laurent Modolo, Franck Picard

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

2.1 Introduction

One of the most widely used tools in big data analysis is the principal component analysis or PCA method. PCA applications are multiples, it can be used for data visualization, data exploration or as a preprocessing step to reduce the dimension of your data before applying other methods.

2.2 Loading the libraries

library(tidyverse) # to manipule data and make plot
library(factoextra) # manipulate pca results
library(palmerpenguins) # we load the data

2.3 loading the data

We are going to work on the famous Palmer penguins dataset. This dataset is an integrative study of the breeding ecology and population structure of Pygoscelis penguins along the western Antarctic Peninsula. These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program (part of the US Long Term Ecological Research Network).

The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

The palmerpenguins library load the penguins dataset into your R environment. If you are not familiar with tibble, you just have to know that they are equivalent to data.frame (but easier to work with).

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

We have 8 variables for 344 individuals:

dim(penguins)

[1] 344   8

The data is tidy:

Each variable has its own column.
Each observation has its own row.
Each value must have its own cell.

Meeting these 3 criteria for your data will simplify your data processing and analysis as most of the algorithm expect this format.

summary(penguins)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2

What are the continuous and categorial variables in this table ? What is going to be the problem with the raw penguins table ?

Solution

We first are going to drop the missing values (NA)

data <- penguins %>% drop_na()
dim(data)

[1] 333   8

If you are not familiar with the %>% operator or pipe in R: It takes the output of the function on the left and pass it as the first argument of the function on the right.

For the sake of this practical, we are going to focus on the continuous variables in the data bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g.

The function pairs renders scatter plots of each possible pairs of variables

data %>%
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>% 
  pairs()

What can you say about the covariation structure of the data ?

To explore the PCA algorithm we are first going to focus on 2 two continuous variable in this data set: the bill length and depth (bill_length_mm, bill_depth_mm) for the female penguins (sex).

data %>% 
  filter(sex == "female") %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species))

Using the filter and select functions, create a data_f data set that meet the sex == "female" condition and contains only the c(bill_length_mm, bill_depth_mm) variables

Solution

data_f <- data %>%
  filter(sex == "female") %>% # we select the row where sexe is female
  select(c(bill_length_mm, bill_depth_mm)) # we select the two columns

2.4 Performing your first PCA

The prcomp and princomp functions are implementations of the PCA methods

data_f_pca <- prcomp(data_f, scale = T)

You can use the str() function to explore the data_f_pca object.

Compare the center and scale slot with the moments of the data_f table

Solution

We display the detail of the data_f_pca object with str()

str(data_f_pca)

List of 5
 $ sdev    : num [1:2] 1.194 0.757
 $ rotation: num [1:2, 1:2] 0.707 -0.707 -0.707 -0.707
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "bill_length_mm" "bill_depth_mm"
  .. ..$ : chr [1:2] "PC1" "PC2"
 $ center  : Named num [1:2] 42.1 16.4
  ..- attr(*, "names")= chr [1:2] "bill_length_mm" "bill_depth_mm"
 $ scale   : Named num [1:2] 4.9 1.8
  ..- attr(*, "names")= chr [1:2] "bill_length_mm" "bill_depth_mm"
 $ x       : num [1:165, 1:2] -0.758 -0.879 -1.91 -1.002 -0.606 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "PC1" "PC2"
 - attr(*, "class")= chr "prcomp"

data_f_pca$center

bill_length_mm  bill_depth_mm 
      42.09697       16.42545

data_f_pca$scale

bill_length_mm  bill_depth_mm 
      4.903476       1.795681

We can compare these values with the summary() of the data_f table

map(data_f, mean)

$bill_length_mm
[1] 42.09697

$bill_depth_mm
[1] 16.42545

map(data_f, sd)

$bill_length_mm
[1] 4.903476

$bill_depth_mm
[1] 1.795681

map apply a function to each element of a list or vector.

The package factoextra provides us with functions to manipulate and plot the output of the prcomp function. The most common usage of the PCA results is to display the individuals on the first factorial plan.

You can do this with the fviz_pca_ind function.

Compare the fviz_pca_ind output with the bill_length_mm and bill_depth_mm scatter plot (you need the species variable)

species_f <- data %>% filter(sex == "female") %>% pull(species)

Solution

fviz_pca_ind(data_f_pca,
             geom = "point",
             col.ind = species_f
             )

We have a rotation of the point coordinates

What are the percentages in the Dim1 and Dim2 axes ?

Solution

71.3 % of total variation is explained by the first principal component
28.7 % of total variation is explained by the second principal component

You can compute these percentages with the following formula

data_f_pca$sdev^2 / sum(data_f_pca$sdev^2)

[1] 0.7131902 0.2868098

You can reproduce the same plot by using the new individual’s coordinates in the data_f_pca x slot.

as_tibble(data_f_pca$x) %>% 
  bind_cols(data %>% filter(sex == "female")) %>% # we add all the other variables to color by species
  ggplot() +
  geom_point(aes(x = PC1, y = PC2, color = species))

2.5 Computing your own PCA

Now that we have access to the data_f_pca results for our two variables, you are going to try to mirror the computation of prcomp step by step as described this morning.

2.5.1 Scaling

The first step is to scale the data. With the mutate() function create a diy_pca tibble with the scaled version of the bill_length_mm and bill_depth_mm variables.

Solution

diy_data_f <- data_f %>% 
  mutate(
    bill_length_mm = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm),
    bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
  )

Explain the importance of the centering and scaling steps of the data

Solution

centering set the empirical mean to 0, if the data are not center the distance matrix is not the covariance matrix
scaling put every variable on the same scale

2.5.2 Data projection

Then we are going to try to manually find the best line on which to project our point (the first axis or first principal component of our PCA).

In the following code, we define a slope (line_slope) to draw a line passing through the origin. We then compute the orthogonal projection of the points on the resulting line.

Play with the line_slope to see how this projection works

line_slope <- 2 # to change

point_projection <- function(line_slope, x, y){
  results <- first_pc_projection_code(line_slope, x, y) # to replace with your code
  return(list(x = results[1], y = results[2]))
}

diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    projection_x = point_projection(
      line_slope = line_slope,
      x = bill_length_mm,
      y = bill_depth_mm)$x,
    projection_y = point_projection(
      line_slope = line_slope,
      x = bill_length_mm,
      y = bill_depth_mm)$y
  )

diy_pca %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_abline(slope = line_slope, color = "red") +
  geom_point(aes(x = projection_x, y = projection_y), color = "red") +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = projection_x,
        yend = projection_y), color = "red", linewidth = 0.1) +
  coord_equal()

Now you are going to replace the

results <- first_pc_projection_code(line_slope, x, y)

with your own code.

For this you need to:

create a vector passing through the origin with the angle line_slope
scale this vector by its length
use the %*% product between the scaled vector and your point coordinate to compute the point projection
create the vector with the right angle and right length

Solution

first_pc_projection_code <- function(line_slope, x, y){
  a <- c(x, y)
  b <- c(1, line_slope)
  scaled_b <- b / c(sqrt(sum(b^2)))
  c(a %*% scaled_b) * scaled_b
}

2.5.3 Evaluation of the projection

We can minimize the distance from each point to the red line. Or we can maximize the distance from each point to the point of origin.

Explain why these two approaches are equivalent.

The following code is the same as before but with the

SS : Sum-square of distances of the projected point to the origin
SR : Sum of residuals, the distances of the point to their projection

Write the formula to compute the S_dist and Residuals variables in the mutate() function

line_slope <- 0.2 
diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    projection_x = point_projection(
      line_slope = line_slope,
      x = bill_length_mm,
      y = bill_depth_mm)$x,
    projection_y = point_projection(
      line_slope = line_slope,
      x = bill_length_mm,
      y = bill_depth_mm)$y,
    S_dist = ,# right formula
    Residuals = # right formula
  )

diy_pca %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_abline(slope = line_slope, color = "red") +
  geom_point(aes(x = projection_x, y = projection_y), color = "red") +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = projection_x,
        yend = projection_y), color = "red", linewidth = 0.1) +
  labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
                     ", SR = ", round(sum(diy_pca$Residuals), 2))) +
  coord_equal()

Solution

S_dist = projection_x^2 + projection_y^2,
Residuals = sqrt((projection_x - bill_length_mm)^2 + (projection_y - bill_depth_mm)^2)

2.5.4 Optimal PCA projection

You have seen this morning that the best line on which to project the point can be found with the covariance matrix of your data.

You can use the cov() function to perform this computation

Solution

diy_cov <- diy_data_f %>% as.matrix() %>% cov()
diy_cov

               bill_length_mm bill_depth_mm
bill_length_mm      1.0000000    -0.4263804
bill_depth_mm      -0.4263804     1.0000000

You can use the eigen() function to get the eigenvalues and vectors of the covariance matrix.

Solution

eigen(diy_cov)

eigen() decomposition
$values
[1] 1.4263804 0.5736196

$vectors
           [,1]       [,2]
[1,] -0.7071068 -0.7071068
[2,]  0.7071068 -0.7071068

To obtain the following figure you will need to write the body of the point_projection function, taking into parameters the diy_cov results instead of the previous line_slope.

Then you will need to compute the slope value for the geom_abline function from the diy_cov results.

point_projection <- function(diy_cov, x, y){
  # your code
  list(x = results[1], y = results[2])
}

diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc1_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$x,
    pc1_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$y,
    S_dist = , # right formula
    Residuals = # right formula
  )

diy_pca %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_abline(slope = , color = "red") + # missing slope value here
  geom_point(aes(x = pc1_x, y = pc1_y), color = "red") +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = pc1_x,
        yend = pc1_y), color = "red", linewidth = 0.1) +
  labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
                     ", SR = ", round(sum(diy_pca$Residuals), 2))) +
  coord_equal()

Solution

For the projection function:

point_projection <- function(diy_cov, x, y){
  a <- c(x, y)
  b <- eigen(diy_cov)$vector[, 1]
  results <- c(a %*% b) * b 
  list(x = results[1], y = results[2])
}

For the slope value:

geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +

Do you have the same results as your neighbors ?

We are now going to plot the second principal component and the projection of the data point on it

Adapt your previous code to perform the computation on the PC2

point_projection <- function(diy_cov, x, y){
  # you corde
  return(list(x = results[1], y = results[2]))
}

diy_pca <- diy_pca %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc2_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$x,
    pc2_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$y,
    S_dist = , # your code
    Residuals = # your code
  )

diy_pca %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_abline(slope = , color = "red") + # slope of the PC1
  geom_abline(slope = , color = "blue") + # slope of the PC2
  geom_point(aes(x = pc2_x, y = pc2_y), color = "blue") +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = pc2_x,
        yend = pc2_y), color = "red", linewidth = 0.1) +
  labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
                     ", SR = ", round(sum(diy_pca$Residuals), 2))) +
  coord_equal()

Solution

For the projection function:

point_projection <- function(diy_cov, x, y){
  a <- c(x, y)
  b <- eigen(diy_cov)$vector[, 2]
  results <- c(a %*% b) * b
  return(list(x = results[1], y = results[2]))
}

For the slope value:

  geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
  geom_abline(slope = eigen(diy_cov)$vector[2, 2] / eigen(diy_cov)$vector[1, 2], color = "blue") +

For the PCA construction we want, a PC2 orthogonal to the PC1.

How many lines are compatible with the orthogonality constraint for 2 variables ? For 3 variables ?

You can merge your previous computation to plot the projection on the 2 first PCs

2.5.5 PCA base

Now that we know how to compute our first 2 PCs, we want to represent the data point in the PCs coordinates.

Modify the following code to plot your data in the PCs space.

point_projection <- function(diy_cov, x, y, PC){
  # your code
}

diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc1_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm,
      PC = 1),
    pc2_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm,
      PC = 2),
  )

diy_pca %>% 
  bind_cols(
    data %>% select(-colnames(diy_pca)[1:2]) %>%  filter(sex == "female")
  ) %>% 
  ggplot() +
  geom_point(aes(x = pc1_x, y = pc2_y, color = species))

Solution

point_projection <- function(diy_cov, x, y, PC){
  a <- c(x, y)
  b <- eigen(diy_cov)$vector[, PC]
  a %*% b
}

2.5.6 Comparison with `prcomp`

In the prcomp output you can directly get the coordinates in PCs space from the $x slot.

diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc1_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm,
      PC = 1),
    pc2_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm,
      PC = 2),
  ) %>% 
  ungroup() %>% 
  mutate(
    pc1_x_ref = data_f_pca$x[,1],
    pc2_y_ref = data_f_pca$x[,2]
  ) %>% 
  bind_cols(
    data %>% select(-colnames(diy_pca)[1:2]) %>%  filter(sex == "female")
  ) %>% 
  ggplot() +
  geom_point(aes(x = pc1_x, y = pc2_y, color = species), alpha = 0.5) +
  geom_point(aes(x = pc1_x_ref, y = pc2_y_ref, color = species), size = 0.5)

What could be the problem when comparing these two PCA results ?

2.6 PCA evaluation

Compute the PCA (data_f_pca) of the bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g for the female penguins

Solution

data_f_pca <- data %>%
  filter(sex == "female") %>% 
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>% 
  prcomp(scale = TRUE)

species_f <- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
             geom = "point",
             col.ind = species_f
             )

What are the main differences with the bill_length_mm and bill_depth_mm PCA ?

2.6.1 Explained inertia

You have seen this morning that there are different criteria to evaluate the PCA projection.

SS(PC1): Eigenvalue for PC1
sqrt(SS(PC1)): Singular value for PC1
SS(PC1)/(n-1): Variation for PC1
Var(PC1)/sum(Var(PCx)): PC1 accounts for x Variation around the PCs

From the data_f_pca try to compute these 4 criteria

Solution

Using the get_eigenvalue function

get_eigenvalue(data_f_pca)

      eigenvalue variance.percent cumulative.variance.percent
Dim.1  3.0313927        75.784817                    75.78482
Dim.2  0.6138918        15.347295                    91.13211
Dim.3  0.2444716         6.111790                    97.24390
Dim.4  0.1102439         2.756099                   100.00000

using the data_f_pca slot

data_f_pca$sdev^2

[1] 3.0313927 0.6138918 0.2444716 0.1102439

data %>%
  filter(sex == "female") %>% 
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>% 
  scale() %>% 
  cov() %>% 
  eigen()

eigen() decomposition
$values
[1] 3.0313927 0.6138918 0.2444716 0.1102439

$vectors
           [,1]       [,2]       [,3]         [,4]
[1,]  0.4109840  0.8780384  0.2450936  0.008359691
[2,] -0.4966517  0.4387233 -0.7337963 -0.149672643
[3,]  0.5438179 -0.1611183 -0.3086541 -0.763567728
[4,]  0.5373001 -0.1030113 -0.5533577  0.628086412

(pc_var <- data_f_pca$sdev^2 / (ncol(data_f_pca$x) - 1))

[1] 1.01046422 0.20463059 0.08149053 0.03674798

pc_var / sum(pc_var)

[1] 0.75784817 0.15347295 0.06111790 0.02756099

Why are these metric important when analyzing PCA outputs ?

2.6.2 Scree plot

The fviz_eig function create a scree plot of your PCA.

From your previous code, create your own scree plot

Solution

tibble(
  pc = 1:4,
  var = pc_var / sum(pc_var)
) %>% 
  ggplot() +
  geom_point(aes(x = pc, y = var))

2.6.3 Variable contribution

The axis of the PCA are linear combinations of the variable axis.

What does it mean to find a slope of 0.5 for PC1 in the bill_length_mm, bill_depth_mm plot ?

Solution

It means that if bill_depth_mm contribute for 1 to PC1, bill_length_mm contribute for 0.5

As the number of variables increases, so is the complexity of the linear combinations for each PC. We can represent the variable axis in the new PCA axis, this representation is called the correlation circle.

With the fviz_pca_var() function, which are the variables that contribute the most to PC1 ? To PC2 ? To both ?

Use the str() function to find this information in the data_f_pca object

Finally, we can use the fviz_pca_biplot function to display the individuals and variable information on the same plot.

Remember that we scaled all the variables before computing the PCA. What are the results of the scaling in terms of variable unit contribution ?

2.6.4 Quality of the projection

You have seen this morning different metrics to check the quality of the representation for individuals and for variables.

The get_pca_var() and get_pca_ind() are two symmetrical functions to get relevant information for the individuals and the variables.

Explore the results of these two functions.

What is the variable contributing the most to PC4 ?

How do you interpret the cos2 slot

Solution

# results per variables
res_var <- get_pca_var(data_f_pca)
res_var$coord          # coordinates

                       Dim.1       Dim.2      Dim.3        Dim.4
bill_length_mm     0.7155599  0.68795402 -0.1211843 -0.002775669
bill_depth_mm     -0.8647150  0.34374518  0.3628188  0.049695814
flipper_length_mm  0.9468356 -0.12623822  0.1526111  0.253527423
body_mass_g        0.9354876 -0.08071066  0.2736026 -0.208543556

res_var$contrib        # contribution to the axes

                     Dim.1     Dim.2     Dim.3        Dim.4
bill_length_mm    16.89078 77.095140  6.007089  0.006988444
bill_depth_mm     24.66629 19.247815 53.845705  2.240189999
flipper_length_mm 29.57379  2.595912  9.526734 58.303567516
body_mass_g       28.86914  1.061133 30.620472 39.449254041

res_var$cos2           # quality of the representation

                      Dim.1      Dim.2      Dim.3        Dim.4
bill_length_mm    0.5120259 0.47328073 0.01468563 7.704336e-06
bill_depth_mm     0.7477321 0.11816075 0.13163746 2.469674e-03
flipper_length_mm 0.8964976 0.01593609 0.02329016 6.427615e-02
body_mass_g       0.8751370 0.00651421 0.07485836 4.349041e-02

Modify the following code to display the individuals that contribute the most to PC1 and PC2. What can you say about these individuals ?

# results per individuals
res_ind <- get_pca_ind(data_f_pca)
species_f <- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
             geom = "point",
             col.ind = (res_ind$contrib[, 3])
             )

With the same modifications for the following code, which species are better represented by PC1 and which by PC2 ?

# results per individuals
res_ind <- get_pca_ind(data_f_pca)
species_f <- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
             geom = "point",
             col.ind = (res_ind$cos2[, 3])
             )

2.7 Adding new individuals

We can compute the coordinates of new individuals as follows:

scale the new individual variables using the scaling information from the PCA
Compute the coordinates by multipling by the eigenvalues of the principal components.

data_m_scale <- data %>%
  filter(sex == "male") %>%
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>% 
   scale(
      center = data_f_pca$center,
      scale = data_f_pca$scale
    )

coord_func <- function(ind, loadings){
  r <- loadings * ind
  apply(r, 2, sum)
}
data_m_pca <- t(apply(data_m_scale, 1, coord_func, data_f_pca$rotation ))

as_tibble(data_f_pca$x) %>% 
  bind_cols(
    data %>% filter(sex == "female")
  ) %>% 
  bind_rows(
    as_tibble(data_m_pca) %>% 
    bind_cols(
      data %>% filter(sex == "male")
    )
  ) %>% 
  ggplot() +
  geom_point(aes(x = PC1, y = PC2, color = species, shape = sex))

2.8 Comparison with SVD decomposition

Compare the results of your data_f_pca to the one obtained with the following code.

data_f_svd <- data %>%
  filter(sex == "female") %>% 
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>% 
  mutate(
    bill_length_mm = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm),
    bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
    flipper_length_mm = (flipper_length_mm - mean(flipper_length_mm)) / sd(flipper_length_mm),
    body_mass_g = (body_mass_g - mean(body_mass_g)) / sd(body_mass_g),
  ) %>% 
  svd()

The .Rmd file corresponding to this page is available here under the AGPL3 Licence

2.1 Introduction

2.2 Loading the libraries

2.3 loading the data

2.4 Performing your first PCA

2.5 Computing your own PCA

2.5.1 Scaling

2.5.2 Data projection

2.5.3 Evaluation of the projection

2.5.4 Optimal PCA projection

2.5.5 PCA base

2.5.6 Comparison with prcomp

2.6 PCA evaluation

2.6.1 Explained inertia

2.6.2 Scree plot

2.6.3 Variable contribution

2.6.4 Quality of the projection

2.7 Adding new individuals

2.8 Comparison with SVD decomposition

2.5.6 Comparison with `prcomp`