Classic methods for selecting PCs
Recall that the eigenvalue of a PC represents the amount of variance in the original data “explained” by that PC, and that maximizing explained variance matters because it preserves the most “information” about the original data. As such, one of the simplest techniques for selecting a subset of PCs is to simply keep the first k components with the largest eigenvalues, where k is chosen without any formal criterion. There is no deeper rationale here beyond the fact that these are the PCs that explain the most variance in the data.
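To make this concrete, here is a minimal sketch in Python using NumPy. The data matrix, variable names, and the choice of k = 2 are purely illustrative, with randomly generated data standing in for a real dataset:

```python
import numpy as np

# Illustrative data: 100 observations of 5 variables (stand-in for real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Standardize, then eigendecompose the covariance matrix
# (for standardized data this equals the correlation matrix).
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_std, rowvar=False))

# eigh returns eigenvalues in ascending order, so sort them descending.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2  # chosen without a formal criterion
scores = X_std @ eigenvectors[:, :k]  # data projected onto the first k PCs
print(eigenvalues[:k])
```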
Stepping up just a bit in complexity, the next selection method involves retaining all of the PCs with eigenvalues greater than 1. This is sometimes called the “Kaiser rule”, the “Kaiser criterion”, or the “Kaiser-Guttman rule”. The motivation is that, with standardized data, the variance of each of the original variables is equal to 1. Thus, a PC with an eigenvalue greater than 1 explains more variance than any single variable in the original data. This logic is sound, but it fails to account for the fact that even with purely random data (noise), PCA will produce components with eigenvalues greater than 1. In these situations, the variance explained by those components is not actually useful because it is simply variance due to random error or noise. Parallel analysis uses repeated data simulation to overcome this challenge.
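As a quick illustration, applying the Kaiser rule only requires comparing each eigenvalue against 1. The eigenvalues below are made up for a hypothetical five-variable, standardized dataset:

```python
import numpy as np

# Hypothetical eigenvalues from a standardized 5-variable dataset
# (for standardized data they sum to 5, the number of variables).
eigenvalues = np.array([2.31, 1.42, 0.71, 0.38, 0.18])

kaiser_mask = eigenvalues > 1.0  # Kaiser rule: keep eigenvalues greater than 1
print(f"Kaiser rule retains {kaiser_mask.sum()} of {eigenvalues.size} PCs")
# -> Kaiser rule retains 2 of 5 PCs
```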
The third classic method for selecting PCs involves looking at the percentage of the total variance explained by each component. The eigenvalue of a PC represents the amount of variance explained by that component, and the total variance in the data is given by the sum of the eigenvalues across all PCs. Thus, the percentage of variance that each component explains can be calculated by dividing its eigenvalue by the sum of all eigenvalues. In math terms:
% Explained variance for PCn = [(eigenvalue of PCn)/(sum of all eigenvalues)]*100
In the example used elsewhere in this guide, we had a total of two PCs, with eigenvalues of 1.902 for PC1 and 0.098 for PC2. Using this formula, the percent of explained variance for PC1 and PC2 is 95.11% and 4.89%, respectively (cumulatively, these two components explain 100% of the total variance). By setting a predetermined threshold (typically 75% or 80% of total explained variance), the first k PCs that cumulatively explain at least this much of the variance can be selected as the subset of components. However, like the other classic methods, this selection method cannot account for variance in the data that is likely due to random error or noise.
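As a short sketch of this calculation, here is the formula applied to the two eigenvalues from the guide’s example, followed by a cumulative-threshold selection (the 80% cutoff is an illustrative choice):

```python
import numpy as np

# Eigenvalues from the two-PC example in this guide.
eigenvalues = np.array([1.902, 0.098])

pct_explained = eigenvalues / eigenvalues.sum() * 100
cumulative = np.cumsum(pct_explained)
print(pct_explained)  # ~[95.1  4.9], matching the percentages above
print(cumulative)     # ~[95.1 100.0]

# Keep the first k PCs whose cumulative explained variance meets the threshold.
threshold = 80.0  # a common, but arbitrary, cutoff
k = int(np.searchsorted(cumulative, threshold)) + 1
print(k)  # -> 1: PC1 alone already clears the 80% threshold
```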