Principal Component Analysis (PCA) is a method for finding a low-dimensional representation of a dataset that retains as much of the original variation as possible. In data science it is commonly used to reduce the number of input features so that a model trains faster (this is especially helpful for neural networks, which can take a long time to train). Another use case is visualizing clusters in data with n features, which cannot be plotted directly when n is larger than three. In PCA, each of the new dimensions is a linear combination of the original n features.
Dimensionality reduction methods such as PCA focus on reducing the feature space, allowing most of the information or variability in the dataset to be explained with fewer features. The idea behind PCA is that the newly generated features must not be correlated with one another: two highly correlated features carry largely the same information, so keeping both is redundant.
The PCA method was first published in 1901 (Pearson 1901) and has been a staple procedure for dimension reduction for decades. PCA examines the covariance among features and combines multiple features into a smaller set of uncorrelated variables. These new features, which are weighted combinations of the original predictor set, are called principal components (PCs), and ideally a small subset of them explains most of the variability of the full feature set. The weights used to form the PCs reveal the relative contributions of the original features to the new PCs.
What do Principal Components Mean?
Principal components are new features generated as linear combinations of all the original features, where each feature is multiplied by a weight taken from a loading vector ϕ. Beyond maximizing variance, each component has to satisfy only one constraint: it must be uncorrelated with all the previous components.
The first principal component of a set of features X1, X2, …, Xp is the linear combination of the features Z1 = ϕ11X1 + ϕ21X2 + … + ϕp1Xp that has the largest variance, where the loadings are normalized so that ϕ11² + ϕ21² + … + ϕp1² = 1 (otherwise the variance could be made arbitrarily large simply by scaling the loadings). Here ϕ1 = (ϕ11, ϕ21, …, ϕp1) is the loading vector of the first principal component.
The second principal component is the linear combination Z2 = ϕ12X1 + ϕ22X2 + … + ϕp2Xp that has the maximum variance out of all linear combinations of X1, X2, …, Xp that are uncorrelated with Z1. Here ϕ2 = (ϕ12, ϕ22, …, ϕp2) is the loading vector of the second principal component. This process proceeds until all p principal components are computed.
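The definitions above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic dataset invented purely for illustration; it obtains the loading vectors as eigenvectors of the sample covariance matrix, which is one standard way to compute them, and the variable names are not from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 observations of p = 3 features,
# where the first two features are strongly correlated.
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center each feature so the covariance describes variation around the mean.
X_centered = X - X.mean(axis=0)

# Eigen-decomposition of the sample covariance matrix.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder to descending
# so that column 0 corresponds to the first principal component.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
loadings = eigenvectors[:, order]  # column j is the loading vector of PC j+1

# Scores: column j holds the values of principal component Z_{j+1}.
Z = X_centered @ loadings

# Covariance of the scores: diagonal entries are the component variances,
# off-diagonal entries should be numerically zero (uncorrelated components).
score_cov = np.cov(Z, rowvar=False)
print(np.round(score_cov, 6))
```

In this sketch each column of `loadings` has unit length, the variances on the diagonal of `score_cov` decrease from the first component to the last, and the off-diagonal entries vanish up to floating-point error, matching the requirement that each component be uncorrelated with the previous ones.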
Performing PCA in Python
Performing PCA is made easy by the scikit-learn (sklearn) library, which provides a PCA class in the sklearn.decomposition submodule.
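A minimal sketch of this workflow is shown below. It assumes the Iris dataset bundled with scikit-learn as a stand-in for your own data, and it standardizes the features first, which is a common choice when features are on different scales rather than something PCA strictly requires. The choice of two components is arbitrary and for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example data (an assumption for this sketch): 150 samples, 4 features.
X = load_iris().data

# Standardize so that each feature has zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Keep the first two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                 # (150, 2)
print(pca.components_.shape)           # (2, 4): one loading vector per row
print(pca.explained_variance_ratio_)   # share of variance per component
```

Each row of `pca.components_` is the loading vector of one principal component, and `pca.explained_variance_ratio_` reports the fraction of the total variance each retained component accounts for.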
Selecting the Number of Principal Components
The primary goal of PCA is dimension reduction (in this case, feature reduction). In essence, we want to come out of PCA with fewer components than original features, with the caveat that these components should explain as much of the variation in our data as possible.
But how do we decide how many PCs to keep? Do we keep the first 10, 20, or 40 PCs?
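At a very basic level, we can loop over the components, record the eigenvalue associated with each one (the variance that component explains), and plot the eigenvalues against the component number. The resulting plot, often called a scree plot, typically has an elbow-like shape: the eigenvalues drop steeply at first and then flatten out. The small set of components before the point where the curve starts to flatten is usually the one worth keeping.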
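The eigenvalue-based selection described here can be sketched as follows. This is an illustrative example using NumPy on synthetic data with three underlying factors; the data, the 95% cumulative-variance cutoff, and the variable names are all assumptions made for the sketch, not a prescribed rule.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 10 observed features driven by 3 latent factors plus noise.
n_samples, n_features, n_factors = 300, 10, 3
factors = rng.normal(size=(n_samples, n_factors))
mixing = rng.normal(size=(n_factors, n_features))
X = factors @ mixing + 0.1 * rng.normal(size=(n_samples, n_features))

# Eigenvalues of the covariance matrix, sorted from largest to smallest.
X_centered = X - X.mean(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))[::-1]

# Fraction of variance explained by each component, and the running total.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

for k, (ev, cum) in enumerate(zip(eigenvalues, cumulative), start=1):
    print(f"PC{k}: eigenvalue={ev:.3f}, cumulative variance={cum:.3f}")

# Assumed rule of thumb: keep the fewest components reaching the threshold.
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print("components to keep:", n_keep)
```

Plotting `eigenvalues` against the component number (for example with matplotlib) gives the elbow-shaped curve. A cumulative-variance threshold is a common complement to reading the elbow by eye, but the cutoff itself is a judgment call.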
Weakness of PCA
PCA’s main weakness is that it tends to be highly affected by outliers in the data. For this reason, many robust variants of PCA have been developed, many of which act to iteratively discard data points that are poorly described by the initial components. Scikit-Learn contains a couple of interesting variants on PCA, including a randomized solver and SparsePCA, both in the sklearn.decomposition submodule. The randomized solver, which older releases exposed as a separate RandomizedPCA class and current releases select with PCA(svd_solver='randomized'), uses a nondeterministic method to quickly approximate the first few principal components in very high-dimensional data, while SparsePCA introduces a regularization term that serves to enforce sparsity of the components.
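The sensitivity to outliers can be illustrated with a small experiment. The sketch below is an assumption-laden toy example using NumPy: the data are synthetic, `first_loading` is a hypothetical helper written for this illustration, and the single extreme point is deliberately exaggerated to make the effect visible.

```python
import numpy as np


def first_loading(X):
    """Return the unit-norm loading vector of the first principal component."""
    X_centered = X - X.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
    return eigenvectors[:, np.argmax(eigenvalues)]


rng = np.random.default_rng(1)

# Hypothetical clean data: most of the spread lies along the first feature.
clean = rng.normal(size=(200, 2)) * np.array([3.0, 0.3])

# The same data plus one extreme observation along the second feature.
contaminated = np.vstack([clean, [[0.0, 100.0]]])

v_clean = first_loading(clean)
v_outlier = first_loading(contaminated)

# The sign of an eigenvector is arbitrary, so compare directions via |cos|.
cos_angle = abs(v_clean @ v_outlier)
angle_deg = np.degrees(np.arccos(np.clip(cos_angle, 0.0, 1.0)))

print("first PC direction (clean):       ", np.round(v_clean, 3))
print("first PC direction (with outlier):", np.round(v_outlier, 3))
print("angle between them in degrees:    ", round(float(angle_deg), 1))
```

In this setup a single point out of 201 is enough to swing the first principal component from the first feature's axis to the second feature's axis, which is the kind of distortion robust PCA variants are designed to limit.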