Principal component analysis (PCA) is a technique for bringing out strong patterns in a dataset by emphasizing the directions of greatest variation and suppressing minor variations. It is often used to simplify data sets so that they are easier to explore and analyse.
A data set usually has several features, which we call the dimensions of the data set. In most cases, we want to work with only a small subset of those dimensions, for reasons such as:
- Avoiding overfitting (a model that fits the training data very closely but performs worse on new data)
- Other features/dimensions are not related to the study at hand
- Reducing the size of the data set
There are many ways to achieve dimensionality reduction which fall under two broad categories:
- Feature Elimination
- Feature Extraction
Feature elimination reduces the feature space by dropping features outright. The disadvantage is that we lose all the information carried by the dropped features.
Feature extraction mitigates this problem. In feature extraction, we use the old features to create new features, each of which is a combination of the old ones. We then order the new features by importance and drop the least important ones. Because every new feature is a combination of the old features, the ones we keep still carry information from each old feature, although whatever the dropped new features captured is lost.
Principal component analysis is a feature extraction technique: it combines our input variables in a specific way, so that we can drop the “least important” new variables while still retaining the most valuable parts of all of the original variables.
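The difference between the two approaches can be made concrete with a small sketch. It is not a reference implementation: the data is randomly generated, and the helper names `pca_fit`, `pca_transform` and `pca_reconstruct` are made up for this illustration. It computes the principal components with NumPy's singular value decomposition, keeps two of five dimensions, and compares how much of the data is lost against simply keeping the two highest-variance original columns.

```python
import numpy as np

def pca_fit(X, k):
    """Return the column means and the top-k principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def pca_transform(X, mean, components):
    """Project X onto the principal directions (the new features)."""
    return (X - mean) @ components.T

def pca_reconstruct(Z, mean, components):
    """Map the new features back into the original feature space."""
    return Z @ components + mean

# Hypothetical data: 200 samples, 5 correlated features driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

k = 2
mean, components = pca_fit(X, k)
Z = pca_transform(X, mean, components)          # feature extraction: 5 -> 2 features
X_pca = pca_reconstruct(Z, mean, components)

# Feature elimination: keep the k highest-variance columns,
# and replace every dropped column with its mean.
keep = np.argsort(X.var(axis=0))[-k:]
X_elim = np.tile(mean, (X.shape[0], 1))
X_elim[:, keep] = X[:, keep]

err_pca = np.linalg.norm(X - X_pca) ** 2
err_elim = np.linalg.norm(X - X_elim) ** 2
print(f"squared reconstruction error, PCA:         {err_pca:.2f}")
print(f"squared reconstruction error, elimination: {err_elim:.2f}")
```

For the same number of kept dimensions, the PCA error is never larger than the elimination error, because projecting onto the top principal directions gives the best low-rank approximation of the centered data.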
17-dimensional Example
Consider a data set with 17 dimensions that describes the eating habits of four countries, namely England, Northern Ireland, Scotland and Wales.
We can visualize the data set as follows:
When we apply principal component analysis to the above dataset, we get a set of new features, which we call principal components. Let us consider the top two principal components, denoted by PC1 and PC2.
Let us visualize the data set along PC1:
Now, let us visualize the data set along both PC1 and PC2:
We can see that Northern Ireland is an outlier, that is, it does not follow the pattern of the other three data points (England, Scotland and Wales). We can now either exclude Northern Ireland or treat it as a separate case.
If we look closely at the data, we can see that Northern Ireland consumes more fresh potatoes and less fresh fruit, cheese, fish and alcoholic drinks than the other three countries.
Hence, principal component analysis has helped us spot Northern Ireland as an outlier using just two features, each of which is a combination of the original 17 features.
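The same workflow can be sketched in a few lines of NumPy. Note that the numbers below are synthetic placeholders, not the real food consumption figures: four rows of 17 values are generated to be similar, and one row is deliberately shifted in a few columns so that it plays the role of the outlier. The point is only to show how the scores on PC1 and PC2 are obtained and read.

```python
import numpy as np

countries = ["England", "Northern Ireland", "Scotland", "Wales"]
rng = np.random.default_rng(42)

# Synthetic stand-in for the 17 food categories (NOT the real survey data).
base = rng.uniform(100, 1000, size=17)
X = base + rng.normal(scale=10, size=(4, 17))      # four similar rows
# By construction, make the second row differ strongly in five columns.
X[1, :5] += np.array([300, -250, -200, -150, -100])

# PCA via SVD of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                       # rows: countries, columns: PC1, PC2, ...
explained = S**2 / np.sum(S**2)      # share of total variance per component

for name, (pc1, pc2) in zip(countries, scores[:, :2]):
    print(f"{name:17s} PC1={pc1:9.1f}  PC2={pc2:9.1f}")
print("share of variance explained by PC1 and PC2:", explained[:2].round(3))

# The row that sits farthest from the others along PC1.
outlier = countries[int(np.argmax(np.abs(scores[:, 0])))]
print("farthest along PC1:", outlier)
```

With only four data points there are at most three components with non-zero variance, so two components can already capture most of the structure in a data set this small.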
When to use PCA?
If your answer is yes to the following three questions, then you should definitely try out Principal Component Analysis:
- Do you want to reduce the number of variables, but aren’t able to identify variables to completely remove from consideration?
- Do you want to ensure your variables are uncorrelated with one another?
- Are you comfortable making your independent variables less interpretable?
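The last two questions can be checked numerically. The sketch below uses made-up, strongly correlated data (an assumption for illustration only). It shows that the principal component scores have essentially zero covariance with one another, which is the sense in which PCA decorrelates variables (linear uncorrelatedness, which is weaker than full statistical independence), and that each component is a weighted mix of all original features, which is why it is harder to interpret.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: three features that all depend on the same hidden variable t.
t = rng.normal(size=500)
X = np.column_stack([
    t + 0.1 * rng.normal(size=500),
    2 * t + 0.1 * rng.normal(size=500),
    -t + 0.1 * rng.normal(size=500),
])

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                      # scores on all principal components

# Correlation between the original features (off-diagonal entries only).
corr_before = np.corrcoef(X, rowvar=False)
max_corr_before = np.abs(corr_before - np.eye(3)).max()

# Covariance between the principal component scores (off-diagonal entries only).
cov_after = np.cov(Z, rowvar=False)
off_diag = cov_after - np.diag(np.diag(cov_after))
max_cov_after = np.abs(off_diag).max()

print("max |correlation| between original features:", round(float(max_corr_before), 3))
print("max |covariance| between components:", max_cov_after)
# Each loading is the weight of one original feature in PC1.
print("PC1 loadings:", Vt[0].round(3))
```

Every original feature receives a non-zero weight in PC1, so PC1 cannot be read as "feature 1" or "feature 2"; that is the loss of interpretability the third question asks about.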