A quick introduction to dimension reduction, covering a few widely used techniques: linear discriminant analysis, principal component analysis, kernel principal component analysis, and more.

**Why dimension reduction in machine learning?**

We have access to large amounts of data. Often, when we gather data for a machine learning project, we end up with a large set of features. Some of these features are less important than others, and some are correlated with each other. A large number of features can cause several problems:

- Too many features can cause the model to over-fit.
- A large number of features makes the data set sparse.
- It takes a much larger space to store a data set with a large number of features.
- It can get very difficult to analyze and visualize a data set with a large number of dimensions.

Dimension reduction techniques reduce the time required to train a machine learning model, and they can also help reduce over-fitting. It is crucial for every data scientist and machine learning engineer to understand what dimension reduction techniques are and when to use them. We are all familiar with data compression: if a file is too large to send, we can zip it. Dimension reduction works on the same principle. It compresses a large set of features onto a new feature subspace of lower dimension while keeping most of the important information.

**What are dimension reduction techniques?**

Two of the most widely used algorithms for dimensionality reduction are Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA). The basic difference between them is that LDA uses class labels to find new features that maximize class separability, while PCA ignores class labels and finds new features that capture the maximum variance in the data. In this sense, LDA can be considered a supervised algorithm and PCA an unsupervised one. Let’s look more closely at both techniques:

- Linear Discriminant Analysis (LDA):

LDA is used for compressing supervised (labeled) data. When we have a large set of features, the data in each class is approximately normally distributed, and the features are not strongly correlated with each other, we can use LDA to reduce the number of dimensions. LDA is a generalized version of Fisher’s linear discriminant. *Compute z-scores to standardize features that are on very different scales.*
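To make the idea of "maximizing class separability" concrete, here is a minimal NumPy sketch of LDA as a projection. It is one common textbook formulation (within-class and between-class scatter matrices), not necessarily what any particular library does internally. The function name `lda`, the toy data, and the class means are all assumptions made for illustration.

```python
import numpy as np

def lda(X, y, n_components):
    """Project X (n_samples, n_features) onto directions that separate the classes in y."""
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((n_features, n_features))  # within-class scatter
    S_b = np.zeros((n_features, n_features))  # between-class scatter
    for c in np.unique(y):
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_w += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += X_c.shape[0] * (diff @ diff.T)
    # Directions with a large between-class to within-class scatter ratio
    # are the leading eigenvectors of pinv(S_w) @ S_b.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    W = eigvecs[:, order].real
    return X @ W, W

# Hypothetical toy data: three classes in four dimensions, 50 samples each.
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [0, 5, 5, 0]], dtype=float)
y = np.repeat([0, 1, 2], 50)
X = means[y] + rng.normal(size=(150, 4))

Z, W = lda(X, y, n_components=2)
print(Z.shape)  # four features reduced to two
```

With this formulation, at most (number of classes - 1) directions carry class information, which is why three classes are reduced to two components here.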

- Principal component analysis (PCA):

PCA is mainly used for compressing unsupervised (unlabeled) data. It can help de-noise data and detect patterns in it. PCA is used to reduce dimensions in images, text, and speech recognition systems.

1. PCA analyses the entire data set and finds the directions of maximum variance.

2. It creates new variables that are linear combinations of the original variables, chosen so that the variance is maximized.

3. To do this, a covariance matrix is computed for the features to capture how they vary together (their multi-collinearity).

4. Once the variance-covariance matrix is computed, PCA uses it to reduce the dimensions. It computes orthogonal axes (the eigenvectors of the covariance matrix) from the original feature axes. These are the directions of maximum variance, and keeping only the top few of them gives the lower-dimensional representation.
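The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of covariance-based PCA, assuming the features are already on comparable scales. The function name `pca` and the toy data are hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    # Centre each feature so the covariance is measured about the mean.
    X_centered = X - X.mean(axis=0)
    # Variance-covariance matrix of the features (features are columns).
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]              # orthogonal axes of maximum variance
    explained = eigvals[order] / eigvals.sum()  # share of total variance per axis
    return X_centered @ components, components, explained

# Hypothetical toy data: the second feature is almost a multiple of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

Z, components, explained = pca(X, n_components=2)
print(Z.shape)    # three features reduced to two
print(explained)  # fraction of variance captured by each kept component
```

Because two of the three toy features are strongly correlated, two components should retain nearly all of the variance, which is the situation in which PCA compresses well.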

- Kernel principal component analysis (KPCA):

Kernel PCA is used for nonlinear dimensionality reduction. When the relationships between features are non-linear, we can project the data onto a larger feature set in which those relationships become linear. Essentially, the non-linear data is mapped onto a higher-dimensional space, and PCA is then used to reduce the dimensions. One downside of this approach is that it is computationally very expensive.

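Just like in PCA, we first compute a variance-covariance matrix (in practice, a kernel matrix plays this role for the higher-dimensional space) and then its eigenvectors and eigenvalues. The eigenvectors with the largest eigenvalues, which capture the highest variance, are used to compute the principal components.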
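As a rough sketch of how this can look in code, the example below implements kernel PCA with an RBF (Gaussian) kernel in NumPy. The choice of kernel, the `gamma` value, the function name `rbf_kernel_pca`, and the concentric-circles toy data are all assumptions made for illustration. The n-by-n kernel matrix it builds is also where the computational cost comes from as the number of samples grows.

```python
import numpy as np

def rbf_kernel_pca(X, gamma, n_components):
    """Return the kernel-PCA projection of the training points in X."""
    # Pairwise squared Euclidean distances between all samples.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    # RBF kernel: similarities in an implicit higher-dimensional space.
    K = np.exp(-gamma * sq_dists)
    # Centre the kernel matrix, the analogue of mean-centring in ordinary PCA.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigen-decomposition of the symmetric centred kernel matrix.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Projected training points: eigenvectors scaled by sqrt(eigenvalue).
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

# Hypothetical toy data: two concentric circles, which no straight line can separate.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=100)
radii = np.repeat([1.0, 3.0], 50)
X = radii[:, None] * np.column_stack([np.cos(angles), np.sin(angles)])

Z = rbf_kernel_pca(X, gamma=0.5, n_components=2)
print(Z.shape)
```

How well the projected data separates depends heavily on the kernel and on `gamma`, so in practice these are usually tuned rather than fixed up front.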
