Introduction to Projection Pursuit

Projection pursuit is a statistical technique for finding the most "interesting" projections of multidimensional data. Generally, projections that deviate more from a normal distribution are considered more interesting.

Joseph B. Kruskal first proposed the concept of projection pursuit. Jerome H. Friedman and John W. Tukey later implemented it successfully.

Years later, Friedman and Stuetzle extended the idea behind projection pursuit and added projection pursuit regression (PPR), projection pursuit classification (PPC), and projection pursuit density estimation (PPDE).

Purpose

Projection pursuit is a dimensionality reduction technique: it finds lower-dimensional projections of high-dimensional data. Its main purpose is to search for non-linear patterns in the data and to find projections that highlight these patterns and reveal hidden structures and relationships.

Projection Index

The projection index measures the degree of non-linearity in a projection. It defines the quality of a projection and lets us choose the projection that scores best. Whether the optimum corresponds to a higher or a lower value of the index depends on the goal. When the index measures variance, separability of classes, or cluster compactness, a higher value is considered better. Conversely, when the index measures within-cluster variance or projection error, a lower value is preferred. In practice, the criterion often balances the maximization of some properties against the minimization of others. The projection index is typically a non-linear, non-convex function.
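
As a rough illustration of what a projection index can look like, the sketch below scores a one-dimensional projection by its absolute excess kurtosis, which is close to zero for normally distributed data. This particular index, the function name kurtosis_index, and the synthetic data are assumptions chosen for the example; they are not the index used in the Method section.

```python
import numpy as np

def kurtosis_index(X, w):
    """Hypothetical index: absolute excess kurtosis of the data projected onto w."""
    w = w / np.linalg.norm(w)          # only the direction of w matters
    z = X @ w                          # one-dimensional projection
    z = (z - z.mean()) / z.std()       # standardize the projected values
    return abs(np.mean(z ** 4) - 3.0)  # roughly 0 for a normal distribution

rng = np.random.default_rng(0)
n = 2000
# Column 0 holds two clusters (non-normal); column 1 is plain Gaussian noise.
two_clusters = np.concatenate([rng.normal(-3, 1, n // 2), rng.normal(3, 1, n // 2)])
X = np.column_stack([two_clusters, rng.normal(0, 1, n)])

index_clusters = kurtosis_index(X, np.array([1.0, 0.0]))
index_gaussian = kurtosis_index(X, np.array([0.0, 1.0]))
print(index_clusters, index_gaussian)
```

With this index a higher value is better, so the direction that separates the two clusters should score clearly above the Gaussian direction.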

Method

Consider a two-dimensional data set with two features, x1 and x2, and let f be the projection of the data onto a one-dimensional space.

Step-1: Projection_Index = sum of the absolute 2nd-order partial derivatives of the projection with respect to x1 and x2
        = |∂²f/∂x1²| + |∂²f/∂x1∂x2| + |∂²f/∂x2∂x1| + |∂²f/∂x2²|
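
Taking the Step-1 formula at face value, the index can be approximated numerically when f is sampled on a grid. The sketch below is only an illustration: the function f = x1² + x1·x2 and the grid are hypothetical choices, and the derivatives are finite-difference estimates from numpy.gradient. For this f the analytic value is 2 + 1 + 1 + 0 = 4 at every point.

```python
import numpy as np

# Hypothetical grid and projection function, chosen only for illustration.
x1 = np.linspace(-1.0, 1.0, 41)
x2 = np.linspace(-1.0, 1.0, 41)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
f = X1 ** 2 + X1 * X2

# First-order partial derivatives (axis 0 is x1, axis 1 is x2).
f_1 = np.gradient(f, x1, axis=0, edge_order=2)
f_2 = np.gradient(f, x2, axis=1, edge_order=2)

# Second-order partial derivatives, including both mixed terms.
f_11 = np.gradient(f_1, x1, axis=0, edge_order=2)
f_12 = np.gradient(f_1, x2, axis=1, edge_order=2)
f_21 = np.gradient(f_2, x1, axis=0, edge_order=2)
f_22 = np.gradient(f_2, x2, axis=1, edge_order=2)

# Step-1: sum of the absolute second-order partial derivatives.
projection_index = np.abs(f_11) + np.abs(f_12) + np.abs(f_21) + np.abs(f_22)
print(projection_index[20, 20])
```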

Gradient-based optimization is one way to find the projection with the maximum projection index. Because the index is being maximized, the update follows the gradient uphill (gradient ascent, which is equivalent to gradient descent on the negated index). To implement it, we first compute the gradient of the projection index with respect to the projection f.

Step-2: ∇f projection_index = ∂/∂f (Projection_Index)
        = ∂/∂f (|∂²f/∂x1²| + |∂²f/∂x1∂x2| + |∂²f/∂x2∂x1| + |∂²f/∂x2²|)

If we plot the projection index on the z-axis against x1 on the x-axis and x2 on the y-axis, the gradient of the projection index with respect to the projection is a vector that points in the direction of the steepest slope, and its magnitude is the value of the slope in that direction.

We then update the projection by moving it in the direction of the gradient with a small step size, η, in each iteration.

Step-3: f = f + η * ∇f projection_index

These steps are repeated in every iteration until the maximum projection index is found or a predefined convergence criterion is met.
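
The three steps can be sketched as a small optimization loop. The version below is a minimal sketch under several assumptions: the projection f is parametrized by a unit direction vector w (so the projected values are X @ w), the index is the absolute excess kurtosis rather than the second-derivative index above, and the gradient is estimated by finite differences. The names kurtosis_index, numerical_gradient and pursue are hypothetical.

```python
import numpy as np

def kurtosis_index(X, w):
    """Stand-in projection index: absolute excess kurtosis of X @ w."""
    z = X @ (w / np.linalg.norm(w))
    z = (z - z.mean()) / z.std()
    return abs(np.mean(z ** 4) - 3.0)

def numerical_gradient(X, w, eps=1e-5):
    """Step-2: central-difference estimate of the gradient of the index at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (kurtosis_index(X, w + step) - kurtosis_index(X, w - step)) / (2 * eps)
    return grad

def pursue(X, w0, eta=0.1, n_iter=300, tol=1e-8):
    """Step-3: move along the gradient, renormalize, stop when w stops changing."""
    w = w0 / np.linalg.norm(w0)
    for _ in range(n_iter):
        w_new = w + eta * numerical_gradient(X, w)
        w_new = w_new / np.linalg.norm(w_new)
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    return w

# Synthetic data: two clusters along a direction at 30 degrees, noise across it.
rng = np.random.default_rng(0)
n = 5000
true_dir = np.array([np.cos(np.pi / 6), np.sin(np.pi / 6)])
orth_dir = np.array([-true_dir[1], true_dir[0]])
clusters = np.concatenate([rng.normal(-3, 1, n // 2), rng.normal(3, 1, n // 2)])
X = np.outer(clusters, true_dir) + np.outer(rng.normal(0, 1, n), orth_dir)

w0 = np.array([0.0, 1.0])
w_final = pursue(X, w0)
alignment = abs(w_final @ true_dir)
print(w_final, alignment)
```

The renormalization after each update keeps w a unit vector, so only the direction of the projection changes from one iteration to the next.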

Implementation in Python

Projection pursuit can be implemented using the skpp (Scikit-PP) library in Python. It has a Projection Pursuit Regressor to perform projection pursuit regression.

The package can be installed from the Python Package Index (PyPI) with the command below. It provides the skpp module; the example that follows also uses scikit-learn (sklearn).

pip install projection-pursuit

Code to perform projection pursuit regression on a data set

    from skpp import ProjectionPursuitRegressor
    from sklearn.datasets import make_regression

    # Generate a simple regression data set
    X, y = make_regression(n_samples=100,
                           n_features=10,
                           random_state=0)

    # Create a Projection Pursuit Regressor
    # model and fit it to the data
    model = ProjectionPursuitRegressor()
    model.fit(X, y)

    # Use the fitted model to make predictions
    predictions = model.predict(X)

This code generates a synthetic regression data set and fits a projection pursuit regression model to it. Predictions are then made on the same data set.

Advantages

Projection pursuit has some advantages over other dimensionality reduction techniques:

It is a flexible approach: the projection can be customized to fulfill different criteria, according to the goal of the analysis.
It can capture non-linear structure in the data, and it makes no assumption of linearity.
It is more useful for multimodal distributions than many other dimensionality reduction techniques.

Limitations

There are also some limitations of this technique:

It is computationally complex, because it requires an optimization technique such as gradient descent.
The result can depend on how the projection is initialized, because the optimization algorithm may or may not converge. If it converges, it may converge to a local maximum of the projection index instead of the global maximum. Different initializations may therefore lead to different results, and the optimization algorithm will not always find the best projection of the data.
Because the user can specify the projection index, the technique is very flexible, but this flexibility can lead to overfitting. In that case, interpreting the results becomes difficult.
Projection pursuit is not effective for data with very high dimensions, because optimizing the projection index becomes computationally infeasible.
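
A common way to soften the dependence on initialization is to run the optimization from several random starting directions and keep the best result. The sketch below illustrates this under the same assumptions as a simple gradient-ascent search: the index is the absolute excess kurtosis of the projected data, the gradient is a finite-difference estimate, and the names kurtosis_index and ascend are hypothetical. The synthetic data has two non-normal directions, so different starts may settle on different local maxima.

```python
import numpy as np

def kurtosis_index(X, w):
    """Illustrative index: absolute excess kurtosis of the projection X @ w."""
    z = X @ (w / np.linalg.norm(w))
    z = (z - z.mean()) / z.std()
    return abs(np.mean(z ** 4) - 3.0)

def ascend(X, w, eta=0.1, n_iter=300, eps=1e-5):
    """Gradient ascent on the unit sphere with a finite-difference gradient."""
    w = w / np.linalg.norm(w)
    for _ in range(n_iter):
        grad = np.array([
            (kurtosis_index(X, w + eps * e) - kurtosis_index(X, w - eps * e)) / (2 * eps)
            for e in np.eye(w.size)
        ])
        w = w + eta * grad
        w = w / np.linalg.norm(w)
    return w

rng = np.random.default_rng(0)
n = 3000
two_clusters = np.concatenate([rng.normal(-3, 1, n // 2), rng.normal(3, 1, n // 2)])
X = np.column_stack([
    two_clusters / two_clusters.std(),        # strongly non-normal direction
    rng.uniform(-np.sqrt(3), np.sqrt(3), n),  # mildly non-normal direction
    rng.normal(0, 1, n),                      # normal direction
])

# Run the search from several random starting directions.
n_restarts = 8
results = []
for _ in range(n_restarts):
    w = ascend(X, rng.normal(size=3))
    results.append((kurtosis_index(X, w), w))

# Keep the run with the highest index value.
best_score, best_w = max(results, key=lambda r: r[0])
for score, w in results:
    print(round(score, 3), np.round(w, 2))
```

Restarts do not guarantee the global maximum; they only make it less likely that a single unlucky initialization determines the result.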

Applications in the real world

Projection pursuit is used for various purposes. It is well suited to exploratory analysis and feature selection. It is also used for image compression, feature extraction, gene expression analysis, protein structure prediction, economic forecasting, market segmentation, and environmental monitoring. Overall, projection pursuit is a useful tool in statistics and data analysis.
