Get this book -> Problems on Array: For Interviews and Competitive Programming

Reading time: 35 minutes

In case you haven't read about **the K Means Algorithm** yet, it is highly recommended to first go through it, in order to understand best, how **K+ means works**.

K-means is one of the most popular and simplest unsupervised learning algorithms

that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set to a predefined, say K number of clusters.

However, a major drawback of K Means is, that **we need to determine the optimal values of K**, and this is a very **difficult** job. Prior to solving the problem, it is difficult to know which value of K would do best for our particular problem.

**To overcome this difficult problem, we use the K+ Means algorithm**, which incorporates a method to calculate a good value of K for our problems. K+ Means algorithm builds onto K Means algorithm, and is an enhancement to it.

K+ Means is also **great at detecting outliers**, and **forming new clusters** for them. Therefore, **K+ Means is not sensitive to outliers**, unlike the K Means algorithm.

### The K+ Means algorithm

**STEP 1:** Find K clusters using the K-Means algorithm

**STEP 2:** For each of the clusters, calculate MIN, AVG and MAX intra cluster dissimilarity

**STEP 3:** If, for any cluster, AVG value is greater than others, then MIN and MAX values are checked. If the MAX value is higher too, we assign the corresponding data point as a new cluster and go back to step 1, with K+1 clusters.

**STEP 4:** Repeat STEPS 1 to 3 until no new clusters are formed.

### What is intra-cluster dissimilarity?

Intra-cluster dissimilarity, or intra-cluster distance, is a metric to find the distances of each data point from the centroid of the cluster, in order to detect outliers.

**MIN:** Related to point closest to centroid

**MAX:** Related to point furthest away from centroid (potential outlier)

**AVG:** Average of distances of all points with centroid respectively.

*All distances will be calculated as Euclidean distances*

## Example

Let's assume 10 input data points, with points labeled from P1 to P10. We will apply our K+ means algorithm on this input data, in order to understand the algorithm better:

Point | X | Y |
---|---|---|

P1 | 1 | 4 |

P2 | 1 | 3 |

P3 | 2 | 2 |

P4 | 7 | 2 |

P5 | 8 | 3 |

P6 | 9 | 2 |

P7 | 5 | 6 |

P8 | 6 | 7 |

P9 | 7 | 6 |

P10 | 8 | 7 |

Here's how a scatter plot of our input data looks:

We will assume K=2 for this example.

Our initial centroids will be P1 and P5

STEP 1: Find K clusters using the K means algorithm.

We will initially run K-means algorithm, to get the following as the 2 clusters:

Clusters | Points |
---|---|

Cluster1 | {P1,P2,P3} |

Cluster2 | {P4,P5,P6,P7,P8,P9,P10} |

STEP 2: For each of the clusters, calculate MIN, AVG and MAX intra cluster dissimilarity.

**Cluster 1:**

For finding intra-cluster distances, we first will need to find the centroid G1.

G1 = ( 1.33 , 3 )

Point | X | Y | Distance from G1 |
---|---|---|---|

P1 | 1 | 4 | 1.05 |

P2 | 1 | 3 | 0.33 |

P3 | 2 | 2 | 1.20 |

Hence,

MIN = 0.33

MAX = 1.20

AVG = 0.86

**Cluster 2:**

For finding intra-cluster distances, we first will need to find the centroid G2.

G2 = ( 7.14 , 4.71 )

Point | X | Y | Distance from G1 |
---|---|---|---|

P4 | 7 | 2 | 2.71 |

P5 | 8 | 3 | 1.91 |

P6 | 9 | 2 | 3.29 |

P7 | 5 | 6 | 2.50 |

P8 | 6 | 7 | 2.56 |

P9 | 7 | 6 | 1.30 |

P10 | 8 | 7 | 2.45 |

Hence,

MIN = 1.30

MAX = 3.29

AVG = 2.39

STEP 3: If, for any cluster, AVG value is greater than others, then MIN and MAXvalues are checked. If the MAX value is higher too, we assign the correspondingdata point as a new cluster and go back to step 1, with K+1 clusters.

Here, we can clearly see that AVG value of Cluster 2 is much greater than Cluster 1, so we'll compare the MAX values now:

Since 3.29 > 1.20 by a significant margin, we can confirm that Cluster 2 has the presence of an outlier.

MAX value of 3.29 is present in the P6 data point, so that will be the point we can remove from Cluster 2.

STEP 4: Repeat STEPS 1 to 3 until no new clusters are formed.

Clusters | Points |
---|---|

Cluster1 | {P1,P2,P3} |

Cluster2 | {P4,P5,P7,P8,P9,P10} |

Cluster2 | {P6} |

Now, we perform STEP 1 again, to form the following clusters:

Clusters | Points |
---|---|

Cluster1 | {P1,P2,P3} |

Cluster2 | {P7,P8,P9,P10} |

Cluster2 | {P6,P4,P5} |

Again, we perform STEP 2:

**Cluster 1 (unchanged):**

MIN = 0.33

MAX = 1.20

AVG = 0.86

**Cluster 2:**

New G2 = (6.5 , 6.5)

Point | X | Y | Distance from G2 |
---|---|---|---|

P7 | 5 | 6 | 1.58 |

P8 | 6 | 7 | 0.71 |

P9 | 7 | 6 | 0.71 |

P10 | 8 | 7 | 1.58 |

MIN = 0.71

MAX = 1.58

AVG = 1.145

**Cluster 3:**

G3 = (8,2.33)

Point | X | Y | Distance from G1 |
---|---|---|---|

P4 | 7 | 2 | 1.05 |

P5 | 8 | 3 | 0.67 |

P6 | 9 | 2 | 1.05 |

MIN = 0.67

MAX = 1.05

AVG = 0.92

STEP 3:

Cluster 1,2 and 3, we find that the AVG inter-cluster distances are quite similar. Hence, we can safely say that there are no outliers, and that these are our final 3 clusters.

**FINAL CLUSTERS:**

Clusters | Points |
---|---|

Cluster1 | {P1,P2,P3} |

Cluster2 | {P7,P8,P9,P10} |

Cluster2 | {P6,P4,P5} |

### Complexity analysis

Complexity of K Mean algorithm = **O(tkn)**, where:

*t*is the number of iterations*k*is the number of clusters*n*is the number of data points

**K+ Mean algorithm is computationally more expensive as compared to K Means**, since it performs a few extra steps in order to find the better value of K.

However, the simplicity and effectiveness of K+ Means more than makes up for the time complexity.

Complexity of K+ Mean algorithm = **O(t * (k^2) * n)**.

## Conclusion

The K+ Means algorithm takes all the strengths of the widely popular K Means algorithm and remains simple and easy to use and understand. In addition, it removes the weaknesses of the K-Means algorithm as well.