KMeans Clustering using Manual Approach

Created By: Debasis Das (23-Feb-2021)

In this post we will see how KMeans clustering works by first using sklearn's KMeans, then writing the KMeans clustering logic manually, and finally comparing the results.

The manual approach code is a sample that demonstrates the iterative process. We start with random centroids and assign each point the label of the centroid it is closest to. We then move each centroid to the mean of all the points in its cluster. These two steps (update cluster labels, update centroids) are repeated for a given number of iterations.

The manual approach is for concept demonstration only and is not optimized for use in a real project.

Using sklearn KMeans
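
A minimal sketch of the sklearn side, assuming a small synthetic dataset of three 2-D blobs. The data, the seed, and the choices `n_clusters=3` and `n_init=10` are illustrative assumptions, not the values used for the results discussed in this post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed toy data: three well-separated 2-D blobs
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.5, size=(50, 2)),
])

# Fit KMeans and read back the label of each point and the centroids
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(points)
centroids = model.cluster_centers_

print(centroids)       # one row per cluster centroid
print(model.inertia_)  # sum of squared distances to the nearest centroid
```

`cluster_centers_` gives the centroids that the manual approach can later be compared against.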

Implementing the KMeans Logic manually

To implement the KMeans logic ourselves, we will write the following methods:

  1. Update the cluster label of each point based on its proximity to the centroids. Each point is checked against every centroid to find the shortest distance, and the point's label is set to the centroid it is closest to.

  2. Get the new centroid positions as the mean of all the points in each centroid's cluster.

These two steps are repeated for the chosen number of iterations: on each pass the labels are updated and the new centroids are calculated.
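
The two methods and the iteration loop could be sketched as follows. The function names `update_labels` and `update_centroids`, the toy data, and the choice to pick the initial centroids from the data points are assumptions made for illustration. Keeping the old position for an empty cluster is also an added safeguard, not part of the description above.

```python
import numpy as np

def update_labels(points, centroids):
    """Step 1: label each point with the index of its nearest centroid."""
    labels = np.empty(len(points), dtype=int)
    for i, point in enumerate(points):
        # Check the point against every centroid and keep the closest one
        distances = [np.linalg.norm(point - c) for c in centroids]
        labels[i] = int(np.argmin(distances))
    return labels

def update_centroids(points, labels, centroids):
    """Step 2: move each centroid to the mean of its assigned points."""
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = points[labels == k]
        if len(members) > 0:  # an empty cluster keeps its old centroid
            new_centroids[k] = members.mean(axis=0)
    return new_centroids

# Assumed toy data: two well-separated 2-D blobs
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal((0.0, 0.0), 0.5, size=(30, 2)),
    rng.normal((5.0, 5.0), 0.5, size=(30, 2)),
])

# Random initial centroids: k distinct data points chosen at random
k = 2
centroids = points[rng.choice(len(points), size=k, replace=False)].copy()

# Repeat (update labels, update centroids) for a fixed number of passes
n_iterations = 6
for it in range(n_iterations):
    labels = update_labels(points, centroids)
    centroids = update_centroids(points, labels, centroids)
    print(f"pass {it + 1}: centroids =\n{centroids}")
```

The explicit per-point loop mirrors the description in step 1. A vectorized distance computation would be faster but hides the point-by-point check.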

Finally, we write a function to plot the cluster points and centroids in a scatter plot.
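
One possible shape for such a plotting function, using matplotlib. The name `plot_clusters`, the colors, the marker styles, and the hand-made sample data are all illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_clusters(points, labels, centroids, title="KMeans clusters"):
    """Scatter the points colored by cluster label and mark the centroids."""
    fig, ax = plt.subplots(figsize=(6, 5))
    # Points: one color per cluster label
    ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="viridis", s=20)
    # Centroids: large red crosses drawn on top of the points
    ax.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="X",
               s=200, label="centroids")
    ax.set_title(title)
    ax.legend()
    return ax

# Assumed sample input: six points, two clusters, centroids at cluster means
points = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                   [5.0, 5.0], [5.3, 4.8], [4.7, 5.4]])
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

ax = plot_clusters(points, labels, centroids, title="Sample pass")
```

Calling `plt.show()` afterwards displays the figure in an interactive session. Calling the function once per pass gives one plot per iteration.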

First Pass using random initial centroid points

Second Pass using the updated cluster labels and centroids

Third Pass using the updated cluster labels and centroids

Fourth Pass

Fifth Pass

Sixth Pass

We could continue iterating and watch the centroids keep updating. The sklearn centroids and the manual-approach centroids are close, though not exactly the same. The manual result may move closer with more iterations.

One of the key challenges of KMeans clustering is choosing the initial cluster centroids: the clusters that form depend on that choice.
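
A small sketch of this sensitivity, assuming four hand-picked points at the corners of a 4 x 1 rectangle and two hand-picked sets of initial centroids. The helper `run_kmeans` is a hypothetical, compact version of the manual loop, written only for this illustration.

```python
import numpy as np

def run_kmeans(points, centroids, n_iterations=10):
    """Plain KMeans loop; returns final centroids, labels, and inertia."""
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(n_iterations):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Assumes no cluster ends up empty, which holds for this toy data
        centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    # Final assignment and sum of squared distances to the nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists.min(axis=1) ** 2).sum())
    return centroids, labels, inertia

# Four corners of a wide rectangle
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# Start A: centroids on the left and right edges -> left/right clusters
_, labels_a, inertia_a = run_kmeans(points, [[0.0, 0.5], [4.0, 0.5]])

# Start B: centroids on the bottom and top edges -> bottom/top clusters
_, labels_b, inertia_b = run_kmeans(points, [[2.0, 0.0], [2.0, 1.0]])

print(labels_a, inertia_a)  # left/right split, inertia 1.0
print(labels_b, inertia_b)  # bottom/top split, inertia 16.0
```

Both runs reach a stable state, but start B settles on a much worse grouping. This is why implementations often try several initializations and keep the best one.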