
ML Algos from Scratch – KMeans Clustering


Introduction to K-means Clustering

 

K-means clustering is an unsupervised machine learning algorithm. Unsupervised learning algorithms are applied to unlabeled data (data without defined categories or groups). The purpose of the algorithm is to find clusters (groups) in the data, with the number of clusters being equal to K.

K-means clustering works by randomly initialising ‘k’ cluster centers from among the data points.
If we have m training examples in our data set, (x(1), x(2), …, x(m)), then we randomly initialise k cluster centroids by choosing any k of these m points.

Cluster centers: (μ(1), μ(2), …, μ(K))
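As a rough sketch of this initialisation step (the function name and the use of NumPy are my own choices, not part of any standard API), we can pick k distinct rows of the data matrix at random:

import numpy as np

def initialise_centroids(X, K, seed=None):
    # X has shape (m, n): m training examples with n features each.
    # Choose K distinct rows of X as the initial cluster centers.
    rng = np.random.default_rng(seed)
    indices = rng.choice(X.shape[0], size=K, replace=False)
    return X[indices]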

Then, the algorithm iterates between these two steps:

1. Cluster Assignment Step

It finds the distance of each point from each of the K cluster centers and assigns to each point the index of the cluster center with the minimum distance from that point.

for i = 1 to m
{
    assign c(i) as the index (1 to K) of the cluster center closest to x(i)
}
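A minimal NumPy sketch of this assignment step might look like the following (variable names are illustrative; note that Python indexes clusters from 0 to K-1 rather than 1 to K):

import numpy as np

def assign_clusters(X, centroids):
    # For each of the m points, compute its Euclidean distance to every centroid,
    # then return the index of the closest centroid (the c(i) values above).
    distances = np.linalg.norm(X[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
    return np.argmin(distances, axis=1)  # shape (m,), entries in 0..K-1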

 
       

2. Update Cluster Centers Step

In this step, the center of each cluster is updated by taking the mean of all the points assigned to that cluster.

for k = 1 to K
{
    assign μ(k) as the mean of all data points assigned to cluster k
}
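Continuing the same sketch, the update step takes the mean of the points in each cluster. How an empty cluster is handled is a design choice; re-seeding it with a random data point is one common option assumed here:

import numpy as np

def update_centroids(X, labels, K):
    # Move each centroid to the mean of the points currently assigned to it.
    new_centroids = np.empty((K, X.shape[1]))
    for k in range(K):
        members = X[labels == k]
        if len(members) > 0:
            new_centroids[k] = members.mean(axis=0)
        else:
            # Empty cluster: re-seed it with a randomly chosen data point
            # (an assumption, not the only possible strategy).
            new_centroids[k] = X[np.random.randint(X.shape[0])]
    return new_centroids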

The algorithm alternates between these two steps and stops when no data points change clusters, when the sum of distances is minimised, or when some maximum number of iterations is reached. The algorithm will always converge, but sometimes only a local minimum is reached and the best results are not obtained due to poor initialisation. Hence it is recommended to run the algorithm several times with different random initialisations and keep the best run.
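Putting the pieces together, a sketch of the full loop with several random restarts, reusing the helper functions above, could look like this (the stopping rule, n_init and max_iters values are my own assumptions, not prescriptions):

import numpy as np

def kmeans(X, K, n_init=10, max_iters=100, seed=0):
    # Run K-means n_init times with different random initialisations and keep the best run.
    best_cost, best_labels, best_centroids = np.inf, None, None
    for run in range(n_init):
        centroids = initialise_centroids(X, K, seed=seed + run)
        for _ in range(max_iters):
            labels = assign_clusters(X, centroids)
            new_centroids = update_centroids(X, labels, K)
            if np.allclose(new_centroids, centroids):
                break  # centroids stopped moving, so assignments will not change either
            centroids = new_centroids
        labels = assign_clusters(X, centroids)
        cost = np.sum((X - centroids[labels]) ** 2)  # sum of squared distances to centers
        if cost < best_cost:  # keep the best of the random initialisations
            best_cost, best_labels, best_centroids = cost, labels, centroids
    return best_labels, best_centroids, best_cost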

Choosing K

To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no method for determining the exact value of K, but a reasonable estimate may be obtained by using the Elbow method.

In the elbow method, values of K are tried starting from 1, and the cost function, which is the sum of the distances between points and their cluster centers, is plotted against K. As K increases, this cost will always decrease, which means that when K is very high the cost will be at its minimum, but choosing the maximum K defeats the purpose of clustering. Instead, K is chosen at the point where the cost stops decreasing sharply, also known as the ‘elbow point’.
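To illustrate, here is a small sketch of the elbow method using the kmeans function above (matplotlib and the range of K values tried are my own choices):

import matplotlib.pyplot as plt

def elbow_plot(X, k_values=range(1, 11)):
    # Run K-means for each K and plot the resulting cost; look for the 'elbow'.
    costs = [kmeans(X, K)[2] for K in k_values]
    plt.plot(list(k_values), costs, marker='o')
    plt.xlabel('K (number of clusters)')
    plt.ylabel('Sum of squared distances to cluster centers')
    plt.title('Elbow method')
    plt.show()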

Another way of choosing K is according to your purpose or what you expect your data to show you. For example, in a problem where we have a data set of different T-shirts we can choose K=3 for small, medium and large sized T-shirts.

That is all about K-means clustering. See how you can use K-means clustering for image compression in this article.

Any doubts and suggestions are welcome in the comments. Happy learning.


Saksham Malhotra

After learning python 2 years ago and dabbling in web development, I encountered data science and felt 'Yes, this is what I want to do.' I strongly believe that the world can be changed through the power of data. Among my other interests, I love reading fiction and owning more books than I can possibly read at once.

Linkedin - www.linkedin.com/in/saksham-malhotra-9bb69513b
