Introduction to K-means Clustering
K-means clustering is an unsupervised machine learning algorithm. An unsupervised learning algorithm is applied on unlabeled data(data without defined categories or groups). The purpose of this algorithm is to find clusters(groups) in the data with the number of clusters being equal to K.
The K- means clustering works by randomnly initialisinsg ‘k’ cluster centers from all the data points.
If we have m training examples in our data set, (x(1), x(2),……….., x(m)), then randomly initialise k cluster centroids choosing any k points from m points.
Cluster centers:- (μ(1), μ(2),……….., μ(K))
Then, the algorithm iterates between these two steps –
1. Cluster Assignment Step
It finds the distance of each point from all of these K clusters and assigns to each point a number which corresponds to the cluster which has the minumum distance from that point
for i = 1 to m
assign c(i) as index(1 to K) of cluster center closest to x(i)
2. Update Cluster Centers Step
In this step,the cluster centers of each cluster are updated by taking the mean of all the points assigned to that cluster.
for k = 1 to K
assign μ(k) as the mean of all data points assigned to μ(k)
The algorithm iterates between steps and stops when no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached. The algorithm will always converge but sometimes a local minumumn is reached and best results are not obtained due to poor initialisation. Hence it is recommended to run the algorithm several times with different random initialisations.
In elbow method, values of K are tried starting from 1 and the cost function which is the sum of distances between points and their cluster centers is plotted. As the value of K increases this cost will always decrease which means that when K is very high this cost will be minimum but choosing the maximum K defeats the purpose of clustering. Instead K is choosen at the point where this cost decreases sharply also known as the ‘elbow point’.
Another way of choosing K is according to your purpose or what you expect your data to show you. For example, in a problem where we have a data set of different T-shirts we can choose K=3 for small, medium and large sized T-shirts.
That is all about K-means clustering. See how you can use this K-Means clustering for image compression in this article.
Any doubts and suggestions are welcome in the comments. Happy learning.
Linkedin - www.linkedin.com/in/saksham-malhotra-9bb69513b
Latest posts by Saksham Malhotra (see all)
- ML Algos from Scratch – KMeans Clustering - November 22, 2017
- Compressing Images using K-Means Clustering - November 20, 2017
- Introduction to SoftMax Regression (with codes in Python) - September 24, 2017