This is part three of the Scikit-learn series, which is as follows:
- Part 1 – Introduction
- Part 2 – Supervised learning in Scikit-learn
- Part 3 – Unsupervised Learning in Scikit-learn (this article)
A quick recap:
Unsupervised learning is a type of machine learning that works on datasets of input data without labeled responses or target values. A common goal is to discover groups of similar examples within the data.
What does Scikit-learn offer for unsupervised learning?
We have already seen what scikit-learn offers for unsupervised learning, so let us list again the families of algorithms available to us:
1. Gaussian mixture models
2. Manifold learning (an approach to non-linear dimensionality reduction)
3. Clustering
4. Principal component analysis (PCA)
We will discuss only the algorithms that involve code and need an implementation; the remaining ones need only a mathematical explanation.
from sklearn import mixture  # import the mixture module
clf = mixture.GaussianMixture(n_components=2, covariance_type='full')  # you can choose the number of components yourself
clf.fit(X)  # fit the model on your training data X, an array of shape (n_samples, n_features)
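To make the snippet above concrete, here is a minimal sketch that fits the same kind of model on synthetic data. The two blobs, the seed, and the variable names (`blob_a`, `blob_b`, `gmm`) are assumptions made only for illustration; they are not part of the original example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (an assumption for illustration): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2))
X = np.vstack([blob_a, blob_b])

# Same estimator and parameters as above; random_state only makes the run repeatable.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

labels = gmm.predict(X)        # hard assignment: one component index per sample
probs = gmm.predict_proba(X)   # soft assignment: one probability per component

print(gmm.means_)    # estimated centre of each Gaussian component
print(gmm.weights_)  # estimated share of the data in each component
```

Unlike k-means, a Gaussian mixture gives each sample a probability of belonging to every component, which is what `predict_proba` returns here.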
There are many clustering algorithms to choose from, but we will discuss the most widely used one: k-means clustering.
You can read more about K-Means Clustering here.
Also, check out how it can be used to compress images here.
The main idea is to define k centroids, one for each cluster. The centroids should be placed carefully, because different starting locations lead to different results. A good choice is therefore to place them as far away from each other as possible. The next step is to take each point in the data set and associate it with the nearest centroid. When no point is left unassigned, the first pass is complete. At this point we recalculate the k centroids as the mean of the points assigned to each cluster, then repeat the assignment and update steps until the centroids stop moving, which gives the final clusters.
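The steps above can be sketched in plain NumPy. This is only an illustration of the assign-and-update loop, not scikit-learn's implementation (which, by default, also picks its starting centroids with k-means++). The helper name `kmeans_sketch` and its parameters `init_centroids` and `max_iter` are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, init_centroids, max_iter=100):
    """Hypothetical helper: plain k-means starting from the given centroids."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(max_iter):
        # Assignment step: distance from every point to every centroid,
        # then pick the index of the nearest centroid for each point.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Two different starting placements for the k = 2 centroids.
labels_a, centers_a = kmeans_sketch(X, init_centroids=[[1, 2], [4, 2]])
labels_b, centers_b = kmeans_sketch(X, init_centroids=[[1, 4], [1, 0]])

print(labels_a, centers_a)
print(labels_b, centers_b)
```

The two runs end in different clusterings of the same six points: the first start splits them into a left and a right group, while the second gets stuck with a top group of four and a bottom group of two. This is why the initial placement of the centroids matters.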
from sklearn.cluster import KMeans  # import statement
import numpy as np  # NumPy for arrays

# training data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)  # only 2 clusters are used
kmeans.labels_  # labels of each point
array([0, 0, 0, 1, 1, 1])
kmeans.predict([[1, 1], [4, 0]])  # predict the closest cluster for each new point
kmeans.cluster_centers_ # centres of clusters are given
array([[ 1., 2.], [ 4., 2.]])
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)  # only one parameter is set, to keep the example basic
pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None, svd_solver='auto', tol=0.0, whiten=False)
print(pca.explained_variance_ratio_)  # fraction of the variance explained by each of the selected components
[ 0.99244289 0.00755711]
print(pca.singular_values_)  # the singular values corresponding to each of the selected components
[ 6.30061232 0.54980396]
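To see where these two outputs come from, here is a minimal NumPy sketch that reproduces them with a singular value decomposition of the centered data. It assumes the textbook PCA recipe (center, decompose, square the singular values) and is not a copy of scikit-learn's internals; the variable names are chosen only for this example.

```python
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

# PCA works on centered data, so subtract the mean of each feature first.
X_centered = X - X.mean(axis=0)

# SVD of the centered data; the singular values come back in descending order.
U, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance along each principal axis, then each one's share of the total.
explained_variance = singular_values**2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

print(singular_values)
print(explained_variance_ratio)
```

The printed values should agree with `pca.singular_values_` and `pca.explained_variance_ratio_` above: the first component carries about 99% of the variance, so this data set is almost one-dimensional.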
Take it all with you, guys!
I hope you all had a great time reading this scikit-learn series. Obviously, this series will not make you a machine learning god, but practice can! It will, however, help you take the first step on your path.
Those who didn’t know anything about scikit-learn can now at least write some code on their own.
See you soon with something more interesting. Until then, practice what you have learned.
Happy to help you all!