This is part three of the Scikit-learn series, which is as follows:

- Part 1 – Introduction
- Part 2 – Supervised learning in Scikit-learn
- Part 3 – Unsupervised Learning in Scikit-learn (this article)

### A quick recap

Unsupervised learning is a family of machine learning algorithms whose goal is to discover groups of similar examples within a dataset consisting of input data without labeled responses or target values.

## What does Scikit-Learn offer in its unsupervised package?

Having seen what scikit-learn offers for unsupervised learning, let us look again at the varieties of algorithms available to us:

1. Gaussian mixture models
2. Manifold learning (an approach to non-linear dimensionality reduction)
3. Clustering
4. Principal component analysis (PCA)

**We will discuss only those algorithms that involve code and need implementation; the remaining ones need only mathematical explanation.**

### Gaussian mixture models

```
from sklearn import mixture  # import statement
import numpy as np

# sample training data: two separated groups of points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
clf = mixture.GaussianMixture(n_components=2, covariance_type='full')  # choose the number of components yourself
clf.fit(X)  # fit the model on the training data
```
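Once fitted, a Gaussian mixture model can assign new points to a component, either as a hard label or as soft (probabilistic) memberships. Here is a minimal, self-contained sketch; the data values are made up purely for illustration:

```python
import numpy as np
from sklearn import mixture

# toy data: two well-separated blobs (values chosen for illustration)
X = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.4],
              [8.0, 8.0], [8.4, 7.7], [7.7, 8.3]])

gmm = mixture.GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(X)

labels = gmm.predict(X)       # hard cluster assignment per point
probs = gmm.predict_proba(X)  # soft membership: one probability per component, rows sum to 1
```

Unlike k-means, the soft memberships let you see how confidently each point belongs to its component.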

### Clustering

Though there are many clustering algorithms to choose from, we will discuss the most widely used one: **k-means clustering**.


The main idea is to define k centroids, one per cluster. These centroids should be placed carefully, because different starting locations lead to different results; a good choice is to place them as far away from each other as possible. The next step is to take each point in the data set and associate it with the nearest centroid. When no point is pending, the first step is complete. We then re-calculate the k centroids as the means of their assigned points and repeat the assignment, iterating until the clusters stop changing.
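The iterative procedure described above can be sketched by hand in plain NumPy before turning to scikit-learn's implementation. This is a minimal illustration of one assign-then-update step, not scikit-learn's actual code; the data and starting centroids are chosen to match the example below:

```python
import numpy as np

def kmeans_step(X, centroids):
    """One iteration of k-means: assign points to centroids, then move centroids."""
    # distance from every point to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # nearest centroid for each point
    # re-compute each centroid as the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)
centroids = X[[0, 3]].copy()   # start the centroids far apart, as suggested above
for _ in range(10):            # iterate until assignments stabilise (10 is plenty here)
    labels, centroids = kmeans_step(X, centroids)
```

On this tiny data set the assignments stabilise after the first pass, with one centroid per column of points.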

```
from sklearn.cluster import KMeans  # import statement
import numpy as np  # NumPy for the data arrays

X = np.array([[1, 2], [1, 4], [1, 0],  # training data
              [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)  # fit with only 2 clusters
kmeans.labels_  # label of each point
```

```
kmeans.predict([[1, 1], [4, 0]])  # assign new points to the nearest learned cluster
```

```
kmeans.cluster_centers_ # centres of clusters are given
```

### Principal component analysis (PCA)

PCA decomposes the data into a set of orthogonal components that explain the maximum amount of variance, which makes it a standard tool for linear dimensionality reduction.

```
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)  # keep both components; n_components is the only parameter we set here
pca.fit(X)
```

```
print(pca.explained_variance_ratio_)# Percentage of variance explained by each of the selected components
```

```
print(pca.singular_values_) #The singular values corresponding to each of the selected components.
```
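A fitted PCA object can also project the data onto its components, which is how it is used for actual dimensionality reduction. A minimal, self-contained sketch on the same data, keeping only the first component:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=1)                  # keep only the first principal component
X_reduced = pca.fit_transform(X)           # project the 2-D points onto a single axis
X_back = pca.inverse_transform(X_reduced)  # approximate reconstruction back in 2-D
print(X_reduced.shape)  # (6, 1)
```

On this data the first component captures almost all of the variance, so the reconstruction loses very little information.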

## Take it all with you!

I hope you had a great time reading this scikit-learn series. Obviously, the series alone will not make you a machine learning expert, but practice can! It will, however, definitely help you take your first step along that path.

Those who knew nothing about scikit-learn can now at least write some code on their own.

See you soon with something more interesting; until then, practice what you have learned.

Happy to help you all!

### Deepanshu Gaur

Loves to learn new technologies, and this attitude keeps him going. Fast learner with a strong foundation in data structures & algorithms.

https://www.linkedin.com/in/deepanshu-g-37a42899

