Welcome To The World Of Scikit-Learn!
This is part one of the Scikit-learn series, which is as follows:
- Part 1 – Introduction (this article)
- Part 2 – Supervised learning in Scikit-learn
- Part 3 – Unsupervised Learning in Scikit-learn
New to machine learning? Not sure how to get started with scikit-learn? Then hang on: this article introduces a free, open-source library that will help you build your skills.
Before moving on to the features it offers, let us first understand what scikit-learn actually is.
Scikit-learn is a machine learning library for the Python programming language. It offers classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and it is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
We will discuss the algorithms and their implementation, with code, in the later parts of this series.
Supervised Algorithms In Scikit-Learn
Machine learning algorithms are commonly divided into two broad types: supervised and unsupervised.
So, let's see what scikit-learn offers for supervised learning.
Supervised learning problems can be broken into two categories:
Classification: Samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning: there is a limited number of categories, and for each of the n samples provided, we try to assign the correct category or class.
Regression: If the desired output consists of one or more continuous variables, the task is called regression. An example of a regression problem is predicting the yield of a chemical manufacturing process, where the inputs are the concentrations of the reactants, the temperature, and the pressure.
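To make the two problem types concrete, here is a minimal sketch of each. The dataset choices (digits and diabetes), the 25% test split, the SVC gamma value, and the variable names are all assumptions made for illustration; they are not the only, or necessarily the best, way to set these problems up.

```python
from sklearn.datasets import load_diabetes, load_digits
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# --- Classification: predict a discrete label (digit 0-9) from pixel values ---
digits = load_digits()
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)
clf = SVC(gamma=0.001)  # gamma is an illustrative choice, not a tuned value
clf.fit(Xc_train, yc_train)
clf_accuracy = clf.score(Xc_test, yc_test)  # mean accuracy on held-out samples
print("classification accuracy:", clf_accuracy)

# --- Regression: predict a continuous target (disease progression score) ---
diabetes = load_diabetes()
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=0
)
reg = LinearRegression()
reg.fit(Xr_train, yr_train)
reg_pred = reg.predict(Xr_test)  # one continuous value per test sample
print("first regression predictions:", reg_pred[:3])
```

Notice that both models are used the same way: fit on labeled training data, then predict on data the model has not seen. Only the kind of output differs.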
Scikit-learn supports the following supervised models:
- Generalized Linear Models
- Linear and Quadratic Discriminant Analysis
- Kernel ridge regression
- Support Vector Machines
- Stochastic Gradient Descent
- Nearest Neighbors
- Gaussian Processes
- Cross decomposition
- Naive Bayes
- Decision Trees
- Ensemble methods
- Multiclass and multilabel algorithms
- Feature selection
- Isotonic regression
- Probability calibration
- Neural network models (supervised)
Unsupervised Algorithms In Scikit-Learn
Now, let's see what scikit-learn offers for unsupervised learning.
In unsupervised learning, the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data (clustering), to determine the distribution of data within the input space (density estimation), or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
Scikit-learn supports these models:
- Gaussian mixture models
- Manifold learning
- Decomposing signals in components (matrix factorization problems)
- Covariance estimation
- Novelty and Outlier Detection
- Density Estimation
- Neural network models (unsupervised)
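As a small illustration of two of the goals described above, the sketch below clusters the iris measurements with k-means and projects them down to two dimensions with PCA. The dataset, the choice of three clusters, and the variable names are assumptions for the example; the target labels are deliberately ignored, since unsupervised learning works without them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Use only the input vectors; the targets are not passed to any estimator.
X = load_iris().data

# Clustering: group similar samples together (3 clusters is an assumption).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)
print("cluster of first 10 samples:", cluster_labels[:10])

# Dimensionality reduction: project 4 features down to 2 for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("projected shape:", X_2d.shape)
```

The two columns of X_2d could then be passed to a plotting library, colored by cluster_labels, to visualize the groups.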
Model Selection and Evaluation
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test.
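The hold-out idea can be sketched as follows. The breast cancer dataset, the 30% split, and the decision tree are illustrative assumptions; the point is only to compare the score on data the model has seen with the score on data it has not.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the samples as a test set that is never used for fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained decision tree can largely memorize its training data.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_score = tree.score(X_train, y_train)  # evaluated on data it has seen
test_score = tree.score(X_test, y_test)     # evaluated on held-out data
print("train accuracy:", train_score)
print("test accuracy:", test_score)
```

The training score is the optimistic number that the paragraph above warns about; the test score is the more honest estimate of how the model behaves on new data.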
Model selection and evaluation covers the following topics:
- Cross-validation: evaluating estimator performance
- Tuning the hyper-parameters of an estimator
- Model evaluation: quantifying the quality of predictions
- Model persistence
- Validation curves: plotting scores to evaluate models
Note: How these are implemented will be discussed in detail later in this series.
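As a brief preview of the first two topics, the sketch below scores a classifier with 5-fold cross-validation and then searches a small hyper-parameter grid. The iris dataset, the SVC estimator, and the grid values are assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Cross-validation: fit and score the estimator on 5 different train/test folds.
scores = cross_val_score(SVC(kernel="linear", C=1), X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())

# Hyper-parameter tuning: try every combination in a small, illustrative grid,
# scoring each one with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```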
Dataset Transformations
Scikit-learn provides a library of transformers, which may clean, reduce, expand, or generate feature representations. These are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method, which applies this transformation model to unseen data. The fit_transform method may be more convenient and efficient for modeling and transforming the training data simultaneously.
Dataset transformations have the following sub-categories:
- Pipeline and FeatureUnion: combining estimators
- Feature extraction
- Preprocessing data
- Unsupervised dimensionality reduction
- Random Projection
- Kernel Approximation
- Pairwise metrics, Affinities, and Kernels
- Transforming the prediction target (y)
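The fit / transform / fit_transform pattern can be sketched with StandardScaler, which learns a mean and standard deviation per column. The tiny arrays and variable names are made up for illustration, and the short pipeline at the end is just one assumed way of combining a transformer with an estimator.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, invented for this example.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                         # learns per-column mean and scale
X_train_scaled = scaler.transform(X_train)  # applies them to the training data
X_new_scaled = scaler.transform(X_new)      # reuses them on unseen data
print("learned means:", scaler.mean_)

# fit_transform does the fit and the transform of the training data in one call.
X_train_scaled_again = StandardScaler().fit_transform(X_train)

# A pipeline chains a transformer and an estimator into a single estimator.
X_iris, y_iris = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_iris, y_iris)
pipe_accuracy = pipe.score(X_iris, y_iris)  # accuracy on the training data
print("pipeline training accuracy:", pipe_accuracy)
```

Note that the unseen sample is scaled with the statistics learned from the training set, not with its own.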
Dataset Loading Utilities
The sklearn.datasets package embeds some small toy datasets and provides helpers for fetching larger, real-world ones.
Some of the datasets scikit-learn makes available are:
- The Olivetti faces dataset
- The 20 newsgroups text dataset
- Downloading datasets from the mldata.org repository (newer scikit-learn releases use OpenML through fetch_openml instead)
- The Labeled Faces in the Wild face recognition dataset
- Forest cover types
- RCV1 dataset
- Boston House Prices dataset
- Breast Cancer Wisconsin (Diagnostic) Database
- Diabetes dataset
- Optical Recognition of Handwritten Digits Data Set
- Iris Plants Database
and many more.
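Loading one of the bundled datasets takes a single function call. The sketch below loads the iris and digits datasets and inspects a few of their attributes; the variable names are arbitrary.

```python
from sklearn.datasets import load_digits, load_iris

# Iris: 150 flowers, 4 measurements each, 3 species.
iris = load_iris()
print(iris.data.shape)          # (150, 4)
print(iris.target_names)        # names of the 3 species
print(iris.feature_names)       # names of the 4 measurements

# Digits: 1797 grayscale images of handwritten digits, 8x8 pixels each.
digits = load_digits()
print(digits.images.shape)      # (1797, 8, 8)
print(digits.data.shape)        # (1797, 64), the same images flattened
```

The data attribute is the feature matrix and target holds the labels, which is the layout that the estimators expect.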
Strategies to scale computationally: bigger data
For some applications, the number of examples, features and/or the speed at which they need to be processed are challenging for traditional approaches. In these cases, scikit-learn has a number of options you can consider to make your system scale.
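One such option is incremental (out-of-core) learning, where an estimator that supports partial_fit is trained one batch at a time instead of on the whole dataset at once. The sketch below is a minimal illustration under stated assumptions: make_batch is a hypothetical stand-in for reading a chunk from disk or a stream, and the synthetic data, batch sizes, and number of batches are invented for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)


def make_batch(n_samples=200):
    """Hypothetical stand-in for reading one chunk of data from disk."""
    X = rng.randn(n_samples, 5)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic, linearly separable
    return X, y


# partial_fit needs the full set of classes up front, because any single
# batch might not contain every class.
classes = np.array([0, 1])
clf = SGDClassifier(random_state=0)

# Train on 20 batches; only one batch is held in memory at a time.
for _ in range(20):
    X_batch, y_batch = make_batch()
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on a fresh batch the model has not been trained on.
X_eval, y_eval = make_batch(500)
eval_accuracy = clf.score(X_eval, y_eval)
print("accuracy on a fresh batch:", eval_accuracy)
```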
For some applications, the performance of estimators at prediction time, mainly latency and throughput, is crucial.
Scikit-learn gives us ways to reason about and improve both of the following:
- Prediction Latency
- Prediction Throughput
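Latency and throughput can be measured with nothing more than a timer. The sketch below is a rough illustration, not a rigorous benchmark: the random forest, the digits dataset, and the repeat count are assumptions, and the numbers will vary from machine to machine.

```python
import time

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Latency: average time to predict one sample at a time.
single_sample = X[:1]
n_repeats = 100
start = time.perf_counter()
for _ in range(n_repeats):
    model.predict(single_sample)
latency_s = (time.perf_counter() - start) / n_repeats
print("latency per prediction (seconds):", latency_s)

# Throughput: predictions per second when predicting in bulk.
start = time.perf_counter()
model.predict(X)
bulk_s = time.perf_counter() - start
throughput = len(X) / bulk_s
print("throughput (predictions per second):", throughput)
```

Predicting in bulk usually gives a much higher throughput than predicting one sample at a time, because the fixed overhead of each call is shared across many samples.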
More coming soon….
We have now covered most of the major features that scikit-learn offers, and by this point you should have a sense of how powerful yet easy to use this library is. It's easy to get intimidated by seeing so much at a glance, but don't worry: you will learn it in the easiest way possible, so stick with us on this journey.
Get ready to dive deep into implementations of these features here.