Scikit Learn – Part 1 – Introduction

Welcome To The World Of Scikit Learn!

This is part one of the Scikit-learn series, which is as follows:

Introduction

New to machine learning? Not sure how to get started with this library? Then hang on: you are about to get started with a free library that will help you boost your skills.
Before moving on to the features it offers, let us understand what scikit-learn actually is.
Scikit-learn is a machine learning library for the Python programming language. It offers classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and it is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
We will discuss each algorithm and its implementation, with code, in detail in the second part of this series.

Supervised Algorithms In Scikit-Learn

Machine learning algorithms are commonly divided into two types: supervised and unsupervised.
So, let's see what scikit-learn offers us for supervised learning.
Supervised learning problems can be broken into two categories:

Classification: Samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning: there is a limited number of categories, and for each of the n samples provided, the task is to label it with the correct category or class.

Regression: If the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the yield in a chemical manufacturing process in which input consists of the concentration of reactants, temperature, and the pressure.

Scikit-learn supports the following supervised models and tools:

  1. Generalized Linear Models
  2. Linear and Quadratic Discriminant Analysis
  3. Kernel ridge regression
  4. Support Vector Machines
  5. Stochastic Gradient Descent
  6. Nearest Neighbors
  7. Gaussian Processes
  8. Cross decomposition
  9. Naive Bayes
  10. Decision Trees
  11. Ensemble methods
  12. Multiclass and multilabel algorithms
  13. Feature selection
  14. Semi-Supervised
  15. Isotonic regression
  16. Probability calibration
  17. Neural network models (supervised)
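To make the classification and regression tasks described above concrete, here is a minimal sketch. The choice of estimators (SVC and LinearRegression), the gamma value, and the simple "hold back the last rows" split are assumptions made for illustration, not the only way to do it.

```python
from sklearn.datasets import load_diabetes, load_digits
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

# Classification: predict a discrete digit label (0-9) from 8x8 pixel values.
digits = load_digits()
X_digits, y_digits = digits.data, digits.target

clf = SVC(gamma=0.001)                    # gamma chosen for illustration
clf.fit(X_digits[:-100], y_digits[:-100])  # train on all but the last 100 samples
predicted = clf.predict(X_digits[-100:])   # predict the held-back samples
accuracy = (predicted == y_digits[-100:]).mean()
print("digit accuracy:", accuracy)

# Regression: predict a continuous target from numeric features.
X_diab, y_diab = load_diabetes(return_X_y=True)

reg = LinearRegression()
reg.fit(X_diab[:-20], y_diab[:-20])
reg_pred = reg.predict(X_diab[-20:])
print("first regression predictions:", reg_pred[:3])
```

Both estimators share the same fit and predict methods, which is what makes it easy to swap one model from the list above for another.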

Unsupervised Algorithms In Scikit-Learn

Now, let's see what scikit-learn offers us for unsupervised learning.

In unsupervised learning, the training data consists of a set of input vectors x without any corresponding target values. The goal may be to discover groups of similar examples within the data (clustering), to determine the distribution of data within the input space (density estimation), or to project the data from a high-dimensional space down to two or three dimensions for visualization.

Scikit-learn supports the following unsupervised models:

  1. Gaussian mixture models
  2. Manifold learning
  3. Clustering
  4. Biclustering
  5. Decomposing signals in components (matrix factorization problems)
  6. Covariance estimation
  7. Novelty and Outlier Detection
  8. Density Estimation
  9. Neural network models (unsupervised)
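As a rough illustration of two of the goals described above, clustering and projection to two dimensions, the sketch below runs KMeans and PCA on the Iris measurements while ignoring the labels. The number of clusters (3) and the random seed are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Use only the input vectors; the target labels are deliberately ignored.
X = load_iris().data

# Clustering: group similar samples into 3 clusters (an assumed value).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Dimensionality reduction: project the 4 features down to 2 for plotting.
X_2d = PCA(n_components=2).fit_transform(X)

print(labels[:10])
print(X_2d.shape)  # (150, 2)
```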

Model Selection and Evaluation

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that simply repeats the labels of the samples it has just seen would get a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test.
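Holding out a test set can be sketched with train_test_split. The 25% test size, the random seed, and the k-nearest-neighbors classifier are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Score on the held-out data to estimate performance on unseen samples.
test_score = model.score(X_test, y_test)
print("held-out accuracy:", test_score)
```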

Model selection and evaluation covers the following:

  1. Cross-validation: evaluating estimator performance
  2. Tuning the hyper-parameters of an estimator
  3. Model evaluation: quantifying the quality of predictions
  4. Model persistence
  5. Validation curves: plotting scores to evaluate models
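The first two items in this list can be sketched briefly with cross_val_score and GridSearchCV. The SVC estimator, the 5 folds, and the small parameter grid are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Cross-validation: fit and score the estimator on 5 different splits.
scores = cross_val_score(SVC(kernel="linear", C=1), X, y, cv=5)
print("fold scores:", scores)
print("mean score:", scores.mean())

# Hyper-parameter tuning: try every combination in a small, assumed grid.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
```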

Note: How each of these is implemented will be discussed later in this series.

Dataset Transformations

Dataset transformations are represented by classes (transformers) with a fit method, which learns model parameters (e.g. the mean and standard deviation for normalization) from a training set, and a transform method, which applies the learned transformation to unseen data. The fit_transform method may be more convenient and efficient when you want to fit and transform the training data in one step.
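The fit / transform / fit_transform pattern can be sketched with StandardScaler. The tiny arrays below are made-up values used only to show which data each method sees.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data and one unseen sample.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                # learns per-column mean and standard deviation
print(scaler.mean_)                # [ 2. 20.]

X_train_scaled = scaler.transform(X_train)  # apply to the training data
X_new_scaled = scaler.transform(X_new)      # reuse the SAME parameters on unseen data

# fit_transform does the fit and the transform of the training data in one call.
X_train_scaled_again = StandardScaler().fit_transform(X_train)
```

Note that the unseen sample is scaled with the mean and standard deviation learned from the training set, not with statistics of its own.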

It has the following sub-categories:

  1. Pipeline and FeatureUnion: combining estimators
  2. Feature extraction
  3. Preprocessing data
  4. Unsupervised dimensionality reduction
  5. Random Projection
  6. Kernel Approximation
  7. Pairwise metrics, Affinities, and Kernels
  8. Transforming the prediction target (y)
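As a small sketch of the first item, combining estimators, the example below chains two transformers and a classifier with make_pipeline. The particular steps (scaling, PCA to 2 components, logistic regression) are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Each step's output feeds the next; the last step is the predictor.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=200),
)

pipe.fit(X, y)             # fits every step in order
preds = pipe.predict(X[:5])  # applies the transforms, then predicts
print(preds)
```

Treating the whole chain as one estimator means it can be passed to cross-validation or grid search like any single model.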

Dataset Loading Utilities

The sklearn.datasets package embeds some small toy datasets and also provides loaders that download larger real-world datasets.

Some of the datasets available through scikit-learn are:

  1. The Olivetti faces dataset
  2. The 20 newsgroups text dataset
  3. Downloading datasets from the mldata.org repository (the fetch_mldata loader was removed in scikit-learn 0.22; fetch_openml is its replacement)
  4. The Labeled Faces in the Wild face recognition dataset
  5. Forest cover types
  6. RCV1 dataset
  7. Boston House Prices dataset (load_boston was removed in scikit-learn 1.2)
  8. Breast Cancer Wisconsin (Diagnostic) Database
  9. Diabetes dataset
  10. Optical Recognition of Handwritten Digits Data Set
  11. Iris Plants Database

and many more…
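Loading the bundled datasets might look like the sketch below. It uses only loaders that ship with the library and need no download; the three datasets picked here are just examples from the list above.

```python
from sklearn.datasets import load_breast_cancer, load_digits, load_iris

iris = load_iris()
print(iris.data.shape)           # (150, 4)
print(list(iris.target_names))   # ['setosa', 'versicolor', 'virginica']

digits = load_digits()
print(digits.data.shape)         # (1797, 64)

cancer = load_breast_cancer()
print(cancer.data.shape)         # (569, 30)

# return_X_y=True skips the container object and returns the arrays directly.
X, y = load_iris(return_X_y=True)
```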

Strategies to Scale Computationally: Bigger Data

For some applications, the number of examples, features and/or the speed at which they need to be processed are challenging for traditional approaches. In these cases, scikit-learn has a number of options you can consider to make your system scale.
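One such option is incremental (out-of-core) learning, where an estimator is updated one mini-batch at a time through partial_fit. The sketch below is only an illustration: the synthetic data, the batch size of 100, and the choice of SGDClassifier are assumptions, and in practice the batches would be streamed from disk or a network source rather than sliced from an in-memory array.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for data that would normally be too large for memory.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
classes = np.unique(y)  # partial_fit needs the full set of labels up front

clf = SGDClassifier(random_state=0)

batch_size = 100  # assumed value
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # Update the model with this batch only; earlier batches are not kept.
    clf.partial_fit(X_batch, y_batch, classes=classes)

score = clf.score(X, y)
print("accuracy after incremental training:", score)
```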

Computational Performance

For some applications, the performance of estimators (mainly latency and throughput at prediction time) is crucial.
Scikit-learn's guidance here is not a set of functions to call but two measures to understand and keep an eye on:

  1. Prediction Latency
  2. Prediction Throughput
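A rough way to get a feel for these two measures is to time predictions yourself, once for a whole batch and once sample by sample. The sketch below uses a Ridge model on synthetic data, both chosen arbitrarily; the timings depend entirely on the machine, so no expected numbers are shown.

```python
import time

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=1000, n_features=50, random_state=0)
model = Ridge().fit(X, y)

# Bulk mode: one predict call for all samples (relates to throughput).
start = time.perf_counter()
bulk_pred = model.predict(X)
bulk_seconds = time.perf_counter() - start

# One-at-a-time mode: one predict call per sample (relates to latency).
start = time.perf_counter()
atomic_pred = [model.predict(X[i:i + 1])[0] for i in range(len(X))]
atomic_seconds = time.perf_counter() - start

print("throughput (predictions/second, bulk):", len(X) / bulk_seconds)
print("mean latency (seconds/prediction, one at a time):", atomic_seconds / len(X))
```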

More coming soon….

We have now covered almost every feature that scikit-learn offers, and by this point you should understand the importance of this easy-to-use yet powerful library. It's easy to get intimidated by seeing so much at one glance, but don't worry: you will learn it in the easiest way, so stick with us on this journey.
Get ready to dive deep into implementations of these features here.

 

Deepanshu Gaur

A technology lover and computer hardware enthusiast. If Gaming is my love then Machine learning is my passion.
Loves to learn new technologies and this attitude keeps me going.
Fast learner.
Strong foundation in data structures & algorithms.

https://www.linkedin.com/in/deepanshu-g-37a42899