Training a machine and visualizing the results are great fun, and so is learning the algorithms that train and test on the data!
This article explains one of the most interesting supervised machine learning algorithms: the Support Vector Machine (SVM).
The Support Vector Machine is a classification algorithm, and it is considered quite distinct among classification algorithms (the reasons will be discussed below). SVMs are broadly used for classification but also contribute to regression problems. They are regarded as highly effective at distinguishing/separating classes.
The basic question that arises before implementing SVM is, why do we need to classify the data? Let us understand it through an example.
Consider two classes, class I and class II, each with its own distinct observations on a 2-D plane.
When a new observation is introduced, we need to figure out which class it belongs to. To identify the class of the new observation based on the parameters of the previous observations, we need a classification algorithm to separate one class from the other.
For classifying or separating, a decision boundary line is required that distinguishes the points of one class from the points of the other in the BEST possible way, so that when a new observation is recorded, the machine assigns it to the correct class.
Any number of such decision boundary lines can be drawn to classify the points, but each line has its own consequences for future predictions, so the OPTIMAL boundary must be selected. Hence the need for SVM when classifying the data.
The question now is how SVM searches for the optimal boundary line for classifying the data.
This line is selected by the MAXIMUM MARGIN criterion: the chosen boundary is equidistant from the support vectors of each class, and among all such boundaries the one whose total distance to the support vectors is largest is picked — hence the term maximum margin. Support vectors are the most extreme points of a class, the ones that sit closest to the other class rather than among the common type (for an n-dimensional space, the generalized term 'vector' is used). Even if all the points of a class are removed except the support vectors, the result is unaffected. In this sense the support vectors actually support the whole algorithm.
SVM is distinct from other classification algorithms because the extreme points (support vectors), rather than the most common ones, are used to train the model. This is how Support Vector Machines work when training and testing on data.
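As a toy illustration of the margin idea (the points and the boundary below are made up, not taken from the article), the distance from a point to a candidate boundary line can be computed with the standard point-to-line formula:

```r
# Distance from point (x0, y0) to the line a*x + b*y + c = 0
point_line_dist <- function(x0, y0, a, b, c) {
  abs(a * x0 + b * y0 + c) / sqrt(a^2 + b^2)
}

# Hypothetical boundary x - y = 0 (a = 1, b = -1, c = 0) and two
# made-up support vectors, one from each class
d1 <- point_line_dist(2, 0, 1, -1, 0)  # nearest point of class I
d2 <- point_line_dist(0, 2, 1, -1, 0)  # nearest point of class II
c(d1, d2)  # both sqrt(2): this boundary is equidistant from the two support vectors
```

Among all boundaries that separate the classes, SVM selects the one for which this (equal) distance to the support vectors is largest.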
Now let us try implementing SVM in R.
Dataset- social network advertisements
This dataset contains information on the users of a social network. The social network has several business clients, who put advertisements on social media for their marketing campaigns. One of the clients is a car company that has just launched a brand-new model at a ridiculously high price.
This company gathered information on users who responded to the advertisements positively (by buying the product) or negatively (by not buying it).
There are 400 observations and 5 attributes: User ID, Gender, Age, Estimated Salary and Purchased. Here we need to classify the response of the users: whether or not they purchased the model after seeing the advertisement.
The dataset as well as the template code has been shared for reference.
(Do not forget to set the working directory)
Data preprocessing includes importing, reading, viewing, cleaning and splitting the data before applying any technique.
# Reading the dataset
dataset = read.csv('Social_Network_Ads.csv')
# Excluding the redundant attributes (keep Age, EstimatedSalary, Purchased)
dataset = dataset[3:5]
View(dataset)
After getting acquainted with the data, the next step is to split the whole dataset into a training set and a test set. On the training set we train the machine, and on the test set the machine predicts the results. A commonly used SplitRatio is 0.75, i.e. the training set receives roughly ¾ of the total observations. The more observations available for training, the better the machine's predictions on the test set tend to be.
# Splitting the dataset into the Training set and Test set
install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
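For intuition, the 75/25 split can also be sketched in base R (a hypothetical alternative to sample.split; note that caTools additionally keeps the class proportions of Purchased balanced across the two sets, which plain sampling does not guarantee):

```r
set.seed(123)                        # reproducible split
n <- 400                             # number of observations in the dataset
train_idx <- sample(n, size = round(0.75 * n))  # random row indices for training
length(train_idx)                    # 300 rows go to the training set
n - length(train_idx)                # 100 rows remain for the test set
```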
# Feature Scaling (scale Age and EstimatedSalary; column 3 is the target)
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
Feature scaling basically means putting the variables on the same scale. Scaling the data is necessary because otherwise one variable can dominate another, causing the smaller-scale variable to be neglected. Here, without feature scaling, EstimatedSalary would dominate the other variable, Age.
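What scale() does can be seen on a small made-up vector: it subtracts the mean and divides by the standard deviation, so every scaled column ends up with mean 0 and standard deviation 1:

```r
ages <- c(20, 30, 40, 50)   # made-up Age values for illustration
scaled <- scale(ages)       # centre to mean 0, rescale to sd 1
mean(scaled)                # 0 (up to floating-point error)
sd(scaled)                  # 1
```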
# Fitting the SVM classifier to the Training set
install.packages('e1071')
library(e1071)
classifier = svm(formula = Purchased ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3])
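A simple way to check how well the predictions match the true test labels is a confusion matrix. The vectors below are made up for illustration (on the real data you would compare test_set[, 3] against y_pred):

```r
# Hypothetical actual vs. predicted labels (0 = not purchased, 1 = purchased)
actual <- c(0, 0, 1, 1, 0, 1, 0, 1)
pred   <- c(0, 1, 1, 1, 0, 0, 0, 1)
cm <- table(actual, pred)            # rows: truth, columns: prediction
accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
accuracy                             # 0.75: six of the eight labels match
```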
# Visualising the Training set results
library(ElemStatLearn)  # archived on CRAN; may need to be installed from the archive
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
     main = 'SVM (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualising the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
     main = 'SVM (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
Thus, we learned the inner workings of SVM and visualized the results using code in R. Feel free to comment with your doubts. Happy learning 🙂
Visualising data enchants me, and so does reading books.
Articulate communicator, Data Science Enthusiast and Future Data Scientist.
LinkedIn - linkedin.com/in/vaishali-sagar-a07831142/