I hope you’re having a good time learning about machine learning. Welcome to part 2 of ML algorithm series where I’ll be discussing one of the most famous and conceptual algorithm known as the Decision Tree Algorithm.
Decision Tree is a fairly basic term. Imagine a tree which has a lot of branches just like the one in the figure. You are sitting on the top and wish to come down. The only catch is that if you reach a branch having no ends, you have to jump down from there. As you progress down you decide on which branch to choose at each level by taking the help of some deciding factors that you decide upon. This is the fundamental of a decision tree.
Every branch has a subsequent YES branch and a NO branch and you have to choose which one to jump on. Let’s call all the branches as NODES and the branches that have no other sub branches as LEAF NODES.
LEAF nodes represent labels i.e. the final result derived from our traversal of the tree.
A decision tree works on how you place the nodes in the tree, which node is given a higher priority and what all nodes satisfy the conditions required. There are 2 popular methods of deciding priority for node named Information Gain and the Gini index which will be discussed in a special article with all the mathematical expressions for the same.
Here’s an example for buying a car. Here we select a few features like Road Tested, Mileage etc. to determine whether the car is good for buying or not and assign each of them a priority.
This is the basic logic behind Decision Tree. This algorithm is very useful as it is the basic foundation of all our decision making and also helps in generating new algorithms like Random Forest Classifier by following the same fundamental rules.
Now, let us try to implement this algorithm on a given dataset. Please note that implementation does not require the mathematical aspect of this algorithm.
Lets import all the neccesary modules in our iPython Notebook file!
import numpy as np import pandas as pd from sklearn.cross_validation import train_test_split from sklearn.tree import DecisionTreeClassifier from sklearn.neighbors import NearestNeighbors from sklearn.metrics import accuracy_score from sklearn import tree
Now, we import out dataset into the notebook. The dataset we are using is the Balance Scale dataset which can is available at
https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data (just download the balance-scale.data file from the Data Folder and rename it to DecisionTree.data.txt).
NOTE-Open the DecisionTree.data.txt in a text editor and add the line “Label,A,B,C,D” (without the quotes) at the beginning of the file. This will work as the headlines for the different columns!
Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight(A), the left distance(B), the right weight(C), and the right distance(D).
After importing,we separate out the features and the labels of our data.We store the features in a numpy array(balance_data.values gives us the data in a numpy array form) and name it X and the labels are stored in another numpy array labelled y.
X=balance_data.values[:,1:5] #This is known as numpy array splitting! y=balance_data.values[:,0]
We use the method of Cross Validation under the sklearn module to split our data into two parts i.e. the training part on which our model will be trained and the testing part where our model would be checked for accuracy.
We construct a classifer that uses Decision Tree algorithm which is already provided to us in the sklearn module. The ‘fit’ instance trains the classifier with the training data and ‘score’ instance calcualtes the accuracy of our model.
We use the criterion as ‘entropy’ to use the information gain method. Gini index is the default method and you can skip the criterion part inside the brackets if you want to use gini index
clf=DecisionTreeClassifier(criterion='entropy') clf.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
We construct a new example that contains data of a balance scale that is likely to tip towards “Right” and use the ‘predict’ instance to analyse and find the result.
example=np.array([[1,4,3,2]]) prediction=clf.predict(example) print(prediction)
Happy Learning 🙂