# SOFTMAX REGRESSION

In this article, we will learn about a machine learning algorithm that is not discussed as often as it deserves: Softmax Regression.

Softmax regression can be seen as an extension of logistic regression, hence it also comes under the category of ‘classification algorithms’.

In a logistic regression model, the outcome ‘y’ can take on one of two values, 0 or 1. In softmax regression, however, ‘y’ can take on any one of k values.

Understanding how softmax regression actually works involves a fair bit of mathematics. It involves concepts like partial differentiation, maximum likelihood estimation, gradient descent and matrix multiplication. We will begin by understanding how the parameters of this regression are found, and how the hypothesis function h_θ(x) is built to make predictions for new values of the attributes. In the latter part, we will translate our understanding into code and implement it on the famous ‘iris’ dataset, classifying flowers into one of three categories.

Interestingly, the sklearn module in Python does not provide a dedicated class for softmax regression the way it does for linear and logistic regression (its LogisticRegression estimator can handle the multinomial case, but here we will build everything from scratch). Let this be a reminder not to rely too heavily on libraries when implementing your machine learning algorithms. Don’t be afraid to write the code for an algorithm on your own. Let us begin.

### Generalised Linear Models

The exponential family of distributions is given by

$$p(y; \eta) = b(y)\exp\left(\eta^{T} T(y) - a(\eta)\right)$$

Here, η is called the natural parameter of the distribution, T(y) is the sufficient statistic and a(η) is the log partition function.

The Bernoulli and Gaussian distributions are examples of this family. A fixed choice of T, a and b defines a family (or set) of distributions that is parameterized by η; as we vary η, we get different distributions within this family.
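To make this definition concrete, here is a small numerical check (the function names are mine, not part of the article) that the Bernoulli distribution fits the exponential-family form with T(y) = y, b(y) = 1, η = log(φ/(1−φ)) and a(η) = log(1 + e^η):

```python
import math

def bernoulli_pmf(y, phi):
    # Direct Bernoulli probability: phi^y * (1 - phi)^(1 - y)
    return phi ** y * (1 - phi) ** (1 - y)

def exp_family_pmf(y, phi):
    # The same probability written as b(y) * exp(eta * T(y) - a(eta))
    eta = math.log(phi / (1 - phi))    # natural parameter
    a = math.log(1 + math.exp(eta))    # log partition function
    return math.exp(eta * y - a)       # with b(y) = 1 and T(y) = y

for y in (0, 1):
    assert abs(bernoulli_pmf(y, 0.3) - exp_family_pmf(y, 0.3)) < 1e-12
```

Working this out by hand for the Gaussian is a similarly instructive exercise.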

### Constructing Softmax regression as GLM

Consider a classification setting where the outcome y can take any one of k values, so y ∈ {1, 2, …, k}. We model it as a multinomial distribution with parameters φ₁, …, φₖ, where φᵢ = p(y = i) and the φᵢ’s sum to 1. To express the multinomial as an exponential family distribution, we define T(y) as

$$T(1)=\begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix},\quad T(2)=\begin{bmatrix}0\\1\\\vdots\\0\end{bmatrix},\quad \ldots,\quad T(k-1)=\begin{bmatrix}0\\0\\\vdots\\1\end{bmatrix},\quad T(k)=\begin{bmatrix}0\\0\\\vdots\\0\end{bmatrix}$$

This implies T(y) is a (k−1)-dimensional vector. We will write (T(y))ᵢ to denote the i-th element of the vector T(y).

So, we can also write the relationship between T(y) and y as

$$(T(y))_i = 1\{y = i\}$$

This only means that the i-th element of the vector T(y) is 1 only if y is equal to i. (Here 1{·} is the indicator function, which equals 1 when its argument is true and 0 otherwise.)

We compare this with the general formula of an exponential family distribution to get the link function between η and φ:

$$\eta_i = \log\frac{\phi_i}{\phi_k}$$

where i goes from 1 to k (for convenience we also define η_k = log(φ_k/φ_k) = 0).

Taking the exponential of this equation and summing from 1 to k on both sides, we get

$$e^{\eta_i} = \frac{\phi_i}{\phi_k} \quad\Rightarrow\quad \phi_k\sum_{j=1}^{k} e^{\eta_j} = \sum_{j=1}^{k}\phi_j = 1 \quad\Rightarrow\quad \phi_k = \frac{1}{\sum_{j=1}^{k} e^{\eta_j}}$$

We can substitute this back into the previous equation to get

$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}$$

This function is called the **softmax function.**
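To see the softmax function in action, here is a minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def softmax(eta):
    # Subtracting the max is a standard numerical-stability trick;
    # it cancels in the ratio, so the probabilities are unchanged.
    e = np.exp(eta - np.max(eta))
    return e / e.sum()

phi = softmax(np.array([2.0, 1.0, 0.1]))
# phi is a vector of valid class probabilities: all positive, summing to 1,
# with the largest probability at the index of the largest eta.
```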

All this might seem overwhelming at first, but stay with me here. Work all this out with a pen and paper and you will see that this is nothing but basic algebra!

### Finding parameters of our model and building hypothesis function

Since this is a generalised linear model, we assume that the ηᵢ’s are linearly related to the x’s. So, we have

$$\eta_i = \theta_i^{T} x$$

for i = 1, …, k, where x is a vector of input attributes or features.

Hence, our model assumes that the conditional distribution of y given x is given by

$$p(y = i \mid x; \theta) = \phi_i = \frac{e^{\theta_i^{T} x}}{\sum_{j=1}^{k} e^{\theta_j^{T} x}}$$

This model, which applies to classification problems where y ∈ {1, …, k}, is called softmax regression.

θ is represented as a k-by-(n+1) matrix obtained by stacking up the θᵢ’s in rows. Here n is the number of attributes or features in x, and a column of 1’s is added to x to represent the intercept term.

For parameter fitting, i.e. finding the θᵢ’s, we minimise the cost function J(θ) of softmax regression, just like we did in linear and logistic regression. This can be written as

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)} = j\}\,\log\frac{e^{\theta_j^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}}$$
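As a quick sketch (the vectorised layout and variable names are my own, not part of the implementation later in this article), this cost can be computed with NumPy as:

```python
import numpy as np

def cost(theta, X, y_labels):
    # theta: (k, n+1) parameter matrix; X: (m, n+1) with an intercept column;
    # y_labels: (m,) array of integer class labels in 0..k-1
    scores = X @ theta.T                                 # theta_j . x^(i) for every i, j
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical-stability shift
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    m = X.shape[0]
    # Average negative log-probability of each example's true class
    return -log_probs[np.arange(m), y_labels].mean()
```

For θ = 0 every class gets probability 1/k, so the cost is log k — a handy sanity check for any implementation.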

Now, we can use an iterative method such as gradient descent to minimize this cost function and obtain our parameters.

The gradient of the cost function with respect to θⱼ is given by taking its derivative. It uses all m training examples, where x⁽ⁱ⁾ is the i-th example:

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[x^{(i)}\big(1\{y^{(i)} = j\} - p(y^{(i)} = j \mid x^{(i)}; \theta)\big)\Big]$$

Each gradient descent update then takes the form

$$\theta_j := \theta_j - \alpha\,\nabla_{\theta_j} J(\theta)$$

where α is the learning rate and j goes from 1 to k. Also remember that θⱼ is a vector with (n+1) values.
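As a sketch of one such update in vectorised form (the layout and names are mine; the implementation later in this article loops over examples explicitly instead):

```python
import numpy as np

def gradient_step(theta, X, y_labels, alpha):
    # theta: (k, n+1); X: (m, n+1) with an intercept column;
    # y_labels: (m,) integer class labels in 0..k-1
    m, k = X.shape[0], theta.shape[0]
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)     # phi_j for every example
    onehot = np.eye(k)[y_labels]                         # indicator 1{y^(i) = j}
    grad = -(onehot - probs).T @ X / m                   # stacked gradients, shape (k, n+1)
    return theta - alpha * grad
```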

Now, to find the hypothesis function of our model, i.e. the function that predicts the output given the input features, we find E[T(y) | x; θ], the expectation of T(y) given x, parameterized by θ:

$$h_\theta(x) = \begin{bmatrix}\phi_1\\\phi_2\\\vdots\\\phi_k\end{bmatrix} = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^{T} x}}\begin{bmatrix}e^{\theta_1^{T} x}\\e^{\theta_2^{T} x}\\\vdots\\e^{\theta_k^{T} x}\end{bmatrix}$$

The hypothesis is a k × 1 vector which gives the probability of obtaining every outcome yᵢ given x, for every value of i = 1, 2, …, k. Thus, by finding the maximum probability in the hypothesis vector and its corresponding yᵢ, we can predict the outcome for a given set of x’s.
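Putting these pieces together, prediction is just an argmax over the hypothesis vector. A small sketch (the names are mine, assuming a fitted (k × (n+1)) parameter matrix and a feature vector that includes the intercept entry):

```python
import numpy as np

def predict(theta, x):
    # x: feature vector including the intercept term; theta: fitted (k, n+1) matrix
    scores = theta @ x
    e = np.exp(scores - scores.max())   # stable softmax
    h = e / e.sum()                     # hypothesis vector: P(y = i | x) for each class
    return int(np.argmax(h))            # class with the highest probability
```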

## Implementing softmax regression on our dataset

That’s a lot of mathematics. Let’s implement our algorithm now.

```
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import math
import warnings
warnings.filterwarnings('ignore')
```

First, let’s write a function for calculating φi’s. It takes in the parameter matrix θ, and a vector of features x.

```
def phi(i, theta, x):  # i goes from 0 to k-1
    mat_theta = np.matrix(theta[i])
    mat_x = np.matrix(x)
    num = math.exp(np.dot(mat_theta, mat_x.T))
    den = 0
    for j in range(0, k):
        mat_theta_j = np.matrix(theta[j])
        den = den + math.exp(np.dot(mat_theta_j, mat_x.T))
    phi_i = num / den
    return phi_i
```

Next, we implement the indicator function.

```
def indicator(a, b):
    if a == b:
        return 1
    else:
        return 0
```

While finding our parameters, we use the gradient descent method, calculating the derivative of the cost function in each iteration. So we implement two functions: one for calculating the derivative (the gradient) and the other for running the gradient descent method. We give default values for the learning rate alpha and the number of iterations. Here m is the number of training examples in our dataset.

```
def get__der_grad(j, theta):
    total = np.array([0.0 for i in range(0, n + 1)])
    for i in range(0, m):
        p = indicator(y[i], j) - phi(j, theta, x.loc[i])
        total = total + (x.loc[i] * p)
    grad = -total / m
    return grad
```

```
def gradient_descent(theta, alpha=1e-4, iters=500):
    for j in range(0, k):
        for it in range(iters):
            theta[j] = theta[j] - alpha * get__der_grad(j, theta)
        print('running iterations')
    return theta
```

Finally, we define our hypothesis function which will give us the probability of obtaining each outcome given a vector of features x.

```
def h_theta(x):
    x = np.matrix(x)
    h_matrix = np.empty((k, 1))
    den = 0
    for j in range(0, k):
        den = den + math.exp(np.dot(theta_dash[j], x.T))
    for i in range(0, k):
        h_matrix[i] = math.exp(np.dot(theta_dash[i], x.T))
    h_matrix = h_matrix / den
    return h_matrix
```

Now that all the functions are defined, it is time to read in the dataset.

```
iris = pd.read_csv('iris.data.txt',header=None)
iris.columns = ['sepal_length','sepal_width','petal_length','petal_width','class']
```

As usual, we divide the dataset into training and test sets.

```
train, test = train_test_split(iris, test_size=0.3)  # split the data into train and test sets
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
```

We assign the features part of our training dataset to x. Then we define n as the number of attributes or features and m as the number of training examples.

```
x = train[['sepal_length','sepal_width','petal_length','petal_width']]
n = x.shape[1]
m = x.shape[0]
```

The outcome or ‘class’ part in our training data is assigned to y and the different classes of flowers are replaced with numeric values 0,1 and 2. k is the number of outcomes possible for y.

```
y = train['class']
k = len(y.unique())
y =y.map({'Iris-setosa':0,'Iris-versicolor':1,'Iris-virginica':2})
y.value_counts()
```

We add a column of 1’s to x to represent the intercept term of the lines that will be fitted for our model.

```
x[5] = np.ones(x.shape[0])
x.shape
```

We initialise theta as a (k × (n+1)) matrix of zeros.

```
theta = np.zeros((k, n + 1))
```

All is done, and we can start training our algorithm. To find the parameters theta, we apply the gradient_descent function to our matrix theta and assign the obtained parameter matrix to theta_dash.

```
theta_dash = gradient_descent(theta)
```

We have finally obtained the parameters for our model.

```
theta_dash
```

Now, let’s apply our model onto our test set and get our accuracy.

```
x_u = test[['sepal_length','sepal_width','petal_length','petal_width']]
n = x_u.shape[1]
m = x_u.shape[0]
```

y_true holds the true outcomes in our test set. Let us encode them like we did with the training set, mapping each class to a numeric value.

```
y_true = test['class']
k = len(y_true.unique())
y_true =y_true.map({'Iris-setosa':0,'Iris-versicolor':1,'Iris-virginica':2})
y_true.value_counts()
```

Again, like we did while training, we add a column of 1’s to x_u.

```
x_u[5] = np.ones(x_u.shape[0])
x_u.shape
```

We apply our hypothesis function to every row of x_u and find the index of the maximum value in the obtained probability vector. The index of the maximum probability is our predicted class for that particular row of the test set.

```
predictions = []
for index, row in x_u.iterrows():
    h_matrix = h_theta(row)
    predictions.append(int(np.argmax(h_matrix)))
x_u['prediction'] = predictions
```

We can now assign our predicted values, actual values and the features in our test set to a dataframe called results.

```
results = x_u
results['actual'] = y_true
```

```
results.head(10)
```

In the next few lines, we find the accuracy and pray it is good.

```
compare = results['prediction'] == results['actual']
correct = compare.sum()
accuracy = correct/len(results)
```

```
accuracy * 100
```

Great, our accuracy looks amazing.

I hope it felt good to learn and implement an algorithm that is not talked about a lot. I’ll be happy to answer any doubts you may have in the comments below. Happy learning.

### Saksham Malhotra

Linkedin - www.linkedin.com/in/saksham-malhotra-9bb69513b
