# SOFTMAX REGRESSION

In this article, we will learn a relatively uncommon machine learning regression algorithm called Softmax Regression.

Softmax regression can be seen as an extension of logistic regression, hence it also comes under the category of ‘classification algorithms’.

In a logistic regression model, the outcome or ‘y’ can take on binary values 0 or 1. However in softmax regression, the outcome ‘y’ can take on multiple values.

Understanding how softmax regression actually works involves a fair bit of Mathematics. It involves concepts like partial differentiation, maximum likelihood function, gradient descent and matrix multiplication. We will begin by understanding how parameters of this regression are found, and how the hypothesis function h(theta) is found to make predictions for new values of the attributes. In the latter part, we will translate our understanding into code and implement it on the famous ‘iris’ dataset for classifying flowers into one of three categories.

Interestingly, the sklearn module in Python does not provide any class for softmax regression, unlike it does for linear and logistic regression. Let this be a reminder to you to not rely on libraries too much for implementing your machine learning algorithms. Don’t be afraid to write code for your algorithm on your own. Let us begin.

### Generalised Linear Models

Before getting into our model, let’s talk about generalised linear models. Generalised linear models(GLMs) are a larger class of models which follow exponential distribution of othe output variable ‘y’ with respect to a set of input features ‘x’. Linear regression, logistic regression and softmax regression all fall into this category of models and can be derived from a general linear model.

The exponential family of distributions is given by

Here, η is called the natural parameter
of the distribution; T(y) is the sufficient statistic and a(η) is the log
partition function.

The Bernoulli and Gaussian distributions are examples of this distribution. A fixed choice of T, a and b defines a family (or set) of distributions that
is parameterized by η; as we vary η, we then get different distributions within
this family

### Constructing Softmax regression as GLM

In our multinomial case, the variable y can take on any value from 1 to k so y ∈ {1, 2, . . . , k}. To parameterize our model we can use k parameters {φ1,φ2,. . . , φk} specifying the probability of each of the outcomes. It should be noted that since these are probabilities of k different outcomes

To express the multinomial as an exponential family distribution, we will
define

as follows

This implies T(y) is a k-1 dimensional vector. We will write (T(y))i to denote the i-th element of the vector T(y).

Before we continue, let us get ourselves accustomed with a function called the indicator function given by 1{.}. It gives output 1 if the statement inside the parenthesis is true and 0 otherwise. For instance, 1{True} =1 ,1{2 = 3} = 0 and 1{3 = 5 − 2} = 1.
So, we can also write the relationship between T(y) and y as

This only means that the ith element of the vector T(y) is 1 only if y is equal to i.

Therefore, we can write the probability distribution P(y;φ) as the product of probabilities of each of the outcomes, where each probability is raised to the power 1 if that outcome is occurs and 0 if that outcome does not occur.

We compare this with the general formula of an exponential distribution to get

So we get the link function between η and φ.

where i goes from (1,2….k)

Taking exponential of this equation on and summing from 1 to k on both sides we get,

We can substitute the value of this into the previous equation to get

This function is called the softmax function.

All this might seem overwhelming at first, but stay with me here. Work all this out with a pen and paper and you will see that this is nothing but basic algebra!

### Finding parameters of our model and building hypothesis function

Since this is a generalised linear model, it is correct to assume that that the ηi’s are linearly related to the x’s. So, we have

for i = (1, . . . , k) and where x is a vector of input attributes or features

Hence, our model assumes that the conditional distribution of y
given x is given by

This model, which applies to classification problems where y ∈ {1, . . . , k}, is
called softmax regression.

θ is represented as a k-by-(n + 1) matrix obtained by stacking up the θ in rows. Here n is the no. of attributes or features in x and a column of 1’s is added to represent the intercept of a line.

For parameter fitting or finding θ’s we can minimise the softmax cost function J(θ) of softmax regression just like we used to do in linear and logistic regression.

This can be written as

Now, we can use an iterative method such as gradient descent to minimize this cost function and obtain our parameters.

The gradient of a the cost function is given by taking its derivative. This function uses all the training examples (m is the number of examples in the dataset) where

is the ith example.

Thus parameters are given by,

where α is the learning parameter and j goes from (1,2……..k). Also remember that, θj is a vector with (n+1) values,

Now for finding the hypothesis function of our model or the function that predicts the output given input features, we will find E[T(y)|x; θ] which means the expectation of T(y) given x parameterized by θ.

Hypothesis is a k X 1 vector which shows the probability of obtaining every outcome yi given x, for every value of i = (1,2,. . . , k) . Thus by finding the maximum probability from the hypothesis function and obtaining its corresponding yi we can predict the outcome for a given set of x’s.

## Implementing softmax regression on our dataset

That’s a lot of mathematics. Let’s implement our algorithm now.

In [1]:
import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split
import math
import warnings
warnings.filterwarnings('ignore')


First, let’s write a function for calculating φi’s. It takes in the parameter matrix θ, and a vector of features x.

In [2]:
def phi(i,theta,x):  #i goes from 1 to k
mat_theta = np.matrix(theta[i])
mat_x = np.matrix(x)
num = math.exp(np.dot(mat_theta,mat_x.T))
den = 0
for j in range(0,k):
mat_theta_j = np.matrix(theta[j])
den = den + math.exp(np.dot(mat_theta_j,mat_x.T))
phi_i = num/den
return phi_i


Next, we implement the indictor function.

In [3]:
def indicator(a,b):
if a == b: return 1
else: return 0


While finding our paramaters, we use the gradient descent function. We calculate the derivative of the gradient in each iteration. So we implement the two functions, one for calculating the derivative of gradient and the other for running the gradient descent method. We give the default values for the learning parameter alpha and the number of iterations. Here m is the noumber of training examples in our dataset.

In [4]:
def get__der_grad(j,theta):
sum = np.array([0 for i in range(0,n+1)])
for i in range(0,m):
p = indicator(y[i],j) - phi(j,theta,x.loc[i])
sum = sum + (x.loc[i] *p)

In [5]:
def gradient_descent(theta,alpha= 1/(10^4),iters=500):
for j in range(0,k):
for iter in range(iters):
theta[j] = theta[j] - alpha * get__der_grad(j,theta)
print('running iterations')
return theta


Finally, we define our hypothesis function which will give us the probability of obtaining each outcome given a vector of features x.

In [6]:
def h_theta(x):
x = np.matrix(x)
h_matrix = np.empty((k,1))
den = 0
for j in range(0,k):
den = den + math.exp(np.dot(thetadash[j], x.T))
for i in range(0,k):
h_matrix = h_matrix/den
return h_matrix


As all the functions are now defined,it is time to read in the dataset.

In [7]:
iris = pd.read_csv('iris.data.txt',header=None)
iris.columns = ['sepal_length','sepal_width','petal_length','petal_width','class']


As usual, we divide the dataset into training and test sets.

In [8]:
train, test = train_test_split(iris, test_size = 0.3)# in this our main data is split into train and test
train = train.reset_index()
test = test.reset_index()


We assign the features part of our training dataset to x. Then we define n as the number of attributes or features and m as the number of training examples.

In [9]:
x = train[['sepal_length','sepal_width','petal_length','petal_width']]
n = x.shape[1]
m = x.shape[0]


The outcome or ‘class’ part in our training data is assigned to y and the different classes of flowers are replaced with numeric values 0,1 and 2. k is the number of outcomes possible for y.

In [10]:
y = train['class']
k = len(y.unique())
y =y.map({'Iris-setosa':0,'Iris-versicolor':1,'Iris-virginica':2})
y.value_counts()

Out[10]:
0    38
2    35
1    32
Name: class, dtype: int64

We add a column to x which has only 1’s in it to signify the intercept part of the lines that would be fitted for our model

In [11]:
x[5] = np.ones(x.shape[0])
x.shape

Out[11]:
(105, 5)

We iniitalise theta as an empty (k X n+1) matrix

In [12]:
theta = np.empty((k,n+1))


All is done, and we can start training our algorithm. To find parameters, theta we apply the function gradient descent on our empty matrix theta and assign the obtained parameter matrix to theta_dash.

In [13]:
theta_dash = gradient_descent(theta)

running iterations


We have finally obtained the parameter’s for our model

In [14]:
theta_dash

Out[14]:
array([[ 0.54750154,  1.46069551, -2.24366996, -1.0321951 ,  0.32658186],
[ 0.76749424, -0.27807236, -0.57695025, -1.08978552,  0.30959322],
[-0.90090227, -0.79051953,  1.31002273,  1.09595382, -0.45057825]])

Now, let’s apply our model onto our test set and get our accuracy.

In [15]:
x_u = test[['sepal_length','sepal_width','petal_length','petal_width']]
n = x_u.shape[1]
m = x_u.shape[0]


y_true are the true outcomes in out test set. Let us modify them like we did with the training set and assign numeric values to it.

In [16]:
y_true = test['class']
k = len(y_true.unique())
y_true =y_true.map({'Iris-setosa':0,'Iris-versicolor':1,'Iris-virginica':2})
y_true.value_counts()

Out[16]:
1    18
2    15
0    12
Name: class, dtype: int64

Again, like we did while training we add a column of one’s to x_u.

In [17]:
x_u[5] = np.ones(x_u.shape[0])
x_u.shape

Out[17]:
(45, 5)

We apply the algorithm to every row in our test set or x_u using our hypothesis function and find the maximum value of the obtained probabilities and its correspning index in the obtained probability matrix. The index of the maximum value is our predicted value for that particular row in our test set.

In [18]:
for index,row in x_u.iterrows():
h_matrix = h_theta(row)
prediction = int(np.where(h_matrix == h_matrix.max())[0])
x_u.loc[index,'prediction'] = prediction


We can now assign our predicted values, actual values and the features in our test set to a dataframe called results.

In [19]:
results = x_u
results['actual'] = y_true

In [20]:
results.head(10)

Out[20]:
sepal_length sepal_width petal_length petal_width 5 prediction actual
0 5.7 2.8 4.1 1.3 1.0 1.0 1
1 6.3 2.8 5.1 1.5 1.0 2.0 2
2 6.5 3.0 5.8 2.2 1.0 2.0 2
3 6.0 2.7 5.1 1.6 1.0 2.0 1
4 5.8 2.7 3.9 1.2 1.0 1.0 1
5 4.7 3.2 1.6 0.2 1.0 0.0 0
6 6.1 2.9 4.7 1.4 1.0 1.0 1
7 4.6 3.6 1.0 0.2 1.0 0.0 0
8 6.4 3.2 4.5 1.5 1.0 1.0 1
9 6.1 2.6 5.6 1.4 1.0 2.0 2

In the next few lines, we find the accuracy and pray it is good.

In [21]:
compare = results['prediction'] == results['actual']
correct = compare.value_counts()[1]
accuracy = correct/len(results)

In [22]:
accuracy * 100

Out[22]:
95.555555555555557

Great, our accuracy looks amazing.

I hope it felt good to learn and implement algorithm which is not talked about a lot. I’ll be happy to answer any doubts you may have in the comments below. Happy Learning.

### Saksham Malhotra

After learning python 2 years ago and dabbling in web development, I encountered data science and felt 'Yes, this is what I want to do.' I strongly believe that the world can be changed through the power of data. Among my other interests, I love reading fiction and owning more books than I can possibly read at once.