
Introduction to Feature Engineering


Introduction to Feature Engineering with Stock Prices Prediction

This tutorial introduces what we call “Feature Engineering”.

You probably know that “features” are the attributes of the data. They typically correspond to the independent columns of the data set.
Feature Engineering is the process of deriving more knowledge from the already known data.
It consists of 2 steps:

  1. Feature Transformation
  2. Feature Creation

In Feature Transformation, the existing features are transformed so that they are more meaningful to the machine learning algorithm.
In Feature Creation, we develop new features from the existing ones with the aim of achieving higher accuracy.
In this tutorial, we will focus mainly on Feature Creation.
We will first solve the problem without feature engineering, and then include it and see what difference it makes.
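
As a rough illustration of the two steps, here is a minimal sketch on a made-up three-row table. The column values and the names "Range" and "Low_scaled" are assumptions chosen only for this example; they are not part of the Tesla data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Tesla file
toy = pd.DataFrame({"High": [11.0, 12.0, 15.0], "Low": [10.0, 10.0, 12.0]})

# Feature creation: derive a new column from existing ones
toy["Range"] = toy["High"] - toy["Low"]

# Feature transformation: rescale an existing column to zero mean and unit variance
toy["Low_scaled"] = (toy["Low"] - toy["Low"].mean()) / toy["Low"].std(ddof=0)

print(toy)
```

The first new column adds information the model did not see directly, while the second only changes how an existing column is expressed.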

We will try to predict the stock price of Tesla Motors one month ahead, using daily data from June 2010 to March 2017.

You can download the data from here

Let’s begin by importing the basic libraries.

In [1]:
import pandas as pd
import numpy as np

We will read the .csv file into a pandas DataFrame using the read_csv() function.
Please note that you may need to pass in the complete path of the file.
To check that it worked, we will print the head of the DataFrame.

In [2]:
df = pd.read_csv("Tesla_stocks.csv")
df.head()
Out[2]:
Date Open High Low Close Volume Adj Close
0 6/29/2010 19.000000 25.00 17.540001 23.889999 18766300 23.889999
1 6/30/2010 25.790001 30.42 23.299999 23.830000 17187100 23.830000
2 7/1/2010 25.000000 25.92 20.270000 21.959999 8218800 21.959999
3 7/2/2010 23.000000 23.10 18.709999 19.200001 5139800 19.200001
4 7/6/2010 20.000000 20.00 15.830000 16.110001 6866900 16.110001
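
If you want to try read_csv() without the file at hand, the sketch below reads two of the rows shown above from an in-memory string. It assumes the file is comma-separated with the header shown in the output. The parse_dates argument is optional and is not used in the rest of this tutorial; it is included only to show how the Date strings could be turned into timestamps.

```python
import io
import pandas as pd

# Two rows copied from the output above, stored in memory instead of a file
csv_text = """Date,Open,High,Low,Close,Volume,Adj Close
6/29/2010,19.000000,25.00,17.540001,23.889999,18766300,23.889999
6/30/2010,25.790001,30.42,23.299999,23.830000,17187100,23.830000
"""

# parse_dates converts the Date strings into real timestamps
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(sample.dtypes)
```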

In every data science problem, it is essential to thoroughly understand each column. We have seven columns here, corresponding to the Date, the Opening price on that day, the Highest price, the Lowest price, the Closing price, the Volume of stocks traded that day, and the adjusted Closing price.

Let us define our “Label” column, which is the outcome or the prediction column. Our label will be the adjusted closing price of the stock 30 trading days later (each row is one trading day).

This can be done by this code:

In [3]:
df["PriceNextMonth"] = df["Adj Close"].shift(-30)

We have defined a new column “PriceNextMonth” and assigned it the values of “Adj Close” shifted thirty places upwards. Each row now holds the “Adj Close” value from thirty rows (trading days) later, which is just what we want.
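
To see what a negative shift does, here is a minimal sketch on a made-up five-row series, using a shift of 2 instead of 30 so the effect is easy to read. The prices and the column name "PriceIn2Rows" are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical five-row price series
prices = pd.DataFrame({"Adj Close": [10.0, 11.0, 12.0, 13.0, 14.0]})

# shift(-2) moves every value two rows up, so row i holds the price from row i + 2
prices["PriceIn2Rows"] = prices["Adj Close"].shift(-2)
print(prices)
```

The last two rows have nothing to pull from, so they end up without a value.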

Let’s check our DataFrame.

In [4]:
df.head()
Out[4]:
Date Open High Low Close Volume Adj Close PriceNextMonth
0 6/29/2010 19.000000 25.00 17.540001 23.889999 18766300 23.889999 17.900000
1 6/30/2010 25.790001 30.42 23.299999 23.830000 17187100 23.830000 17.600000
2 7/1/2010 25.000000 25.92 20.270000 21.959999 8218800 21.959999 18.320000
3 7/2/2010 23.000000 23.10 18.709999 19.200001 5139800 19.200001 18.780001
4 7/6/2010 20.000000 20.00 15.830000 16.110001 6866900 16.110001 19.150000

There we see our Label column in place.
However, notice that as we have shifted the values thirty places upward, there will be no values in the label column for the last thirty rows. This can be checked using the tail() function, which outputs the last five rows.

In [5]:
df.tail()
Out[5]:
Date Open High Low Close Volume Adj Close PriceNextMonth
1687 3/13/2017 244.820007 246.850006 242.779999 246.169998 3010700 246.169998 NaN
1688 3/14/2017 246.110001 258.119995 246.020004 258.000000 7575500 258.000000 NaN
1689 3/15/2017 257.000000 261.000000 254.270004 255.729996 4816600 255.729996 NaN
1690 3/16/2017 262.399994 265.750000 259.059998 262.049988 7100400 262.049988 NaN
1691 3/17/2017 264.000000 265.329987 261.200012 261.500000 6475900 261.500000 NaN

Just as we thought!
In case you are wondering, NaN stands for “Not a Number”; pandas uses it to mark missing values.
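
A quick way to count such missing entries is sketched below on a small made-up Series (the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

labels = pd.Series([1.5, 2.5, np.nan, np.nan])  # hypothetical label column

# NaN never compares equal to anything, including itself, so use isna() to find it
print(np.nan == np.nan)      # False
print(labels.isna().sum())   # number of rows without a label
```

On the real DataFrame, the same isna().sum() call on the label column should report thirty missing values.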

So now, we will divide our DataFrame into two: one for which we have the labels, and another for which we will predict the labels.
Let’s start by making separate arrays for the Date column.

In [6]:
dates = np.array(df["Date"])
dates_check = dates[-30:]
dates = dates[:-30]

We have used the slicing technique here. You can see that the array “dates” contains all those dates for which we have the value for the label, and dates_check contains the last thirty dates, for which the label will be predicted.
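
The two slices used above can be tried on a tiny array first. This is a minimal sketch with made-up entries and a cut of 2 instead of 30.

```python
import numpy as np

days = np.array(["d1", "d2", "d3", "d4", "d5"])  # hypothetical dates

last_two = days[-2:]   # the final two entries
the_rest = days[:-2]   # everything except the final two
print(the_rest, last_two)
```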

Now, we will make the Feature and Label arrays.

In [7]:
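X = np.array(df.drop(["PriceNextMonth", "Date"], axis=1))  # features only
X_Check = X[-30:]  # last thirty rows, to be forecast later
X = X[:-30]  # rows that have a label
df.dropna(inplace = True)  # drop the thirty unlabeled rows
y = np.array(df["PriceNextMonth"])  # labels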

X is our feature array. We have taken all the columns except “PriceNextMonth”, which is the label column, and “Date”, which is unlikely to help in predicting the price of the stock.
We will use X_Check to store the features of the last thirty rows, and X will store the rest.

After that, we have deleted the rows that contain NaN values, that is, the last thirty rows of our DataFrame.

Then we have made our label array and called it y.
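
Because the features and labels are cut in separate steps, it is worth checking that they still describe the same rows. The sketch below repeats the same steps on a made-up five-row frame with a shift of 2; the names toy, n_ahead, features and labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

n_ahead = 2  # stands in for the 30 rows used in the tutorial
toy = pd.DataFrame({"Adj Close": [10.0, 11.0, 12.0, 13.0, 14.0]})
toy["Label"] = toy["Adj Close"].shift(-n_ahead)

features = np.array(toy.drop(["Label"], axis=1))
features_check = features[-n_ahead:]   # rows without a label yet
features = features[:-n_ahead]         # rows that have a label
labels = np.array(toy.dropna()["Label"])

# Both arrays must describe the same rows
assert len(features) == len(labels)
print(features.shape, labels.shape)
```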

Now we will move onto the Machine Learning part. Let’s import the required libraries.

In [8]:
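# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides the same train_test_split
from sklearn import model_selection as cross_validation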
from sklearn.ensemble import RandomForestRegressor

We will use the Random Forest algorithm, which is implemented in the scikit-learn library (sklearn).
But first, we will divide our data set into training data and testing data.
Training data will be used for training our machine learning model, and testing data will be used for testing its accuracy.
This division is done by the train_test_split() function.

In [9]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size = 0.2)
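
To see what the split returns, here is a minimal sketch on ten made-up samples. The arrays and the random_state value are assumptions for illustration; random_state is optional and only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # 10 hypothetical samples, 2 features
y_toy = np.arange(10)

# random_state fixes the shuffle so repeated runs give the same split
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)
```

With test_size = 0.2, two of the ten rows go to the testing set, and each feature row stays paired with its own label.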

We will define our model and use the fit() function to train it on the training data.

In [10]:
model = RandomForestRegressor()
model.fit(X_train, y_train)
Out[10]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

Now, let us check how well the model performs on the testing data. Note that for a regressor, score() reports R², not a classification accuracy.

In [11]:
conf = model.score(X_test, y_test)
print(conf)
0.946946304388
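
As a minimal sketch of what R² (the coefficient of determination reported by a regressor's score()) measures, the arithmetic can be written out with NumPy. The two arrays are made-up numbers, not results from the model above.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.5, 7.0, 9.0])   # hypothetical predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of always predicting the mean
r2 = 1 - ss_res / ss_tot
print(r2)
```

A value of 1.0 means the predictions match exactly, and 0.0 means the model does no better than always predicting the mean.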

Woah! That’s a pretty amazing score, given that we didn’t even use Feature Engineering!
Let’s see if we can get a better result by bringing it into the picture.
For that, we will again read the .csv file into a DataFrame.

In [12]:
df = pd.read_csv("Tesla_stocks.csv")

We will define two new columns in our DataFrame, which will be derived from the existing ones.
These two columns hold the percentage change between

  1. the High and Low values
  2. the Close and Open values.

In [13]:
df["HL_Perc"] = (df["High"]-df["Low"]) / df["Low"] * 100
df["CO_Perc"] = (df["Close"] - df["Open"]) / df["Open"] * 100
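
As a quick check of the two formulas, the sketch below applies them to the first row of the data shown earlier (6/29/2010). The one-row DataFrame named row is an assumption for illustration.

```python
import pandas as pd

# First row of the data shown earlier
row = pd.DataFrame({"Open": [19.0], "High": [25.0], "Low": [17.540001], "Close": [23.889999]})

row["HL_Perc"] = (row["High"] - row["Low"]) / row["Low"] * 100
row["CO_Perc"] = (row["Close"] - row["Open"]) / row["Open"] * 100
print(row[["HL_Perc", "CO_Perc"]].round(2))
```

On that day the high was roughly 42.5% above the low, and the close was roughly 25.7% above the open.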

We will only keep the desired columns in our DataFrame and discard others.

In [14]:
df = df[["HL_Perc", "CO_Perc", "Adj Close", "Volume"]].copy()  # copy avoids a SettingWithCopyWarning when adding the label column

As this is a new DataFrame, we will again have to define our label column.

In [15]:
df["PriceNextMonth"] = df["Adj Close"].shift(-30)

We will make our Feature and Label arrays again.

In [22]:
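X = np.array(df.drop(["PriceNextMonth"], axis=1))
X_Check = X[-30:]
X = X[:-30]
df.dropna(inplace = True)  # removes the last thirty rows, which have no label
y = np.array(df["PriceNextMonth"])  # no extra [:-30] here, or X and y would differ in length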

We will import a new library for scaling the features, i.e., Feature Transformation.

In [25]:
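from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X)  # same standardization as preprocessing.scale(X)
X = scaler.transform(X)
X_Check = scaler.transform(X_Check)  # scale the forecast rows with the same statistics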
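
The scaling step standardizes each column to zero mean and unit variance. A minimal sketch of the same arithmetic in plain NumPy, on a made-up array with two columns of very different magnitude, looks like this:

```python
import numpy as np

# Hypothetical features on very different scales (percent change, volume)
X_toy = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])

# Subtract each column's mean and divide by its standard deviation
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, both columns are on the same footing, so a large-valued column such as Volume no longer dominates purely because of its units.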

Finally, we will divide the data into training and testing, fit our model, and then test our model using these new features.

In [26]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size = 0.2)

model.fit(X_train, y_train)

conf = model.score(X_test, y_test)
print(conf)
0.962954653359

BOOM!
The score increased from about 94.7% to about 96.3%!
That’s what feature engineering does!

If you would like to plot these results in a beautiful graph, you can use the following code.

In [27]:
#import matplotlib
import matplotlib.pyplot as plt

#Set the style
plt.style.use("fivethirtyeight")

#Fit the model again using the whole data set
model.fit(X,y)

#Make predictions
predictions = model.predict(X_Check)

#Make the final DataFrame containing Dates, ClosePrices, and Forecast values
actual = pd.DataFrame(dates, columns = ["Date"])
actual["ClosePrice"] = df["Adj Close"]
actual["Forecast"] = np.nan
actual.set_index("Date", inplace = True)
forecast = pd.DataFrame(dates_check, columns=["Date"])
forecast["Forecast"] = predictions
forecast["ClosePrice"] = np.nan
forecast.set_index("Date", inplace = True)
var = [actual, forecast]
result = pd.concat(var)  #This is the final DataFrame


#Plot the results
result.plot(figsize=(20,10), linewidth=1.5)
plt.legend(loc=2, prop={'size':20})
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

The blue line shows the available data, and the red line shows the predicted stock prices for the next thirty trading days.
However, I will not discuss Data Visualization in this tutorial. Stay updated for one on visualization soon.
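
The plotting code draws two separate lines because the two frames it stacks fill different columns. Here is a minimal sketch of that stacking on made-up values; the index labels and prices are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Known prices: ClosePrice filled, Forecast empty
actual = pd.DataFrame({"ClosePrice": [10.0, 11.0], "Forecast": np.nan}, index=["d1", "d2"])
# Future rows: Forecast filled, ClosePrice empty
forecast = pd.DataFrame({"ClosePrice": np.nan, "Forecast": [12.0]}, index=["d3"])

result = pd.concat([actual, forecast])
print(result)
```

Each column has values only in its own stretch of rows, so the plot shows the history and the forecast as two lines that meet end to end.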

Feel free to post your questions in the comments.
Happy Learning 🙂


Pranav Gupta

Co-Founder at DataScribble
An always cheerful and optimistic guy, with a knack for achieving the set target at any cost.
I am an avid learner and never shy off from working hard or working till late. I am also a passionate reader, and love to read thriller novels, Jeffrey Archer being the favorite writer.
LinkedIn: https://www.linkedin.com/in/pranav-gupta-284141126/


3 Comments

  1. In feature Engineering section, change this y = np.array(df[“PriceNextMonth”][:-30]) to y = np.array(df[“PriceNextMonth”]), It raises length mismatch error.
