Comprehensive Classification Series – Kaggle’s Titanic Problem Part 1: Introduction to Kaggle


Introduction to Kaggle

In this comprehensive series on Kaggle’s famous Titanic data set, we will walk through the complete procedure for solving a classification problem using Python. Whether you are a novice in this field or an expert, you may have come across the Titanic data set: a list of passengers whose details act as the features and whose survival acts as the label.
Before diving into the details of the data, let’s outline our aim for this series. In this first part of a multi-part series, we will focus on what data science problems look like and on some of the most common techniques used to solve them.

This series is not intended to make everyone an expert in data science; rather, it is intended to remove some of the fear and mystery surrounding the field. To be as practical as possible, the series is structured as a walkthrough of entering a Kaggle competition and the steps taken to arrive at the final submission.

What is Kaggle?

For those who do not know, Kaggle is a website that hosts data science problems for an online community of data science enthusiasts to solve. These problems can be anything from predicting cancer based on patient data to sentiment analysis of movie reviews and handwriting recognition. The only thing they all have in common is that solving them requires the application of data science.

The problems on Kaggle come from a range of sources. Some are provided just for fun or for educational purposes, but many are provided by companies that have genuine problems they are trying to solve. As an incentive for Kaggle users to compete, prizes are often awarded for winning these competitions or for finishing in the top few positions. Sometimes the prize is a job or products from the company, but there can also be substantial monetary prizes. Home Depot, for example, is currently offering $40,000 for the algorithm that returns the most relevant search results on homedepot.com.

Despite the large prizes on offer, many people on Kaggle compete simply for practice and experience. The competitions involve interesting problems, and plenty of users submit their scripts publicly, providing an excellent learning opportunity for those just trying to break into the field. There are also active discussion forums full of people willing to provide advice and assistance to other users.

Machine Learning: Classification Problem

Data Scribble’s aim is to help everyone who is new to this field. Although machine learning takes many forms, its main aim is to build predictive models. The form we will focus on here is classification, which is a type of ‘supervised learning’. Classification is the process of assigning records or instances (think rows in a data set) to a specific category in a predetermined set of categories. Think of a problem like predicting which passengers on the Titanic survived (i.e. there are two categories, ‘survived’ and ‘did not survive’) based on their age, class and gender. Because there are only two possible categories, this is a binary classification problem: the output always takes one of two values, such as 1 or 0, yes or no, or true or false.
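To make the idea of binary classification concrete, here is a minimal Python sketch. The passenger records and the hand-written rule are invented purely for illustration; they are not taken from the Kaggle data, and a real model would learn its rule from data rather than have it typed in.

```python
# Hypothetical passenger records, invented for illustration only.
passengers = [
    {"age": 29, "pclass": 1, "sex": "female"},
    {"age": 40, "pclass": 3, "sex": "male"},
    {"age": 8, "pclass": 2, "sex": "male"},
]


def classify(passenger):
    """Assign a passenger to one of two categories.

    Returns 1 for 'survived' and 0 for 'did not survive'.
    The rule below is a made-up assumption, not a trained model.
    """
    if passenger["sex"] == "female" or passenger["age"] < 10:
        return 1
    return 0


# Every record ends up in exactly one of the two predetermined categories.
predictions = [classify(p) for p in passengers]
print(predictions)
```

Whatever rule or algorithm sits inside `classify`, the shape of the problem stays the same: features go in, and one of two labels comes out.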

The Titanic Data

For a supervised learning problem, the main aim is to build a model using the training data set, which is another term worth knowing. The training data contains all the information available to make the prediction, as well as the category each record belongs to. This data is then used to ‘train’ the algorithm to find the most accurate way to classify those records for which we do not know the category.
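As a rough sketch of what ‘training’ means, the snippet below learns the most common outcome for each value of a single feature and then uses that to label new records. The records, the `train` and `predict` helpers, and the choice of a majority-vote rule are all assumptions made for illustration; real algorithms are considerably more sophisticated.

```python
from collections import Counter, defaultdict

# Hypothetical labelled training records: features plus the known category.
training_data = [
    {"sex": "female", "pclass": 1, "survived": 1},
    {"sex": "female", "pclass": 3, "survived": 1},
    {"sex": "female", "pclass": 3, "survived": 0},
    {"sex": "male", "pclass": 1, "survived": 0},
    {"sex": "male", "pclass": 3, "survived": 0},
    {"sex": "male", "pclass": 2, "survived": 1},
    {"sex": "male", "pclass": 3, "survived": 0},
]


def train(records, feature, label="survived"):
    """Learn the most common label for each value of one feature."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[feature]][record[label]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def predict(model, record, feature, default=0):
    """Look up the learned label; fall back to a default for unseen values."""
    return model.get(record[feature], default)


model = train(training_data, "sex")

# Records whose category we do not know.
unlabelled = [{"sex": "female", "pclass": 2}, {"sex": "male", "pclass": 1}]
predictions = [predict(model, r, "sex") for r in unlabelled]
print(model, predictions)
```

The important point is the split in roles: the labelled records are used only to build `model`, and the model is then applied to records that carry no label.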

That sounds straightforward, but it isn’t. There is a huge number of algorithms that can be trained on our data. A model may be built using a single algorithm, but in most cases multiple models are trained on the data.
To make things a little more complicated, each of these algorithms depends on a range of parameters.
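One way to picture this choice is to score several candidates on labelled records and keep the best one. In the sketch below, two hand-written rules stand in for different algorithms, and the age threshold plays the role of a tunable parameter. The records, rule names and thresholds are hypothetical and chosen only to show the comparison.

```python
# Hypothetical labelled records, invented for illustration only.
validation_rows = [
    {"sex": "female", "age": 30, "survived": 1},
    {"sex": "female", "age": 5, "survived": 1},
    {"sex": "male", "age": 8, "survived": 1},
    {"sex": "male", "age": 35, "survived": 0},
    {"sex": "male", "age": 50, "survived": 0},
    {"sex": "female", "age": 45, "survived": 1},
    {"sex": "male", "age": 16, "survived": 0},
    {"sex": "female", "age": 22, "survived": 1},
]


def sex_rule(row):
    """Candidate 'algorithm' with no parameters."""
    return 1 if row["sex"] == "female" else 0


def make_age_rule(threshold):
    """Candidate 'algorithm' whose behaviour depends on a parameter."""
    def rule(row):
        return 1 if row["age"] < threshold else 0
    return rule


def accuracy(rule, rows):
    """Fraction of rows where the rule's output matches the known label."""
    correct = sum(rule(r) == r["survived"] for r in rows)
    return correct / len(rows)


# Build the candidates: one rule, plus another rule at several parameter values.
candidates = {"sex rule": sex_rule}
for threshold in (10, 18, 40):
    candidates[f"age < {threshold}"] = make_age_rule(threshold)

scores = {name: accuracy(rule, validation_rows) for name, rule in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Even in this toy setting, the same rule scores differently as its parameter changes, which is why choosing an algorithm and tuning its parameters are usually treated as part of the same search.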

Feeding your training data directly to a machine learning algorithm is another common mistake. We have already introduced you to feature engineering and its importance, and there is no running away from it.
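As a small taste of feature engineering, the sketch below derives new features from raw columns before any model sees them. The rows are made up, and the column names `Name`, `SibSp` and `Parch` are assumed to follow the Kaggle Titanic CSV. The derived features (`Title`, `FamilySize`, `IsAlone`) are one common choice, not the only one.

```python
import re

# Hypothetical raw rows; names and values are invented for illustration.
raw_rows = [
    {"Name": "Doe, Mr. John", "SibSp": 1, "Parch": 0},
    {"Name": "Roe, Mrs. Jane (Mary Smith)", "SibSp": 1, "Parch": 2},
    {"Name": "Poe, Miss. Anna", "SibSp": 0, "Parch": 0},
]

# Captures the text between the comma after the surname and the next full stop.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def engineer_features(row):
    """Derive new features from the raw columns of one passenger row."""
    match = TITLE_PATTERN.search(row["Name"])
    title = match.group(1).strip() if match else "Unknown"
    # Siblings/spouses plus parents/children plus the passenger themselves.
    family_size = row["SibSp"] + row["Parch"] + 1
    return {
        "Title": title,
        "FamilySize": family_size,
        "IsAlone": int(family_size == 1),
    }


features = [engineer_features(r) for r in raw_rows]
for f in features:
    print(f)
```

A raw name string is of little use to most algorithms, whereas a short title or a family size is something a model can actually learn from.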

Until Next Time

We will be providing you with the complete series: the process of assessing and analyzing the data, cleaning and transforming it, adding new features, constructing and testing a model, and finally creating the final predictions.

Stay tuned!


Tanishk Sachdeva
