In case you haven’t heard of Kaggle, it’s a data science competition site where companies and organizations provide data sets relevant to a problem they’re facing, and anyone can attempt to build predictive models for them. The teams that best predict the data are rewarded with fame and fortune (or relatively minor sums of money compared to the value they provide, but that’s another post). Some competitions are for learning, some just for fun, and some for the $$$.
One of the current competitions involves data on the passengers of the Titanic. The goal is to predict which passengers survived the sinking, i.e., who escaped the ship and found their way to a lifeboat. This is one of the learning competitions, but it is particularly challenging due to the small sample size available for training, as well as the random, chaotic nature of a group of panicking people.
As this is a learning challenge, a reference solution is provided, and it makes a great starting point. The provided code generates a random forest model in Python using the scikit-learn library. There are a lot of different tools and techniques in scikit-learn that we can employ in pursuit of optimizing the reference model, and we’ll cover many of these over the next couple of weeks. Today’s post will cover the basics of reading in the data and preparing it for feature engineering using the Pandas library, which pairs naturally with scikit-learn (both are built on NumPy).
```python
import pandas as pd

# read in the training and testing data into Pandas.DataFrame objects
input_df = pd.read_csv('data/raw/train.csv', header=0)
submit_df = pd.read_csv('data/raw/test.csv', header=0)

# merge the two DataFrames into one
df = pd.concat([input_df, submit_df])

# re-number the combined data set so there aren't duplicate indexes
df.reset_index(inplace=True)

# reset_index() generates a new column that we don't want, so let's get rid of it
df.drop('index', axis=1, inplace=True)

# restore the columns to the same order as the original training set
df = df.reindex_axis(input_df.columns, axis=1)

print df.shape[1], "columns:", df.columns.values
print "Row count:", df.shape[0]
```
A few things to point out about this script:
- We combine the data from the two files into one for a simple reason: when we engineer features, it’s often useful to know the full range of possible values, as well as the distribution of all known values. This means we’ll need to keep track of which rows are training data and which are test data throughout our processing, but that turns out to not be too difficult (see the sketch after this list).
- We do a fair amount of housekeeping on the DataFrame after combining. Pandas is extremely flexible about combining data sets, and it preserves the original indexes of both frames unless we explicitly tell it otherwise, which is why we reset and re-number the index above.
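To make that round trip concrete, here’s a minimal sketch of how we can split the combined frame back apart later. It assumes the file layout above and relies on the fact that the test file has no Survived column, so those rows come through the concat with Survived set to NaN (the variable names here are just for illustration):

```python
# the test rows have no 'Survived' value, so after the concat they hold
# NaN in that column -- we can use that to recover the train/test split
known_mask = df['Survived'].notnull()

train_df = df[known_mask]   # the 891 labeled training rows
test_df = df[~known_mask]   # the 418 unlabeled rows we'll predict on

print "Training rows:", train_df.shape[0]
print "Test rows:", test_df.shape[0]
```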
Kaggle Titanic Tutorial in Scikit-learn
Part I – Intro
Part II – Missing Values
Part III – Feature Engineering: Variable Transformations
Part IV – Feature Engineering: Derived Variables
Part V – Feature Engineering: Interaction Variables and Correlation
Part VI – Feature Engineering: Dimensionality Reduction w/ PCA
Part VII – Modeling: Random Forests and Feature Importance
Part VIII – Modeling: Hyperparameter Optimization
Part IX – Validation: Learning Curves
Part X – Validation: ROC Curves
Part XI – Summary