In the last two posts, we’ve covered reading in the data set and handling missing values. Now we can start working on transforming the variable values into formatted features that our model can use. Different implementations of the Random Forest algorithm can accept different types of data. Scikit-learn requires everything to be numeric so we’ll have to do some work to transform the raw data.
Variables can generally be classified as one of two types: quantitative or qualitative. Quantitative variables are those whose values can be meaningfully sorted in a manner that indicates an underlying order. In the Titanic data set, Age is a perfect example of a quantitative variable. Qualitative variables describe some aspect of an object or phenomenon in a way that can’t be directly related to other values mathematically; this includes things like names or categories. For example, the Embarked value is the name of a departure port.
Different types of transformations can be applied to different types of variables. Qualitative transformations include:
1) Dummy Variables
Also known as indicator or binary variables, dummy variables are most effective when a qualitative variable has a small number of distinct values that occur somewhat frequently. In the case of the Embarked variable in the Titanic dataset, there are three distinct values: ‘S’, ‘C’, and ‘Q’. We can transform ‘Embarked’ into dummies (so that we can use the information in the scikit-learn RandomForestClassifier code) with some simple code:
import pandas as pd

# Create a dataframe of dummy variables for each distinct value of 'Embarked'
dummies_df = pd.get_dummies(df['Embarked'])

# Rename the columns from 'S', 'C', 'Q' to 'Embarked_S', 'Embarked_C', 'Embarked_Q'
dummies_df = dummies_df.rename(columns=lambda x: 'Embarked_' + str(x))

# Add the new variables back to the original data set
df = pd.concat([df, dummies_df], axis=1)

# (or written as a one-liner):
df = pd.concat([df, pd.get_dummies(df['Embarked']).rename(columns=lambda x: 'Embarked_' + str(x))], axis=1)
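To make the transformation concrete, here is a minimal standalone sketch of what get_dummies produces, using a toy stand-in for the ‘Embarked’ column rather than the actual Titanic data (the `dtype=int` argument simply forces 0/1 integers instead of booleans):

```python
import pandas as pd

# Hypothetical stand-in for the 'Embarked' column
embarked = pd.Series(['S', 'C', 'Q', 'S'])

# One column per distinct value: 1 where the value matches, 0 elsewhere
dummies = pd.get_dummies(embarked, dtype=int).rename(columns=lambda x: 'Embarked_' + str(x))

print(dummies.columns.tolist())      # columns are sorted alphabetically
print(dummies['Embarked_S'].tolist())
```

Each row now carries the same information as the original string, but spread across three numeric columns a model can consume directly.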
2) Factorizing
Pandas has a method called factorize() that creates a numerical categorical variable from any other variable, assigning a unique integer ID to each distinct value encountered. This is especially useful for transforming an alphanumeric categorical variable into a numerical one. Creating a factor variable is similar to creating dummy variables in that it generates a numerical category, but in this case it does so within a single variable.
A categorical variable representing the letter of the Cabin can be created with the following code:
import re
import pandas as pd

# Replace missing values with "U0"
df.loc[df.Cabin.isnull(), 'Cabin'] = 'U0'

# Create a feature for the alphabetical part of the cabin number
df['CabinLetter'] = df['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())

# Convert the distinct cabin letters to incremental integer values
df['CabinLetter'] = pd.factorize(df['CabinLetter'])[0]
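The behaviour of factorize() is easiest to see on a small toy series (hypothetical data, not the Titanic file): it returns a pair of (codes, uniques), with codes assigned in order of first appearance.

```python
import pandas as pd

# Hypothetical stand-in for the 'CabinLetter' column
letters = pd.Series(['C', 'E', 'C', 'U', 'E'])

# factorize() returns (codes, uniques); IDs follow order of first appearance
codes, uniques = pd.factorize(letters)

print(list(codes))    # integer ID per row
print(list(uniques))  # the distinct values, in the order they were first seen
```

Repeated values always map back to the same integer, which is what makes the result usable as a single numeric feature.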
Quantitative transformations include:
3) Scaling
Scaling addresses the issue that, in some models, variables with wildly different scales will influence the model in proportion to the magnitude of their values. For example, Age values will likely max out around 100, while household income values may reach into the millions. Since some models are sensitive to the magnitude of variable values, scaling each variable can help equalize their influence. Scaling can also be performed so as to compress all values into a specific range (typically -1 to 1, or 0 to 1). This isn’t necessary for Random Forest models, but is very helpful in other models you may want to try out with this dataset.
from sklearn import preprocessing

# StandardScaler will subtract the mean from each value, then scale to unit variance.
# Scalers expect a 2-D input, hence the double brackets; ravel() flattens the result
# back into a 1-D column.
scaler = preprocessing.StandardScaler()
df['Age_scaled'] = scaler.fit_transform(df[['Age']]).ravel()
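For the range-compression variant mentioned above, scikit-learn also provides MinMaxScaler, which maps the minimum to 0 and the maximum to 1. A minimal sketch on toy age values (hypothetical data, not the Titanic frame):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages; scalers expect a 2-D array, hence the reshape
ages = np.array([2.0, 26.0, 50.0]).reshape(-1, 1)

# Each value becomes (x - min) / (max - min), landing in [0, 1]
scaled = MinMaxScaler().fit_transform(ages)

print(scaled.ravel())
```

Here 2 maps to 0, 50 maps to 1, and 26 lands exactly halfway at 0.5, since it sits in the middle of the 2-to-50 range.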
4) Binning
Binning is a term for creating quantiles. This allows you to create an ordered, categorical variable out of a range of values, which can be useful for algorithms that make effective use of categorical information (though probably not so great for linear regression).
# Divide all fares into quartiles
df['Fare_bin'] = pd.qcut(df['Fare'], 4)

# qcut() creates a new variable that identifies the quartile range, but we can't use the
# string values directly, so either factorize or create dummies from the result
df['Fare_bin_id'] = pd.factorize(df['Fare_bin'])[0]
df = pd.concat([df, pd.get_dummies(df['Fare_bin']).rename(columns=lambda x: 'Fare_' + str(x))], axis=1)
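A quick standalone illustration of how qcut distributes values across quartiles, using toy fares rather than the real Fare column (passing `labels=False` asks qcut for integer bin IDs directly, skipping the separate factorize step):

```python
import pandas as pd

# Hypothetical fares; qcut splits them into equal-sized quantile bins
fares = pd.Series([5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])

# Four quartile bins, labelled 0..3, usable directly as an ordinal feature
fare_bin_id = pd.qcut(fares, 4, labels=False)

print(list(fare_bin_id))
```

With eight evenly spread values, each quartile bin receives exactly two of them, so the IDs step up in pairs.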
Kaggle Titanic Tutorial in Scikit-learn
Part I – Intro
Part II – Missing Values
Part III – Feature Engineering: Variable Transformations
Part IV – Feature Engineering: Derived Variables
Part V – Feature Engineering: Interaction Variables and Correlation
Part VI – Feature Engineering: Dimensionality Reduction w/ PCA
Part VII – Modeling: Random Forests and Feature Importance
Part VIII – Modeling: Hyperparameter Optimization
Part IX – Validation: Learning Curves
Part X – Validation: ROC Curves
Part XI – Summary