Titanic

16 Dec, 2014

Kaggle Titanic Competition Part XI – Summary

December 16th, 2014 | 6 Comments

This series was probably too long! I can't even remember the beginning, but once I started I figured I may as well be thorough. Hopefully it will provide some assistance to people who are getting started with scikit-learn and could use a little guidance on the basics. All the code is up on Github with instructions for running it locally. If anyone tries it out and has any issues running it on their machine, please let me know! I'll update the README with whatever steps are missing. Thoughts: It can be tricky figuring out useful ways to transform string features, but [...]

16 Dec, 2014

Kaggle Titanic Competition Part X – ROC Curves and AUC

December 16th, 2014 | 0 Comments

In the last post, we looked at how to generate and interpret learning curves to validate how well our model is performing. Today we'll take a look at another popular diagnostic for the same purpose. The Receiver Operating Characteristic (ROC) curve is a chart that illustrates how the true positive rate and false positive rate of a binary classifier vary as the discrimination threshold changes. Did that make any sense? Probably not, but hopefully it will by the time we're finished. An important thing to keep in mind is that ROC is all about [...]
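To make the "threshold changes" idea a bit more concrete, here is a minimal sketch in plain Python. It is not the code from the post: the helper names `roc_points` and `auc` and the four-sample toy data are made up for illustration. It sweeps the threshold over each distinct score, records the false positive rate and true positive rate at each step, and then approximates the area under the curve with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (threshold, FPR, TPR) tuples, highest threshold first.

    Hypothetical helper for illustration; labels must be 0 or 1.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under the ROC points, starting from (0, 0)."""
    area = 0.0
    prev_fpr, prev_tpr = 0.0, 0.0
    for _, fpr, tpr in points:
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area


# Toy data (assumed, not from the Titanic set): two negatives, two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
roc_auc = auc(points)

for threshold, fpr, tpr in points:
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
print(f"AUC={roc_auc:.2f}")
```

Lowering the threshold can only add predicted positives, so both rates move from 0 toward 1 as the sweep progresses. In practice you would probably reach for a library routine rather than hand-rolling this.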

12 Dec, 2014

Kaggle Titanic Competition Part IX – Bias, Variance, and Learning Curves

December 12th, 2014 | 0 Comments

In the previous post, we took a look at how we can search for the best set of hyperparameters to provide to our model. Our measure of "best" in this case is to minimize the cross-validated error. We can be reasonably confident that we're doing about as well as we can with the features we've provided and the model we've chosen. But before we can run off and use this model on totally new data with any confidence, we would like to do a little validation to get an idea of how the model will do out in the wild. [...]

3 Dec, 2014

Kaggle Titanic Competition Part VIII – Hyperparameter Optimization

December 3rd, 2014 | 0 Comments

In the last post, we generated our first Random Forest model with mostly default parameters so that we could get an idea of how important the features are. From that we can further reduce the dimensionality of our data set by throwing out some arbitrary amount of the weakest features. We could continue experimenting with the threshold with which to remove "weak" features, or even go back and experiment with the correlation and PCA thresholds as well to modify how many parameters we end up with... but we'll move forward with what we've got. Now that we've got our [...]

1 Dec, 2014

Kaggle Titanic Competition Part VII – Random Forests and Feature Importance

December 1st, 2014 | 0 Comments

In the last post we took a look at how to reduce noisy variables from our data set using PCA, and today we'll actually start modelling! Random Forests are one of the easiest models to run, and highly effective as well. A great combination for sure. If you're just starting out with a new problem, this is a great way to quickly build a reference model. There aren't a whole lot of parameters to tune, which makes it very user-friendly. The primary parameters include how many decision trees to include in the forest, how much data to include in [...]
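As a rough illustration of that "reference model" idea, the sketch below fits a scikit-learn `RandomForestClassifier` and ranks the features by importance. It is not the post's code: the synthetic data from `make_classification`, the choice of 100 trees, and the `feature_N` labels are all assumptions made so the example can run on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a prepared feature matrix (assumed, not Titanic data).
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# n_estimators is the number of decision trees in the forest;
# everything else is left at its default value.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# feature_importances_ holds one score per column, normalized to sum to 1.
importances = forest.feature_importances_
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)

for index, score in ranked:
    print(f"feature_{index}: {score:.3f}")
```

The columns at the bottom of the ranking are the candidates for removal, though where to draw the cutoff is a judgment call.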

26 Nov, 2014

Kaggle Titanic Competition Part VI – Dimensionality Reduction

November 26th, 2014 | 0 Comments

In the last post, we looked at how to use an automated process to generate a large number of non-correlated variables. Now we're going to look at a very common way to reduce the number of features that we use in modelling. You may be wondering why we'd remove variables we just took the time to create. The answer is pretty simple: sometimes it helps. If you think about a predictive model in terms of finding a "signal" or "pattern" in the data, it makes sense that you want to remove noise in the data that hides the [...]
