scikit-learn

15 Dec, 2016

Using Category Encoders library in Scikit-learn

December 15th, 2016 | 1 Comment

I recently found a relatively new library on GitHub for handling categorical features, named categorical_encoding, and decided to give it a spin. As a reminder, categorical features are variables in your data that have a finite (ideally small) set of possible values, for example months of the year or hair color. You can't feed these into predictive models as raw text, so some conversion is necessary to make these variables usable. Typically, you create a new, separate column for each possible value (or alternatively, depending on the intended model, n-1 values) and each of these new [...]
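The column-per-value scheme described in this excerpt is commonly called one-hot encoding. Here is a minimal, library-free sketch of the idea. It does not use the categorical_encoding library's API, and the `one_hot` helper, its `drop_first` flag, and the sample hair colors are all made up for illustration.

```python
def one_hot(values, drop_first=False):
    """Return (column_names, rows) with one 0/1 indicator column per category.

    Hypothetical helper for illustration only; not part of any library.
    """
    categories = sorted(set(values))
    if drop_first:
        # n-1 encoding: the dropped category is implied when all columns are 0
        categories = categories[1:]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


hair = ["brown", "blonde", "black", "brown"]  # made-up sample data

columns, encoded = one_hot(hair)
reduced_columns, reduced = one_hot(hair, drop_first=True)

print(columns)
for row in encoded:
    print(row)
```

With `drop_first=True`, the row for "black" becomes all zeros, which is how the n-1 variant still identifies every category.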

18 Nov, 2016

TF-IDF Basics with Pandas and Scikit-Learn

November 18th, 2016 | 7 Comments

In a previous post we took a look at some basic approaches for preparing text data to be used in predictive models. In this post, we'll use pandas and scikit-learn to turn the product "documents" we prepared into a Tf-idf weight matrix that can be used as the basis of a feature set for modeling. What is Tf-idf? Tf-idf is a very common technique for determining roughly what each document in a set of documents is "about". It cleverly accomplishes this by looking at two simple metrics: tf (term frequency) and idf (inverse document frequency). Term frequency is [...]
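To make the two metrics concrete, here is a small library-free sketch. It assumes one common textbook definition (tf as a term's share of the document's tokens, idf as log(N / document frequency)), which is not necessarily the formula the full post uses. scikit-learn's `TfidfVectorizer` applies a smoothed variant by default, so its numbers will differ. The `tf_idf` helper and the sample product strings are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed definitions (one common variant, not the only one):
      tf  = term count / number of tokens in the document
      idf = log(number of documents / number of documents containing the term)
    Documents are assumed to be non-empty strings.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["red cotton shirt", "blue cotton shirt", "red leather boots"]  # made-up
weights = tf_idf(docs)

for doc, doc_weights in zip(docs, weights):
    print(doc, doc_weights)
```

Terms that appear in only one document ("leather", "boots") get a higher weight than terms shared across documents ("red"), which is the sense in which the weights hint at what each document is "about".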

24 Jun, 2016

A Shiny New Python Data Science Sandbox in 30 Minutes Or Less

June 24th, 2016 | 5 Comments

This post will give beginners a full walkthrough to go from nothing to a fully functional Linux/Python/pandas/scikit-learn environment with Jupyter as a front end. For exploratory work, I really like this stack. My native OS is Windows, but since we're using VMs I would imagine the setup for OS X is very similar and probably won't need any modification (other than the steps for configuring the VM). If you have a solid internet connection, we should be able to get this all done in under 30 minutes startiiiinnnnnng NOW... 1. Download an Ubuntu Desktop version of your choice. I like [...]

16 Dec, 2014

Kaggle Titanic Competition Part XI – Summary

December 16th, 2014 | 6 Comments

This series was probably too long! I can't even remember the beginning, but once I started I figured I may as well be thorough. Hopefully it will provide some assistance to people who are getting started with scikit-learn and could use a little guidance on the basics. All the code is up on GitHub with instructions for running it locally. If anyone tries it out and has any issues running it on their machine, please let me know! I'll update the README with whatever steps are missing. Thoughts: It can be tricky figuring out useful ways to transform string features, but [...]

16 Dec, 2014

Kaggle Titanic Competition Part X – ROC Curves and AUC

December 16th, 2014 | 0 Comments

In the last post, we looked at how to generate and interpret learning curves to validate how well our model is performing. Today we'll take a look at another popular diagnostic for the same purpose. The Receiver Operating Characteristic (ROC) curve is a chart that illustrates how the true positive rate and false positive rate of a binary classifier vary as the discrimination threshold changes. Did that make any sense? Probably not, but hopefully it will by the time we're finished. An important thing to keep in mind is that ROC is all about [...]
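As a rough illustration of that definition, the sketch below sweeps the threshold over a handful of scores and records the (false positive rate, true positive rate) pair at each step, then estimates AUC with the trapezoid rule. The labels and scores are made up, and `roc_points` and `auc` are hypothetical helpers, not the code from the series. In practice scikit-learn's `roc_curve` and `roc_auc_score` do this work.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    Assumes y_true holds 0/1 labels with at least one of each class.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above every score (nothing predicted positive), then lower the bar.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)

    points = []
    for threshold in thresholds:
        predicted = [score >= threshold for score in scores]
        true_pos = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        false_pos = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the curve via the trapezoid rule over consecutive points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


y_true = [0, 0, 1, 1]             # made-up labels
scores = [0.1, 0.4, 0.35, 0.8]    # made-up classifier scores

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(area)
```

Each point answers the same question at a different threshold: how many of the real positives were caught, and how many negatives were flagged by mistake to catch them.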

12 Dec, 2014

Kaggle Titanic Competition Part IX – Bias, Variance, and Learning Curves

December 12th, 2014 | 0 Comments

In the previous post, we took a look at how we can search for the best set of hyperparameters to provide to our model. Our measure of "best" in this case is the set that minimizes the cross-validated error. We can be reasonably confident that we're doing about as well as we can with the features we've provided and the model we've chosen. But before we can run off and use this model on totally new data with any confidence, we would like to do a little validation to get an idea of how the model will do out in the wild. [...]
