Dave

15 Dec, 2016

Using Category Encoders library in Scikit-learn

Dave | December 15th, 2016 | 1 Comment

I recently found a relatively new library on GitHub for handling categorical features named categorical_encoding and decided to give it a spin. As a reminder, categorical features are variables in your data that have a finite (ideally small) set of possible values, for example months of the year or hair color. You can't feed these into predictive models as raw text, so some conversion is necessary to make these variables usable. Typically, you create a new, separate column for each possible value (or alternatively, depending on the intended model, n-1 values) and each of these new [...]
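As a rough illustration of the "one column per value" idea in the excerpt above, here is a minimal sketch using plain pandas rather than the categorical_encoding library itself. The `hair_color` column and its values are made up for the example.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
df = pd.DataFrame({"hair_color": ["brown", "blond", "red", "brown"]})

# One new indicator column for each possible value.
full = pd.get_dummies(df["hair_color"], prefix="hair")

# The n-1 variant: drop the first level, which becomes the implicit baseline.
reduced = pd.get_dummies(df["hair_color"], prefix="hair", drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

The n-1 form is usually the one to reach for with linear models, where a full set of indicator columns is perfectly collinear with the intercept.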

18 Nov, 2016

TF-IDF Basics with Pandas and Scikit-Learn

Dave | November 18th, 2016 | 7 Comments

In a previous post we took a look at some basic approaches for preparing text data to be used in predictive models. In this post, we'll use pandas and scikit-learn to turn the product "documents" we prepared into a Tf-idf weight matrix that can be used as the basis of a feature set for modeling. What is Tf-idf? Tf-idf is a very common technique for determining roughly what each document in a set of documents is "about". It cleverly accomplishes this by looking at two simple metrics: tf (term frequency) and idf (inverse document frequency). Term frequency is [...]
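To make the two metrics concrete, here is a small pure-Python sketch of a textbook Tf-idf weighting. It is an assumption-laden illustration, not the post's code: the three product "documents" are invented, and the formula (relative term frequency times log(N / document frequency)) is the plain textbook form. scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical product "documents"; not taken from the original post.
docs = [
    "cordless drill battery",
    "cordless drill bit set",
    "garden hose",
]

def tf_idf(documents):
    """Return one {term: weight} dict per document (textbook tf * idf)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc, doc_weights in zip(docs, weights):
    print(doc, doc_weights)
```

In the first document, "battery" gets a higher weight than "drill" because it appears in fewer documents, which is the sense in which the weights hint at what a document is "about".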

24 Jun, 2016

A Shiny New Python Data Science Sandbox in 30 Minutes Or Less

Dave | June 24th, 2016 | 5 Comments

This post will give beginners a full walkthrough to go from nothing to a fully functional Linux/Python/pandas/scikit-learn environment with Jupyter as a front end. For exploratory work, I really like this stack. My native OS is Windows, but since we're using VMs I would imagine the setup for OS X is very similar and probably won't need any modification (other than steps for configuring the VM). If you have a solid internet connection, we should be able to get this all done in under 30 minutes startiiiinnnnnng NOW... 1. Download an Ubuntu Desktop version of your choice. I like [...]

20 May, 2016

Investigating missing data with missingno

Dave | May 20th, 2016 | 0 Comments

I recently came across a new Python package for visualizing missing elements of a data set. This is super useful when you're taking your first look at a new data set and trying to get a feel for what you're working with. Having a sense of the completeness of the data can help inform decisions about how best to handle missing values. In this post, we'll take a quick look at the small and simple Shelter Animal Outcomes data set from one of the current Kaggle competitions. The first visualization is the "matrix" display. This is a representation of [...]

10 May, 2016

Text Pre-processing Basics with Pandas

Dave | May 10th, 2016 | 4 Comments

In this post, we'll take a look at the data provided in Kaggle's Home Depot Product Search Relevance challenge to demonstrate some techniques that may be helpful in getting started with feature generation for text data. Dealing with text data is considerably different from dealing with numerical data, so there are a few basic approaches that are an excellent place to start. As always, before we start creating features we'll need to clean and massage the data! In the Home Depot challenge, we have a few files which provide attributes and descriptions of each of the products on their website. The [...]

7 Jul, 2015

Recommend-ify Zillow

Dave | July 7th, 2015 | 0 Comments

I love Zillow. It's such an amazing search interface for real estate. But that's it... it's just a search interface. And because it's just search, I have to sort through good properties and bad. Maybe that situation benefits their business model, which I won't pretend to know. However, with a little data science we could take the treasure trove of data they already have, add a few UI elements to capture some more, and provide personalized recommendations to house hunters. Users who get what they want quickly are happy customers! Let's look at what they could do to increase [...]
