TF-IDF Basics with Pandas and Scikit-Learn
In a previous post we took a look at some basic approaches for preparing text data to be used in predictive models. In this post, well use pandas and scikit learn to turn the product "documents" we prepared into a Tf-idf weight matrix that can be used as the basis of a feature set for modeling. What is Tf-idf? Tf-idf is a very common technique for determining roughly what each document in a set of documents is "about". It cleverly accomplishes this by looking at two simple metrics: tf (term frequency) and idf (inverse document frequency). Term frequency is [...]