scikit-learn – Page 2 – Ultraviolet Analytics

3 Dec, 2014

Kaggle Titanic Competition Part VIII – Hyperparameter Optimization

Dave | December 3rd, 2014 | 0 Comments

In the last post, we generated our first Random Forest model with mostly default parameters so that we could get an idea of how important the features are. From that we can further reduce the dimensionality of our data set by throwing out some arbitrary amount of the weakest features. We could continue experimenting with the threshold with which to remove "weak" features, or even go back and experiment with the correlation and PCA thresholds as well to modify how many parameters we end up with... but we'll move forward with what we've got. Now that we've got our [...]
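
The idea of discarding the weakest features by an importance threshold can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `drop_weak_features`, the feature names, the scores, and the 0.10 cutoff are all hypothetical and do not come from the post.

```python
def drop_weak_features(importances, threshold):
    """Return the names of features whose importance is at least `threshold`.

    `importances` maps a feature name to an importance score, for example
    the values a fitted Random Forest reports for each input column.
    """
    return [name for name, score in importances.items() if score >= threshold]


# Hypothetical scores, made up purely for illustration.
importances = {
    "feature_a": 0.40,
    "feature_b": 0.30,
    "feature_c": 0.20,
    "feature_d": 0.07,
    "feature_e": 0.03,
}

# The cutoff is arbitrary; raising it keeps fewer features.
kept = drop_weak_features(importances, threshold=0.10)
print(kept)
```

Because the threshold is a free choice, it can be treated like any other setting and compared across a few candidate values.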

1 Dec, 2014

Kaggle Titanic Competition Part VII – Random Forests and Feature Importance

Dave | December 1st, 2014 | 0 Comments

In the last post we took a look at how to reduce noisy variables in our data set using PCA, and today we'll actually start modelling! Random Forests are one of the easiest models to run, and highly effective as well. A great combination for sure. If you're just starting out with a new problem, this is a great way to quickly build a reference model. There aren't a whole lot of parameters to tune, which makes it very user friendly. The primary parameters include how many decision trees to include in the forest, how much data to include in [...]

26 Nov, 2014

Kaggle Titanic Competition Part VI – Dimensionality Reduction

Dave | November 26th, 2014 | 0 Comments

In the last post, we looked at how to use an automated process to generate a large number of non-correlated variables. Now we're going to look at a very common way to reduce the number of features that we use in modelling. You may be wondering why we'd remove variables we just took the time to create. The answer is pretty simple - sometimes it helps. If you think about a predictive model in terms of finding a "signal" or "pattern" in the data, it makes sense that you want to remove noise in the data that hides the [...]

10 Nov, 2014

Kaggle Titanic Competition Part V – Interaction Variables

Dave | November 10th, 2014 | 0 Comments

In the last post we covered some ways to derive variables from string fields using intuition and insight. This time we'll cover derived variables that are a lot easier to generate. Interaction variables capture effects of the relationship between variables. They are constructed by performing mathematical operations on sets of features. The simple approach that we use in this example is to apply basic operators (add, subtract, multiply, divide) to each pair of numerical features. We could also get much more involved and include more than 2 features in each calculation, and/or use other operators (sqrt, ln, trig functions, [...]
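
As a rough illustration of the pairwise approach described in this excerpt, here is a minimal sketch in plain Python. The function name `interaction_features`, the dict-per-row layout, and the toy column names are assumptions made for this example and are not the post's actual code. Division is only computed in one direction per pair, and it is skipped when the denominator is zero.

```python
from itertools import combinations


def interaction_features(row):
    """Build pairwise interaction features from a dict of numeric values.

    For every unordered pair of columns (a, b), taken in sorted name order,
    this emits a+b, a-b, a*b and, when b is non-zero, a/b.
    """
    out = {}
    for a, b in combinations(sorted(row), 2):
        x, y = row[a], row[b]
        out[f"{a}+{b}"] = x + y
        out[f"{a}-{b}"] = x - y
        out[f"{a}*{b}"] = x * y
        # Skip the ratio rather than invent a value for division by zero.
        if y != 0:
            out[f"{a}/{b}"] = x / y
    return out


# Toy row with made-up values.
row = {"age": 30.0, "fare": 10.0, "siblings": 0.0}
features = interaction_features(row)
print(sorted(features))
```

With n numeric columns this produces up to 4 * n * (n - 1) / 2 new columns, so the feature count grows quickly as columns are added.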

5 Nov, 2014

Kaggle Titanic Competition Part III – Variable Transformations

Dave | November 5th, 2014 | 4 Comments

In the last two posts, we've covered reading in the data set and handling missing values. Now we can start working on transforming the variable values into formatted features that our model can use. Different implementations of the Random Forest algorithm can accept different types of data. Scikit-learn requires everything to be numeric so we'll have to do some work to transform the raw data. All possible data can be generally considered as one of two types: Quantitative and Qualitative. Quantitative variables are those whose values can be meaningfully sorted in a manner that indicates an underlying order. In [...]

3 Nov, 2014

Kaggle Titanic Competition Part II – Missing Values

Dave | November 3rd, 2014 | 4 Comments

There will be missing/incorrect data in nearly every non-trivial data set a data scientist ever encounters. It is as certain as death and taxes. This is especially true with big data and applies to data generated by humans in a social context or by computer systems/sensors. Some predictive models inherently are able to deal with missing data (neural networks come to mind) and others require that the missing values be dealt with separately. The RandomForestClassifier model in scikit-learn is not able to handle missing values, so we'll need to use some different approaches to assign values before training the [...]
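
One simple way to assign values before training is to fill each gap with the column median. The sketch below uses only the standard library; the function name `fill_missing_with_median`, the use of `None` to mark a missing entry, and the toy values are assumptions for illustration. It is not necessarily the approach the post goes on to use.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Toy column with made-up values; None marks a missing entry.
ages = [22.0, None, 38.0, 26.0, None]
filled = fill_missing_with_median(ages)
print(filled)
```

The median is less sensitive to outliers than the mean, which is one reason it is a common default for skewed numeric columns.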