Kaggle Digit Recognizer: A Feature Extraction Fail in R
Feature extraction means deriving new features to train a classifier on, rather than relying solely on the raw pixel values we were given.
5. The Space between the Data Set and the Algorithm
Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.
One needs to conduct exploratory data analysis as I’ve emphasized; and conduct feature selection as Will Cukierski emphasized.
I've highlighted the part of the post which describes exactly what we’ve been doing!
I created features for the number of non-zero pixels, the number of pixels with the value 255, the mean pixel value and the mean of the pixels in the middle of the digit.
The code reads like this:
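initial <- read.csv("train.csv", header = TRUE)
# Use the pixel columns only, so that the label and the feature columns
# added below are not treated as pixels
pixels <- initial[, names(initial) != "label"]
initial$nonZeros <- apply(pixels, 1, function(entries) sum(entries != 0))
initial$fullHouses <- apply(pixels, 1, function(entries) sum(entries == 255))
initial$meanPixels <- apply(pixels, 1, mean)
# Columns 200 to 500 of the original data frame, roughly the middle rows of the 28x28 image
initial$middlePixels <- apply(initial[, 200:500], 1, mean)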
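For what it’s worth, the count and mean features can also be computed without `apply`, using `rowSums` and `rowMeans`. The sketch below runs on a made-up four-pixel data frame (the `toy` data and its values are invented for illustration, not taken from train.csv) and restricts the calculation to the pixel columns so that the label isn’t counted as a pixel.

```r
# Toy stand-in for train.csv: a label plus four "pixel" columns (hypothetical data)
toy <- data.frame(
  label  = c(3, 7),
  pixel0 = c(0, 255),
  pixel1 = c(255, 0),
  pixel2 = c(128, 0),
  pixel3 = c(0, 64)
)

# Work on the pixel columns only so the label does not affect the features
pixels <- as.matrix(toy[, names(toy) != "label"])

toy$nonZeros   <- rowSums(pixels != 0)    # number of non-zero pixels per row
toy$fullHouses <- rowSums(pixels == 255)  # number of pixels equal to 255 per row
toy$meanPixels <- rowMeans(pixels)        # mean pixel value per row

print(toy)
```

On the full training set this should give the same kind of columns as the `apply` version, and the vectorised functions are typically quicker on tens of thousands of rows.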
I then wrote those features out into a CSV file like so:
newFeatures <- subset(initial, select = c(label, nonZeros, meanPixels, fullHouses, middlePixels))
write.table(newFeatures, file = "feature-extraction.txt", row.names = FALSE, sep = ",")
I then created a 100-tree random forest using Mahout to see whether we could get any sort of accuracy using these features.
Unfortunately, the accuracy on the cross-validation set (10% of the training data) was only 24%, which is pretty useless, so it’s back to the drawing board!
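The forest itself was built in Mahout, which isn’t shown here, but the idea of holding out 10% of the data and measuring accuracy on it can be sketched in R. Everything below is an assumption made for illustration: the `features` data frame is randomly generated, the seed is arbitrary, and the "prediction" is just a majority-class guess that serves as a naive baseline to compare an accuracy figure against.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for the extracted features (random values, 10 rows per digit)
features <- data.frame(
  label    = rep(0:9, each = 10),
  nonZeros = sample(50:250, 100, replace = TRUE)
)

# Randomly hold out 10% of the rows as the validation set
n <- nrow(features)
holdoutIdx <- sample(n, size = round(0.1 * n))
validation <- features[holdoutIdx, ]
training   <- features[-holdoutIdx, ]

# Naive baseline: always predict the most common label in the training rows
majority <- as.integer(names(which.max(table(training$label))))

# Accuracy is the share of validation rows whose prediction matches the true label
baseline <- mean(validation$label == majority)
print(baseline)
```

With ten roughly balanced digit classes, a guess like this would be expected to land somewhere around 10%, which gives a rough floor for judging a classifier’s accuracy.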
Our next task is to work out whether we can derive features which have a stronger correlation with the label values, or whether combining the new features with the existing pixel values has any impact.
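One rough way to check whether a feature carries any information about the label is to compare its mean across the digits. This is a minimal sketch using a made-up miniature of the features file (the values in `newFeatures` here are invented, and `perLabel` is just an illustrative name), not a result from the real data.

```r
# Hypothetical miniature of feature-extraction.txt (made-up values)
newFeatures <- data.frame(
  label      = c(0, 0, 1, 1, 8, 8),
  nonZeros   = c(190, 200, 80, 90, 210, 220),
  meanPixels = c(44, 46, 19, 21, 50, 52)
)

# Mean of each feature per digit: a feature whose means barely differ
# between labels is unlikely to help a classifier tell those digits apart
perLabel <- aggregate(cbind(nonZeros, meanPixels) ~ label,
                      data = newFeatures, FUN = mean)
print(perLabel)
```

As for combining the new features with the pixel values, the `initial` data frame already holds both, so writing out `initial` itself rather than the `subset(...)` selection would be one way to get a combined training file.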
As you can probably tell, I don’t really understand how you should go about extracting features, so if anybody has ideas or papers/articles I can read to learn more, please let me know in the comments!