In case you missed them, here's a curated collection of this week's best and most informative posts from the Big Data Zone. This week: hazards to watch out for while tuning Hadoop & Cassandra, what the Dept. of Homeland Security knows about data, how regression models aren't just about interpretation, an introduction to Apache Spark (in Python), and clustering customers for machine learning with Hadoop and Mahout.
When running Hadoop jobs against Cassandra, you will want to be careful about a few parameters. Specifically, pay special attention to vNodes, splits, and page sizes.
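As a rough illustration of where those knobs live (a sketch only; property names are from Cassandra's Hadoop integration of that era and may differ by version):

```yaml
# cassandra.yaml -- with vnodes enabled, each node owns num_tokens
# token ranges, and a Hadoop job creates input splits per range,
# so a high vnode count can mean a large number of small splits:
num_tokens: 256

# Hadoop job configuration (typically set via ConfigHelper /
# CqlConfigHelper in the job driver; names are assumptions here):
#   cassandra.input.split.size     -- target rows per input split
#   cassandra.input.page.row.size  -- rows fetched per page by the reader
```

The full post walks through how these interact and the hazards of leaving them at their defaults.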
What does the Department of Homeland Security know that you don’t know? OK, that’s a trick question. The answer (in this case) is this: It knows how to get from data to information.
I recently published a post in which I tried to show that "standard" regression models do not perform badly. But my post was not complete: I was simply plotting the predictions obtained by each model. The regression "looked" nice, but so did the random forest, the k-nearest neighbour, and the boosting algorithm. What if we compare those models on new data?
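The point, comparing models on data they were not fit on, can be sketched in a few lines of plain Python. This is a toy illustration with made-up data, not the post's own code: we fit an ordinary least-squares line and a 1-nearest-neighbour predictor on a training set, then score both on a held-out test set.

```python
import random

random.seed(0)

# Toy data: y = 2x + Gaussian noise
data = [(x, 2 * x + random.gauss(0, 1)) for x in [i / 10 for i in range(100)]]
random.shuffle(data)
train, test = data[:70], data[70:]

# "Standard" regression: simple least-squares line fit
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = sum((x - mx) * (y - my) for x, y in train) / \
        sum((x - mx) ** 2 for x, _ in train)
intercept = my - slope * mx

def predict_lm(x):
    return intercept + slope * x

def predict_1nn(x):
    # 1-nearest-neighbour: copy the response of the closest training point
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(predict):
    # Mean squared error on the held-out test set
    return sum((predict(x) - y) ** 2 for x, y in test) / len(test)

mse_lm = mse(predict_lm)
mse_1nn = mse(predict_1nn)
print("linear regression test MSE:", mse_lm)
print("1-NN test MSE:", mse_1nn)
```

On fresh data the rankings can differ from what the in-sample plots suggested, which is exactly the post's argument for out-of-sample comparison.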
Apache Spark is a fast and general-purpose cluster computing system. The latest version can be downloaded from http://spark.apache.org/downloads.html. In this post, we will try to perform some basic data manipulations using Spark and Python.
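Spark's RDD API is built on functional primitives: a word count, for example, chains `flatMap`, `map`, and `reduceByKey` on an RDD. To show what those transformations compute (without assuming a Spark install), here is the same pipeline sketched with plain Python data structures; Spark's value is that it distributes each stage across a cluster.

```python
lines = ["to be or not to be", "that is the question"]

# flatMap: split each line into words
# (PySpark: rdd.flatMap(lambda line: line.split()))
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
# (PySpark: .map(lambda w: (w, 1)))
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
# (PySpark: .reduceByKey(lambda a, b: a + b))
counts = {}
for w, c in pairs:
    counts[w] = counts.get(w, 0) + c

print(counts)
```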
We wanted to create and test a solution that allowed us to group together similar customers using different sets of dimensions, depending on the information we wanted to provide or obtain. This would be a very rough implementation that would allow us to prove out certain techniques and solutions for this type of problem -- it certainly would NOT cover all the nuances that machine learning algorithms and analysis carry with them. This post covers the implementation of the solution.
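The clustering technique behind such a solution (Mahout runs k-means as Hadoop jobs over vectorized customer records) can be sketched in miniature. This is a hypothetical single-machine illustration, with made-up customer dimensions, of the algorithm Mahout distributes:

```python
import random

random.seed(42)

# Hypothetical customer vectors: (monthly_spend, visits_per_month),
# drawn from two distinct groups so clustering has something to find
customers = (
    [(random.gauss(20, 3), random.gauss(2, 0.5)) for _ in range(20)]
    + [(random.gauss(200, 20), random.gauss(12, 2)) for _ in range(20)]
)

def dist2(a, b):
    # Squared Euclidean distance between two vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return centroids, clusters

centroids, clusters = kmeans(customers, k=2)
print([len(c) for c in clusters])
```

Swapping in a different set of dimensions (the post's point) just means building the customer vectors from different attributes before clustering.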