Big Data Zone: Best of the Week (May 3-10)


In case you missed them, here's a curated collection of the best and most informative posts from this week in the Big Data Zone. This week: hazards to watch out for while tuning Hadoop and Cassandra, what the Department of Homeland Security knows about data, why regression models aren't just about interpretation, an introduction to Apache Spark (in Python), and clustering customers for machine learning with Hadoop and Mahout.

1. Tuning Hadoop & Cassandra: Beware of vNodes, Splits and Pages

When running Hadoop jobs against Cassandra, you will want to be careful with a few parameters. Specifically, pay special attention to vNodes, splits, and page sizes.
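As a rough illustration of where those knobs live, here is a hedged config sketch: `num_tokens` is the real vNode setting in `cassandra.yaml`, while the split- and page-size keys below are the property names as set by Cassandra's Hadoop `ConfigHelper`; the exact key names and sensible values vary by version, so verify against your own cluster.

```
# cassandra.yaml -- many vNodes per node (the default) can explode the
# number of tiny Hadoop input splits; a single token avoids that.
num_tokens: 1

# Hadoop job configuration (key names as used by Cassandra's
# ConfigHelper/CqlConfigHelper; values here are illustrative only)
cassandra.input.split.size=16384
cassandra.input.page.row.size=1000
```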

2. What the Department of Homeland Security Knows About Data

What does the Department of Homeland Security know that you don’t know? OK, that’s a trick question. The answer (in this case) is this: It knows how to get from data to information.

3. Regression Models: It's Not Only About Interpretation

I recently published a post in which I tried to show that "standard" regression models do not perform badly. But that post was incomplete: I was simply plotting the predictions obtained from each model. The regression "looked" nice, but so did the random forest, the k-nearest-neighbour, and the boosting algorithms. What if we compare those models on new data?
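The comparison the post calls for can be sketched in a few lines of plain Python (a toy example, not the article's code): fit a simple linear regression and a 1-nearest-neighbour predictor on one sample, then score both on data they never saw.

```python
import random

# Toy data: y = 2x + 1 plus noise. Train and test are drawn separately,
# so the test set plays the role of "new data".
random.seed(0)
train = [(i / 10, 2 * (i / 10) + 1 + random.gauss(0, 0.1)) for i in range(50)]
test = [(i / 10 + 0.05, 2 * (i / 10 + 0.05) + 1 + random.gauss(0, 0.1)) for i in range(50)]

# Least-squares fit of y = a*x + b on the training data.
n = len(train)
sx = sum(x for x, _ in train); sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train); sxy = sum(x * y for x, y in train)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def linear(x):
    return a * x + b

def nearest(x):
    # 1-NN: predict the y of the closest training point.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print("linear MSE:", mse(linear, test))
print("1-NN MSE:  ", mse(nearest, test))
```

On held-out data the memorizing 1-NN model pays for chasing the training noise, which is exactly the point: pictures of in-sample fit alone cannot rank the models.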

4. Introduction to Apache Spark

Apache Spark is a fast, general-purpose cluster computing system. The latest version can be downloaded from http://spark.apache.org/downloads.html. In this post, we perform some basic data manipulations using Spark and Python.
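Spark's basic manipulations are transformations like `map` and `filter` followed by an action like `reduce`. As a plain-Python stand-in (no cluster required; with PySpark installed, the same pipeline would start from `sc.parallelize(data)` and chain the analogous RDD methods), the shape of such a pipeline looks like:

```python
from functools import reduce

# Illustrative input; in PySpark this would be sc.parallelize(data).
data = [1, 2, 3, 4, 5, 6]

# Transformation: square each element (Spark: rdd.map(lambda x: x * x)).
squared = map(lambda x: x * x, data)

# Transformation: keep even values (Spark: .filter(lambda x: x % 2 == 0)).
evens = filter(lambda x: x % 2 == 0, squared)

# Action: sum what survives (Spark: .reduce(lambda a, b: a + b)).
total = reduce(lambda a, b: a + b, evens)
print(total)  # 4 + 16 + 36 = 56
```

The key difference in real Spark is laziness: the transformations build a plan, and nothing executes until the `reduce` action runs.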

5. Clustering Customers for Machine Learning With Hadoop and Mahout

We wanted to create and test a solution that allowed us to group together similar customers using different sets of dimensions, depending on the information we wanted to provide or obtain. This would be a very rough implementation that would allow us to prove out certain techniques and solutions for this type of problem -- it certainly would NOT cover all the nuances that machine learning algorithms and analysis carry with them. This post covers the implementation of the solution.
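The clustering idea itself can be sketched without the Hadoop/Mahout machinery. Below is a minimal pure-Python k-means on made-up customer vectors (the dimensions, values, and k=2 are illustrative assumptions, not from the article; Mahout runs the same assignment/update loop at scale as MapReduce jobs):

```python
import math

# Hypothetical customers described by two dimensions, e.g.
# (monthly spend, purchase frequency) -- values are invented.
customers = [
    [1.0, 0.9], [1.2, 1.1], [0.8, 1.0],   # low spend / low frequency
    [8.0, 7.5], [7.8, 8.2], [8.3, 7.9],   # high spend / high frequency
]

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            [sum(xs) / len(xs) for xs in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(customers, [customers[0], customers[3]])
print(centroids)
```

Changing which dimensions go into the vectors changes the grouping, which is precisely the flexibility the solution is after.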

Opinions expressed by DZone contributors are their own.
