In case you missed them, here are the best posts from this week's edition of The Big Data Zone (April 12th-17th), hand-picked by the zone's curator. This week: a subtle way to over-fit a new set of data, "on the other side of Big Data," streaming data with Apache Ignite, visualizing matrix multiplication as a linear combination, and growing a spam tree.
If you train a model on a set of data, it should fit that data well. The hope, however, is that it will fit a new set of data well. So here’s what you could do.
We often discuss big data in the context of helping businesses improve their marketing efforts or cut back on expenses. In fact, we almost always discuss big data from a business point of view, rarely mentioning what it’s like on the other side: how it feels to be the audience in the era of big data analytics.
In its 1.0 release, Apache Ignite added much better streaming support, with the ability to perform various data transformations and to query the streamed data using standard SQL.
When multiplying two matrices, there's a manual procedure we all know how to go through. While it's the easiest way to compute the result manually, it may obscure a very interesting property of the operation. In this quick post I want to show a colorful visualization that will make this easier to grasp.
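As a rough illustration of the linear-combination view, here is a minimal pure-Python sketch. It assumes the property in question is the common one where each column of the product A·B is a weighted sum of the columns of A, with the weights taken from the matching column of B. The matrices and the function names `matmul` and `column_as_combination` are made up for this example and do not come from the post.

```python
def matmul(a, b):
    """Row-by-column product: the usual manual procedure."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_as_combination(a, b, j):
    """Build column j of A*B as a linear combination of the columns of A.

    The weight applied to column k of A is the entry b[k][j].
    """
    result = [0] * len(a)
    for k in range(len(b)):
        weight = b[k][j]
        for i in range(len(a)):
            result[i] += weight * a[i][k]
    return result


# Hypothetical example matrices (3x2 times 2x2).
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]

product = matmul(A, B)
columns = [column_as_combination(A, B, j) for j in range(len(B[0]))]

# Transposing the list of columns should give back the same product.
rebuilt = [list(row) for row in zip(*columns)]
print(product == rebuilt)
```

Both routes give the same matrix; the second one just makes the "mix the columns of A" structure explicit. The same idea applies to rows: each row of the product is a combination of the rows of B.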
Consider the following toy dataset, with some spam/ham information, and two words, “viagra” and “lottery”. For the first node, compute the Gini index for the two variables. The Gini index is maximal for “viagra”, so that will be the first node.
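The choice of the first node can be sketched in a few lines of Python. Two loud assumptions: the eight rows below are invented for illustration (the post's actual toy dataset is not reproduced here), and the criterion being maximized is taken to be the decrease in Gini impurity produced by the split, which may differ from the sign convention the post uses. The names `gini_impurity` and `gini_gain` are hypothetical helpers.

```python
def gini_impurity(labels):
    """Gini impurity of a list of 0/1 labels: 1 - p^2 - (1 - p)^2."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def gini_gain(rows, feature):
    """Decrease in impurity from splitting the rows on a binary feature."""
    parent = [r["spam"] for r in rows]
    absent = [r["spam"] for r in rows if r[feature] == 0]
    present = [r["spam"] for r in rows if r[feature] == 1]
    n = len(rows)
    weighted = (
        len(absent) / n * gini_impurity(absent)
        + len(present) / n * gini_impurity(present)
    )
    return gini_impurity(parent) - weighted


# Hypothetical toy data: spam = 1 for spam, 0 for ham; the word columns
# say whether the word appears in the message.
rows = [
    {"spam": 1, "viagra": 1, "lottery": 0},
    {"spam": 1, "viagra": 1, "lottery": 1},
    {"spam": 1, "viagra": 1, "lottery": 1},
    {"spam": 1, "viagra": 0, "lottery": 1},
    {"spam": 0, "viagra": 0, "lottery": 0},
    {"spam": 0, "viagra": 0, "lottery": 0},
    {"spam": 0, "viagra": 0, "lottery": 1},
    {"spam": 0, "viagra": 0, "lottery": 0},
]

gains = {word: gini_gain(rows, word) for word in ("viagra", "lottery")}
best = max(gains, key=gains.get)
print(best)  # the word chosen for the first node on this made-up data
```

On this invented data, splitting on “viagra” separates spam from ham more cleanly than splitting on “lottery”, so it gets the larger gain and becomes the root. Growing the rest of the tree repeats the same comparison inside each branch.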