The Big Data Zone: Best of the Week (Apr. 12-19)

In case you missed them, here are the best posts from this week's edition of The Big Data Zone (April 12th-17th), hand-picked by the zone's curator. This week: a subtle way to over-fit a new set of data, "on the other side of Big Data," streaming data with Apache Ignite, visualizing matrix multiplication as a linear combination, and growing a spam tree.

1. A Subtle Way to Over-Fit

If you train a model on a set of data, it should fit that data well. The hope, however, is that it will fit a new set of data well. So here’s what you could do.

2. On the Other Side of Big Data

We often discuss big data in the context of helping businesses improve their marketing efforts or cut back on expenses. In fact, we almost always discuss big data from a business point-of-view, rarely mentioning what it’s like on the other side, how it feels to be the audience in the era of big data analytics.

3. Streaming and Transforming Data with Apache Ignite

In its 1.0 release, Apache Ignite added much better streaming support, with the ability to perform various data transformations as well as query the streamed data using standard SQL queries.

4. Visualizing Matrix Multiplication As a Linear Combination

When multiplying two matrices, there's a manual procedure we all know how to go through. While it's the easiest way to compute the result manually, it may obscure a very interesting property of the operation. In this quick post I want to show a colorful visualization that will make this easier to grasp.
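To make the property concrete, here is a minimal Python sketch (not taken from the linked post) that builds each column of a product as a linear combination of the columns of the left matrix and checks it against the usual row-times-column procedure. The function names and the example matrices `A` and `B` are illustrative assumptions.

```python
def matmul(A, B):
    """Textbook product: entry (i, j) is row i of A times column j of B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def column(M, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in M]


def matmul_by_columns(A, B):
    """Build each column of the product as a linear combination of A's columns.

    Column j of the result is sum over k of B[k][j] * (column k of A).
    """
    n_rows, n_inner, n_cols = len(A), len(B), len(B[0])
    result_columns = []
    for j in range(n_cols):
        combo = [0] * n_rows
        for k in range(n_inner):
            weight = B[k][j]          # coefficient taken from column j of B
            a_col = column(A, k)      # the column of A being scaled
            combo = [c + weight * a for c, a in zip(combo, a_col)]
        result_columns.append(combo)
    # The columns were built one at a time; transpose back to a list of rows.
    return [list(row) for row in zip(*result_columns)]


# Illustrative matrices (assumed for this example).
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]

print(matmul_by_columns(A, B))
```

Both functions compute the same product; the second only reorders the arithmetic so that the "weighted sum of columns" structure is visible.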

5. Growing a Spam Tree

Consider the following toy dataset, with some spam/ham information, and two words, “viagra” and “lottery”. For the first node, compute the Gini index for the two variables. The Gini index is maximal for “viagra”, so that will be the first node.
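The toy dataset itself is not reproduced in this summary, so the sketch below uses a small made-up one purely for illustration. It also assumes one common convention: each candidate word is scored by the reduction in Gini impurity it produces, so the best first split is the one with the largest score. The linked post may use a different but equivalent formulation.

```python
def gini_impurity(labels):
    """Gini impurity of a list of 0/1 labels (1 = spam, 0 = ham)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - (p ** 2 + (1 - p) ** 2)


def gini_gain(rows, feature_index):
    """Impurity reduction from splitting on a binary feature.

    Each row is (viagra, lottery, spam); the label is the last element.
    """
    labels = [r[-1] for r in rows]
    absent = [r[-1] for r in rows if r[feature_index] == 0]
    present = [r[-1] for r in rows if r[feature_index] == 1]
    weighted = (len(absent) * gini_impurity(absent)
                + len(present) * gini_impurity(present)) / len(rows)
    return gini_impurity(labels) - weighted


# Hypothetical data, NOT the dataset from the post: (viagra, lottery, spam).
rows = [
    (1, 0, 1),
    (1, 1, 1),
    (1, 0, 1),
    (0, 1, 1),
    (0, 1, 0),
    (0, 0, 0),
    (0, 0, 0),
    (0, 1, 0),
]

features = {"viagra": 0, "lottery": 1}
gains = {name: gini_gain(rows, idx) for name, idx in features.items()}
best = max(gains, key=gains.get)  # word chosen for the first node
print(gains, best)
```

On this made-up data, every message containing “viagra” is spam, so splitting on it removes more impurity than splitting on “lottery”, and it becomes the first node.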

Opinions expressed by DZone contributors are their own.
