The Big Data Zone: Best of the Week (Apr. 12-19)

In case you missed them, here are the best posts from this week's edition of The Big Data Zone (April 12th-17th), hand-picked by the zone's curator. This week: a subtle way to over-fit a new set of data, "on the other side of Big Data," streaming data with Apache Ignite, visualizing matrix multiplication as a linear combination, and growing a spam tree.

1. A Subtle Way to Over-Fit

If you train a model on a set of data, it should fit that data well. The hope, however, is that it will fit a new set of data well. So here’s what you could do.
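To make the gap between fitting the training data and fitting new data concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration and is not taken from the linked post: the toy data, the helper names (`fit_line`, `fit_memorizer`, `mse`), and the choice of a nearest-neighbour "memorizer" as the over-fit model.

```python
# Hypothetical toy data: y is roughly 2*x, with alternating noise of +1 / -1.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
# "New" data drawn from the same trend, with the noise pattern flipped.
new_x = [0.4, 1.4, 2.4, 3.4, 4.4, 5.4]
new_y = [-0.2, 3.8, 3.8, 7.8, 7.8, 11.8]


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line; returns a predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Predict the y of the nearest training x (memorizes the training set)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a predictor on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

line_train_mse = mse(line, train_x, train_y)
line_new_mse = mse(line, new_x, new_y)
memorizer_train_mse = mse(memorizer, train_x, train_y)
memorizer_new_mse = mse(memorizer, new_x, new_y)

print("line:      train", line_train_mse, "new", line_new_mse)
print("memorizer: train", memorizer_train_mse, "new", memorizer_new_mse)
```

On this toy data the memorizer reproduces the training set exactly yet does worse than the straight line on the new points, which is the pattern the teaser describes.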

2. On the Other Side of Big Data

We often discuss big data in the context of helping businesses improve their marketing efforts or cut back on expenses. In fact, we almost always discuss big data from a business point-of-view, rarely mentioning what it’s like on the other side, how it feels to be the audience in the era of big data analytics.

3. Streaming and Transforming Data with Apache Ignite

In its 1.0 release, Apache Ignite added much better streaming support, with the ability to perform various data transformations and to query the streamed data using standard SQL queries.

4. Visualizing Matrix Multiplication As a Linear Combination

When multiplying two matrices, there's a manual procedure we all know how to go through. While it's the easiest way to compute the result manually, it may obscure a very interesting property of the operation. In this quick post I want to show a colorful visualization that will make this easier to grasp.
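The property in question can also be checked numerically. The sketch below is an illustration, not code from the linked post: it assumes plain Python lists of rows as the matrix representation, and the function names are hypothetical. It builds each column of the product AB as a linear combination of the columns of A, weighted by the entries of the matching column of B, and compares the result with the usual row-times-column procedure.

```python
def matmul_row_times_column(A, B):
    """The familiar manual procedure: entry (i, j) is row i of A dot column j of B."""
    inner = len(B)
    cols = len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(A))
    ]


def column(M, j):
    """Return column j of matrix M as a list."""
    return [row[j] for row in M]


def combine_columns(A, weights):
    """Linear combination of the columns of A: sum over k of weights[k] * A[:, k]."""
    result = [0] * len(A)
    for k, w in enumerate(weights):
        col = column(A, k)
        result = [r + w * c for r, c in zip(result, col)]
    return result


def matmul_as_column_combinations(A, B):
    """Column j of AB is the columns of A combined with weights from column j of B."""
    product_columns = [combine_columns(A, column(B, j)) for j in range(len(B[0]))]
    # Transpose the list of columns back into a list of rows.
    return [list(row) for row in zip(*product_columns)]


A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]

by_rows = matmul_row_times_column(A, B)
by_columns = matmul_as_column_combinations(A, B)
print(by_rows)
print(by_columns)
```

Both functions compute the same product; the second one only reorders the arithmetic so that the column-combination structure is visible.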

5. Growing a Spam Tree

Consider the following toy dataset, with some spam/ham information, and two words, “viagra” and “lottery”. For the first node, compute the Gini index for each of the two variables. The Gini index is maximal for “viagra”, so that will be the first node.
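The choice of the first node can be sketched as follows. The dataset below is hypothetical (the post's actual toy data is not reproduced in this roundup), and the helper names are assumptions. The sketch also assumes the common convention of weighted Gini impurity, where the best split has the lowest impurity; a post that speaks of a maximal Gini index is presumably using the opposite sign or a gain formulation, so the score is negated here to make "best" mean "largest".

```python
# Hypothetical toy data: (contains "viagra", contains "lottery", label).
ROWS = [
    (1, 1, "spam"),
    (1, 0, "spam"),
    (1, 1, "spam"),
    (0, 1, "spam"),
    (0, 0, "ham"),
    (0, 1, "ham"),
    (0, 0, "ham"),
    (0, 0, "ham"),
]
FEATURES = {"viagra": 0, "lottery": 1}  # feature name -> column index


def gini_impurity(labels):
    """1 - sum of squared class proportions; 0 means the group is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in proportions)


def weighted_gini(rows, feature_index):
    """Size-weighted impurity of the two groups produced by a binary split."""
    total = len(rows)
    score = 0.0
    for value in (0, 1):
        group = [row[2] for row in rows if row[feature_index] == value]
        score += len(group) / total * gini_impurity(group)
    return score


impurity = {name: weighted_gini(ROWS, idx) for name, idx in FEATURES.items()}
# Negate so that the best split has the largest score (assumed sign convention).
best_feature = max(impurity, key=lambda name: -impurity[name])

print(impurity)
print("first node:", best_feature)
```

With this made-up data the split on “viagra” leaves purer groups than the split on “lottery”, so it is chosen as the first node; repeating the same computation inside each resulting group would grow the rest of the tree.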
