The Best of the Week (Jan. 17): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Jan. 17 to Jan. 23). Here they are, in order of popularity:

1. Apache Spark: The Next Big Data Thing?

Apache Spark is generating some buzz right now. Databricks, the company founded to support Spark, raised $14M from Andreessen Horowitz; Cloudera has decided to fully support Spark; and others say it’s the next big thing. So the author decided it was time to understand what the buzz is about.

2. Big Data Search, Part 1

The author got tired of the old questions he was asking candidates, so he decided to add a new one. Imagine a pretty trivial CSV file, and assume that it is a small sample of a CSV file that is 15 TB in size. The requirement is to be able to query that file.

3. Big Data Search, Part 3: Binary Search of Textual Data

The index the author created for his previous exercise is just a text file, sorted by the indexed key. That makes it easy for a human to search and work with, much easier than a binary file, and it also helps with debugging. However, it does make running a binary search on the data a bit harder.

4. How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2, Part 1

After spending some time playing around on a single-node pseudo-distributed cluster, it's time to get into real-world Hadoop. There are multiple ways to achieve this; the author covers how to set up a multi-node Hadoop cluster on Amazon EC2.

5. Data News: Data Mining Reveals Big Problems for MOOCs, and More

This installment of Arthur Charpentier's regular collection of data science-related links includes R as a second language, the problems of MOOCs exposed by data mining, and the reality of the computer code you see in movies.
