The Best of the Week (Jan. 17): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Jan. 17 to Jan. 23). Here they are, in order of popularity:

1. Apache Spark: The Next Big Data Thing?

Apache Spark is generating some buzz right now. Databricks, the company founded to support Spark, raised $14M from Andreessen Horowitz; Cloudera has decided to fully support Spark; and others say it's the next big thing. So the author thought it was time to get an understanding of what the buzz is about.

2. Big Data Search, Part 1

The author got tired of the old questions he was asking candidates, so he decided to add a new one. Imagine a pretty trivial CSV file, but assume that it is a small sample of a CSV file that is 15 TB in size. The requirement is to be able to query that file.

3. Big Data Search, Part 3: Binary Search of Textual Data

The index the author created for his previous exercise is just a text file, sorted by the indexed key. That makes it easy for a human to search and work with, much easier than a binary file, and it also helps with debugging. However, it does make running a binary search on the data a bit harder.

4. How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2, Part 1

After spending some time playing around on a single-node pseudo-distributed cluster, it's time to get into real-world Hadoop. There are multiple ways to achieve this; the author covers how to set up a multi-node Hadoop cluster on Amazon EC2.

5. Data News: Data Mining Reveals Big Problems for MOOCs, and More

This installment of Arthur Charpentier's regular collection of data science-related links includes R as a second language, the problems of MOOCs exposed by data mining, and the reality of the computer code you see in movies.

Opinions expressed by DZone contributors are their own.
