
The Best of the Week (Jan. 17): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Jan. 17 to Jan. 23). Here they are, in order of popularity:

1. Apache Spark: The Next Big Data Thing?

Apache Spark is generating some buzz right now. Databricks, the company founded to support Spark, raised $14M from Andreessen Horowitz; Cloudera has decided to fully support Spark; and others say it’s the next big thing. So the author decided it was time to understand what the buzz is about.

2. Big Data Search, Part 1

The author got tired of the old questions he was asking candidates, so he decided to add a new one. Imagine a pretty trivial CSV file, but assume that it is a small sample of a CSV file that is 15 TB in size. The requirement is to be able to query that file.
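To make the problem concrete, here is a toy sketch of one way to make a large CSV file queryable: scan it once, record the byte offset of each row, and write a text index sorted by key. This is an illustration under stated assumptions, not necessarily the approach the linked article takes. The function names, the header row, and the naive comma splitting are all hypothetical, and the in-memory sort would have to become an external sort for a file anywhere near 15 TB.

```python
def build_index(csv_path, index_path, column):
    """Write a text index of "key,offset" lines, sorted by key.

    Assumptions (hypothetical, for illustration only): the CSV has a header
    row, fields contain no quoted commas, and the keys fit in memory.
    """
    entries = []
    with open(csv_path, "rb") as f:
        f.readline()  # skip the header row
        while True:
            offset = f.tell()  # byte offset where this row starts
            line = f.readline()
            if not line:
                break
            fields = line.rstrip(b"\r\n").split(b",")
            entries.append((fields[column], offset))
    entries.sort()  # in-memory sort; a real 15 TB file needs an external sort
    with open(index_path, "wb") as out:
        for key, offset in entries:
            out.write(key + b"," + str(offset).encode("ascii") + b"\n")


def read_row(csv_path, offset):
    """Fetch one row from the CSV by seeking straight to its byte offset."""
    with open(csv_path, "rb") as f:
        f.seek(offset)
        return f.readline().rstrip(b"\r\n").decode("utf-8")
```

With an index like this, answering a query becomes a lookup in the much smaller index followed by a single seek into the original file.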

3. Big Data Search, Part 3: Binary Search of Textual Data

The index the author created for his previous exercise is just a text file, sorted by the indexed key. That makes it easy for a human to search and work with, much easier than a binary file, and it also helps with debugging. However, it does make running a binary search on the data a bit harder.
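The difficulty is that lines in a text file vary in length, so a search cannot jump directly to the n-th record. A common workaround is to binary search over byte offsets and realign to the next line start after every seek. The sketch below shows that idea and is not taken from the linked article. It assumes each line looks like `key,value` and that the file is sorted by the raw bytes of the key. The names `search_sorted_text` and `_line_at_or_after` are hypothetical.

```python
import os


def _key_of(line):
    """Extract the key (text before the first comma) from a raw line."""
    return line.rstrip(b"\r\n").split(b",", 1)[0]


def _line_at_or_after(f, pos):
    """Return the first complete line that starts at byte offset >= pos."""
    if pos > 0:
        f.seek(pos - 1)
        f.readline()  # consume the rest of the line containing byte pos - 1
    else:
        f.seek(0)
    return f.readline()


def search_sorted_text(path, key):
    """Binary search a key-sorted text file; return the matching line or None.

    Assumes lines of the form "key,value" sorted by the bytes of the key.
    """
    target = key.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line has key >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if not line or _key_of(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        line = _line_at_or_after(f, lo)
        if line and _key_of(line) == target:
            return line.rstrip(b"\r\n").decode("utf-8")
        return None
```

Each probe costs one seek and at most two line reads, so a lookup touches only a logarithmic number of positions in the file, while the index stays readable in any text editor.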

4. How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2, Part 1

After spending some time playing around on a single-node, pseudo-distributed cluster, it's time to get into real-world Hadoop. It's important to note that there are multiple ways to achieve this; the author covers how to set up a multi-node Hadoop cluster on Amazon EC2.

5. Data News: Data Mining Reveals Big Problems for MOOCs, and More

This installment of Arthur Charpentier's regular collection of data science-related links includes R as a second language, the problems of MOOCs exposed by data mining, and the reality of the computer code you see in movies.

Opinions expressed by DZone contributors are their own.
