Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Jan. 17 to Jan. 23). Here they are, in order of popularity:
Apache Spark is generating some buzz right now. Databricks, the company founded to support Spark, raised $14M from Andreessen Horowitz; Cloudera has decided to fully support Spark; and others say it's the next big thing. So the author decided it was time to understand what the buzz is about.
Tired of the old questions he was asking candidates, the author decided to add a new one. Imagine a pretty trivial CSV file, but assume that it's a small sample of a CSV file that is 15 TB in size. The requirement is to be able to query that file.
The index the author created for his previous exercise is just a text file, sorted by the indexed key. That makes it easy for a human to search and work with, much easier than a binary file, and it also helps with debugging. However, it does make running a binary search on the data a bit harder.
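To illustrate why a text index complicates binary search, here is a minimal Python sketch; it is not the author's implementation. Lines in a text file vary in length, so a seek to the middle byte usually lands partway through a line, and the search has to skip ahead to the next line boundary before comparing keys. The `key,value` line format, the `lookup` name, and the helper functions are all assumptions made for this example. The sketch also assumes the file is sorted by the raw bytes of the key and contains no blank lines.

```python
import os


def _probe(f, pos):
    """Return the first full line starting after byte `pos` (or at 0 if pos == 0)."""
    f.seek(pos)
    if pos > 0:
        # Discard the rest of the line we landed in (a whole line if we
        # happened to land exactly on its first byte).
        f.readline()
    return f.readline()  # b"" at end of file


def _key(line):
    # Assumed line format: b"key,value\n"
    return line.rstrip(b"\r\n").partition(b",")[0]


def lookup(index_path, key):
    """Binary-search a key-sorted text index; return the value or None."""
    target = key.encode("utf-8")
    with open(index_path, "rb") as f:
        lo, hi = 0, os.path.getsize(index_path)
        # Find the smallest byte position whose probe yields a key >= target
        # (or end of file). The probe result is monotonic in the position
        # because the lines are sorted by key.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _probe(f, mid)
            if not line or _key(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        line = _probe(f, lo)
    if line and _key(line) == target:
        return line.rstrip(b"\r\n").partition(b",")[2].decode("utf-8")
    return None
```

Each probe reads at most two lines, so the number of reads grows with the logarithm of the file size rather than with the number of entries. A fixed-width binary index would avoid the line-boundary step, at the cost of the readability the text format provides.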
After spending some time playing around on a single-node pseudo-distributed cluster, it's time to get into real-world Hadoop. There are multiple ways to achieve this; the author covers how to set up a multi-node Hadoop cluster on Amazon EC2.
This installment of Arthur Charpentier's regular collection of data science-related links includes R as a second language, the problems of MOOCs exposed by data mining, and the reality of the computer code you see in movies.