The Best of the Week (Nov. 22): Big Data Zone
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Nov. 22 to Nov. 28). Here they are, in order of popularity:
Nowadays, Python is probably the programming language of choice (besides R) among data scientists for prototyping, visualization, and running data analyses on small- and medium-sized data sets. And rightly so, I think, given the large number of available tools. However, it wasn’t always like this.
Impala uses Hadoop as a storage engine, but moves away from MapReduce algorithms toward distributed queries. Also, R can be integrated with Impala to provide fast, interactive queries running on top of Hadoop data sets. The data can then be further processed or visualized within R.
This set of slides presents an introduction to machine learning with R. It covers the strong points of R as a language and the basic concepts and uses of machine learning, and it provides an overview of each, complete with code samples in R and images of the visualized data.
This installment of Arthur Charpentier's regular collection of data science-related links includes a free e-book on "Applied Epidemiology Using R," an argument that statistics is the least important part of data science, and what every programmer should know about memory.
This recent tutorial demonstrates how to use non-Java languages (R, in particular) to work with Hadoop data through MapReduce and Hive. Though the tutorial focuses on R, it is also meant to open doors for users working with other languages and tools, such as Python, Ruby, Linux commands, or shell scripts.
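For readers curious what a non-Java MapReduce job can look like, here is a minimal Python sketch in the style of Hadoop Streaming, which passes records to external programs as lines of text with tab-separated keys and values. It is not taken from the tutorial, which works in R and may use a different setup; the word-count task and the function names `mapper`, `reducer`, and `run_locally` are assumptions chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(records):
    """Sum the counts for each key.

    Assumes records arrive sorted by key, which is what the shuffle
    phase guarantees between the map and reduce steps.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(pairs, key=itemgetter(0)):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process (hypothetical test helper)."""
    return list(reducer(sorted(mapper(lines))))


for record in run_locally(["big data", "Big hadoop data"]):
    print(record)
```

In a real streaming job, the mapper and reducer would each read from standard input and write to standard output as separate scripts; the local helper only mimics the sort that Hadoop performs between the two stages.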