In case you missed it, here is a curated list of the best posts from the past week of The DevOps Zone. This week: the pros and cons of the term "data science", 3 things about data science you won't find in the textbooks, replacing data with measurements, running a PageRank Hadoop job on AWS Elastic MapReduce, and clinical trials in machine learning.
I’ve resisted using the term “data science,” and enjoy poking fun at it now and then, but I’ve decided it’s not such a bad label after all. Here are some of the pros and cons of the term.
Knowing how to evaluate properly can go a long way toward reducing the risk that a method won’t perform on future data. Getting the feature extraction right is perhaps the most effective lever for getting good results. Finally, it doesn’t always have to be Big Data, although distributed computation can help bring down training times.
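As a rough illustration of what evaluating properly can look like, here is a minimal sketch of k-fold cross-validation splits in plain Python. The function name `k_fold_indices` and its parameters are made up for this example and are not taken from the linked post; the idea is only that every sample is held out exactly once, so the model is always scored on data it did not see during training.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; not from the original post.
    """
    indices = list(range(n_samples))
    # Shuffle with a fixed seed so the splits are reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Training set is everything that is not in the held-out fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), "train /", len(test_idx), "test")
```

In practice you would train on each `train_idx`, score on the matching `test_idx`, and average the scores; a library implementation would normally replace this hand-rolled helper.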
To tell whether a statement about data is over-hyped, see whether it retains its meaning if you replace data with measurements.
In a previous post I described an example of performing a PageRank calculation, which is part of the Mining Massive Datasets course, with Apache Hadoop. This post shows how to run that job on a real-life Hadoop cluster. The cluster is an AWS EMR cluster of 1 Master Node and 5 Core Nodes, each backed by an m3.xlarge instance.
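For readers who have not seen the calculation itself, the following is a small single-machine sketch of PageRank by power iteration with a teleport (damping) factor. It is not the Hadoop job from the linked post: the function name `pagerank`, the `beta` value of 0.85, the iteration count, and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each node to the list of nodes it links to.
    beta is the probability of following a link rather than teleporting.
    Illustrative sketch only; not the MapReduce job from the post.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts with the teleport share.
        new_rank = {node: (1.0 - beta) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among its out-links.
                share = beta * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dead end: spread its rank evenly over all nodes.
                share = beta * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph (made up): A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for node, score in sorted(ranks.items()):
    print(node, round(score, 4))
```

A MapReduce version distributes the same idea: the map phase emits each node's rank shares to its link targets, and the reduce phase sums the shares per target, with one job run per iteration.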
Arguments over the difference between statistics and machine learning are often pointless. However, there is one distinction that is helpful. Statistics aims to build accurate models of phenomena, implicitly leaving the exploitation of these models to others. Machine learning aims to solve problems more directly, and sees its models as intermediate artifacts; if an unrealistic model leads to good solutions, it’s good enough.