
The Big Data Zone - Best of the Week (Apr. 5-12)



In case you missed it, here is a curated list of the best posts from the past week of The Big Data Zone. This week: the pros and cons of the term "data science", 3 things about data science you won't find in the books, replacing data with measurements, running a PageRank Hadoop job on AWS Elastic MapReduce, and clinical trials and machine learning.

1. Pros and Cons of the Term "Data Science"

I’ve resisted using the term “data science,” and enjoy poking fun at it now and then, but I’ve decided it’s not such a bad label after all. Here are some of the pros and cons of the term.


2. Three Things About Data Science You Won't Find In the Books

Knowing how to evaluate properly can go a long way toward reducing the risk that the method won’t perform on future data. Getting the feature extraction right is maybe the most effective lever to pull to get good results. And finally, it doesn’t always have to be Big Data, although distributed computation can help to bring down training times.


3. Replace Data With Measurements

To tell whether a statement about data is over-hyped, see whether it retains its meaning if you replace data with measurements.


4. Running PageRank Hadoop job on AWS Elastic MapReduce

In a previous post I described an example of performing a PageRank calculation, part of the Mining Massive Datasets course, with Apache Hadoop. This post shows how to run that job on a real-life Hadoop cluster. The cluster is an AWS EMR cluster of 1 Master Node and 5 Core Nodes, each backed by an m3.xlarge instance.
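For readers who want a feel for the calculation itself, here is a minimal single-machine sketch of PageRank by power iteration in plain Python. It is not the Hadoop job from the linked post: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the tiny example graph are all assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    graph maps each node to a list of the nodes it links to.
    Assumption: every link target also appears as a key in graph.
    """
    nodes = list(graph)
    n = len(nodes)
    # Start from a uniform distribution over all nodes.
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by dead-end nodes (no outgoing links) is spread evenly.
        dangling = sum(ranks[node] for node in nodes if not graph[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {node: base for node in nodes}

        # Each node passes an equal share of its rank along every out-link.
        for node in nodes:
            out_links = graph[node]
            if out_links:
                share = damping * ranks[node] / len(out_links)
                for target in out_links:
                    new_ranks[target] += share
        ranks = new_ranks

    return ranks


# Hypothetical three-page graph, used only to exercise the function.
example_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
example_ranks = pagerank(example_graph)
for page in sorted(example_ranks, key=example_ranks.get, reverse=True):
    print(page, round(example_ranks[page], 4))
```

A MapReduce version splits the same update into a map step that emits each node's rank shares and a reduce step that sums them per target, which is what makes the calculation a natural fit for a cluster.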


5. Clinical Trials and Machine Learning

Arguments over the difference between statistics and machine learning are often pointless. However, there is one distinction that is helpful. Statistics aims to build accurate models of phenomena, implicitly leaving the exploitation of these models to others. Machine learning aims to solve problems more directly, and sees its models as intermediate artifacts; if an unrealistic model leads to good solutions, it’s good enough.



Topics: bigdata, big data, best of the week

