
The Big Data Zone - Best of the Week (Apr. 5-12)



In case you missed it, here is a curated list of the best posts from the past week in the Big Data Zone. This week: the pros and cons of the term "data science", three things about data science you won't find in the books, replacing data with measurements, running a PageRank Hadoop job on AWS Elastic MapReduce, and clinical trials and machine learning.

1. Pros and Cons of the Term "Data Science"

I’ve resisted using the term “data science,” and enjoy poking fun at it now and then, but I’ve decided it’s not such a bad label after all. Here are some of the pros and cons of the term.


2. Three Things About Data Science You Won't Find In the Books

Knowing how to evaluate a method properly goes a long way toward reducing the risk that it won’t perform on future data. Getting the feature extraction right is perhaps the most effective lever for getting good results. And finally, it doesn’t always have to be Big Data, although distributed computation can help bring down training times.
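
As an illustration of what evaluating "properly" can look like, here is a minimal sketch of k-fold cross-validation splitting in plain Python. The linked post may well use a different procedure; the function name `k_fold_indices` and its parameters are hypothetical and chosen only for this example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Hypothetical helper for illustration; not taken from the linked post.
    """
    indices = list(range(n_samples))
    # Shuffle once so that folds do not depend on the original data order.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(10, k=5):
        print(sorted(test), len(train))
```

Each sample lands in exactly one test fold, so every prediction that gets scored comes from a model that never saw that sample during training.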


3. Replace Data With Measurements

To tell whether a statement about data is over-hyped, see whether it retains its meaning if you replace data with measurements.


4. Running a PageRank Hadoop Job on AWS Elastic MapReduce

In a previous post I described an example PageRank calculation with Apache Hadoop, taken from the Mining Massive Datasets course. This post shows how to run that job on a real-life Hadoop cluster: an AWS EMR cluster of 1 Master Node and 5 Core Nodes, each backed by an m3.xlarge instance.
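
For readers unfamiliar with the calculation itself, here is a minimal single-machine sketch of PageRank by power iteration. It is not the Hadoop job from the post, which distributes the work across the cluster. The toy graph, the damping factor of 0.85, the iteration count, and the function name `pagerank` are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links maps each node to the list of nodes it links to.
    Illustrative sketch only; parameter defaults are assumptions.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread evenly.
        dangling = sum(ranks[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {node: base for node in nodes}
        # Each node passes an equal share of its rank along every out-link.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for node, rank in sorted(pagerank(toy_graph).items()):
        print(node, round(rank, 4))
```

In a MapReduce formulation, the inner loop that hands out shares roughly corresponds to the map step and the per-node summation to the reduce step, with one job run per iteration.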


5. Clinical Trials and Machine Learning

Arguments over the difference between statistics and machine learning are often pointless. However, there is one distinction that is helpful. Statistics aims to build accurate models of phenomena, implicitly leaving the exploitation of these models to others. Machine learning aims to solve problems more directly, and sees its models as intermediate artifacts; if an unrealistic model leads to good solutions, it’s good enough.



Topics:
bigdata, big data, best of the week

