
Machine Learning Requires Big Data


Machine learning requires big data to work. Without large, well-maintained training sets, machine learning algorithms fall far short of their potential.



During the Deep Learning Summit at AWS re:Invent 2017, Terrence Sejnowski (a pioneer of deep learning) succinctly said, "Whoever has more data wins."

He was echoing a premise that has been repeated many times in many ways by many people: machine learning requires big data to work. Without large, well-maintained training sets, machine learning algorithms — especially deep learning algorithms — fall far short of their potential. That's why here at Qubole we believe that enabling data scientists starts with giving them a platform to quickly select, clean, and aggregate datasets on a massive scale.
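As a rough sketch of that select, clean, and aggregate step (hypothetical table and column names, and plain PySpark rather than anything Qubole-specific):

```python
# A minimal PySpark sketch of preparing a training set at scale.
# The "events" table and its columns are hypothetical; this uses only
# the generic Spark DataFrame API, not a Qubole-specific interface.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("training-set-prep").getOrCreate()

# Select: read only the columns the model needs from a large source table.
events = spark.table("events").select("user_id", "event_type", "value", "ts")

# Clean: drop malformed rows and obvious bad values before they reach training.
clean = (events
         .dropna(subset=["user_id", "value"])
         .filter(F.col("value") >= 0))

# Aggregate: roll raw events up into one example per user.
training_set = (clean
                .groupBy("user_id")
                .agg(F.count("*").alias("event_count"),
                     F.avg("value").alias("avg_value"),
                     F.max("ts").alias("last_seen")))

# Persist the prepared training set for the modeling stage.
training_set.write.mode("overwrite").parquet("s3://bucket/training_set/")
```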

The recent surge in impactful applications of deep learning has misled many people into believing there has been a corresponding groundswell of algorithmic innovation. Although genuinely new bleeding-edge algorithms are being released (most recently, Geoffrey Hinton's milestone capsule networks), most of the deep learning algorithms behind today's innovative technologies are actually decades old. What's truly driving these new applications of artificial intelligence and machine learning isn't new algorithms, but bigger data. Thanks to the decades of hardware gains that Moore's Law predicted, data scientists now have the compute and storage capacity to begin leveraging the massive amounts of data being collected.

The view that machine learning systems rely heavily on non-machine-learning tooling is captured by a simple diagram from the seminal paper, Hidden Technical Debt in Machine Learning Systems:

[Figure from the paper: a small "ML Code" box dwarfed by the surrounding boxes for configuration, data collection, feature extraction, data verification, resource management, analysis tools, serving infrastructure, and monitoring.]

In that diagram, the Google engineers who wrote the paper illustrate that the actual machine learning code makes up only a tiny portion of the overall system required to support ML algorithms at scale. Without all of the other parts of the system, a standalone block of ML code would be all but useless.

Qubole's holistic view of the data science workflow has enabled us to take advantage of cutting-edge tools for data preparation. Whether it's using Hive for simple SQL data selections and aggregations or Spark for feature extraction and engineering, Qubole Data Service provides tools that a data scientist needs in order to successfully leverage their big data for big results.
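As a concrete sketch of both paths (hypothetical table, column, and feature names; standard Spark and Spark ML APIs, not a Qubole-specific interface):

```python
# A hedged sketch of the two workflows mentioned above: a Hive-style SQL
# aggregation, followed by Spark feature engineering on the result.
# All table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hive-style SQL: select and aggregate raw data down to per-user rows.
per_user = spark.sql("""
    SELECT user_id,
           COUNT(*)   AS event_count,
           AVG(value) AS avg_value
    FROM events
    GROUP BY user_id
""")

# Spark feature engineering: assemble and scale numeric columns into the
# feature vector a downstream ML algorithm expects.
assembler = VectorAssembler(inputCols=["event_count", "avg_value"],
                            outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")

assembled = assembler.transform(per_user)
features = scaler.fit(assembled).transform(assembled)
features.select("user_id", "features").show(5)
```

The same SQL would run unmodified against Hive; issuing it through Spark here simply keeps the aggregation and the feature engineering in a single job.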

You can sign up for the free QDS Business Edition here.

(Qubole offers Qubole Data Service (QDS) Business Edition at no cost, but usage is capped at a set number of Qubole compute hours per month, approximately a $1,000/month value. You must provide your own cloud account, and you are responsible for the infrastructure costs Qubole manages on your behalf.)



