
Conjecture: Scalable Machine Learning in Hadoop with Scalding


When it comes to predictive modeling and machine learning, the most visible product of the engineering work on the client side is the tailored ad: systems track your browsing behavior and serve content based on your inferred preferences. The same kind of modeling is particularly important on e-commerce platforms, where it drives features such as related-purchase recommendations.

A member of the Etsy engineering team shared some of the team's process in a post about scalable machine learning:

...we use predictive machine learning models to estimate click rates of items so that we can present high quality and relevant items to potential buyers on the site.  This estimation is particularly important when used for ranking our cost-per-click search ads, a substantial source of revenue. In addition to contributing to on-site experiences, we use machine learning as a component of many internal tools, such as routing and prioritizing our internal support e-mail queue.  By automatically categorizing and estimating an “urgency” for inbound support e-mails, we can assign support requests to the appropriate personnel and ensure that urgent requests are handled by staff more rapidly, helping to ensure a good customer experience.

Their framework, Conjecture, is built from three basic parts:

  1. Java classes which define the machine learning models and data types.

  2. Scala methods which perform MapReduce training using Scalding.

  3. PHP classes which use the produced models to make predictions in real-time on the web site.
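
To make the first part of the list above more concrete, here is a minimal sketch of what a Java model class for click-rate estimation could look like. It is not Conjecture's actual API: the class name `ClickRateModel`, its methods, the feature names, and the choice of logistic regression trained by stochastic gradient descent are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sparse logistic regression model for click-rate estimation.
 * Illustrative only; this is not a class from Conjecture.
 */
class ClickRateModel {
    // Feature name -> learned weight; features not present have weight 0.
    private final Map<String, Double> weights = new HashMap<>();
    private final double learningRate;

    ClickRateModel(double learningRate) {
        this.learningRate = learningRate;
    }

    /** Returns the estimated click probability for a sparse feature vector. */
    double predict(Map<String, Double> features) {
        double score = 0.0;
        for (Map.Entry<String, Double> e : features.entrySet()) {
            score += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return 1.0 / (1.0 + Math.exp(-score)); // logistic function
    }

    /** One SGD step on log loss; label is 1.0 (click) or 0.0 (no click). */
    void update(Map<String, Double> features, double label) {
        double error = label - predict(features);
        for (Map.Entry<String, Double> e : features.entrySet()) {
            weights.merge(e.getKey(), learningRate * error * e.getValue(), Double::sum);
        }
    }

    public static void main(String[] args) {
        // Two made-up training examples with hypothetical feature names.
        Map<String, Double> clicked = new HashMap<>();
        clicked.put("title_matches_query", 1.0);
        Map<String, Double> skipped = new HashMap<>();
        skipped.put("low_quality_photo", 1.0);

        ClickRateModel model = new ClickRateModel(0.5);
        for (int i = 0; i < 200; i++) {
            model.update(clicked, 1.0);
            model.update(skipped, 0.0);
        }
        System.out.println("clicked-like item: " + model.predict(clicked));
        System.out.println("skipped-like item: " + model.predict(skipped));
    }
}
```

In a setup like the one described, the training loop would run inside the MapReduce jobs rather than in a single process, and the learned weights would be exported for the web-facing code to use at prediction time. Keeping the model as a plain feature-to-weight map is one way to make that handoff simple.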

The modeling described in the rest of the Etsy article is only a small part of what Etsy does, both externally and internally, to make use of the large amount of data that passes through its systems every day.




