
Scoring Machine Learning Models at Scale [Video]

This video shows a demo of MemSQL and Apache Spark for entity resolution and fraud detection across a dataset of a hundred thousand employees and fifty million customers.

At Strata+Hadoop World, MemSQL Software Engineer John Bowler presented two ways to build production data pipelines with MemSQL:

  1. Using Spark for general-purpose computation.

  2. Using a transform defined in a MemSQL Pipeline for general-purpose computation.
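
The second option can be pictured as a small filter program. The sketch below assumes, without confirmation from the talk, that a pipeline transform is an executable that reads raw records on standard input and writes rows on standard output. The JSON input shape, the `transform` function, and the name normalization are all hypothetical and only illustrate the idea.

```python
import io
import json


def transform(lines, out):
    """Read one JSON record per line and write one tab-separated row per record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Hypothetical derived column: lower-cased name with collapsed whitespace.
        name = " ".join(record["name"].lower().split())
        out.write(f"{record['id']}\t{name}\n")


# In-memory stand-ins for standard input and standard output.
sample = io.StringIO('{"id": 1, "name": "  John   BOWLER "}\n\n')
result = io.StringIO()
transform(sample, result)
print(result.getvalue(), end="")
```

In a real deployment the same function would be called with `sys.stdin` and `sys.stdout`; the exact contract between the pipeline and the transform should be checked against the MemSQL documentation.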

In the video below, John gives a live demonstration of MemSQL and Apache Spark for entity resolution and fraud detection across a dataset of a hundred thousand employees and fifty million customers. He writes a Spark job that uses Duke, an open-source entity resolution library, to sort through and score combinations of customer and employee records stored in MemSQL.
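
To make "scoring combinations" concrete, here is a minimal single-machine sketch of pairwise record scoring. It is not Duke and not a Spark job: it swaps in Python's standard `difflib.SequenceMatcher` as the similarity measure, and the field names, the equal weights, and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pair(customer, employee):
    """Combine per-field similarities into one score (equal weights are an assumption)."""
    name = similarity(customer["name"], employee["name"])
    address = similarity(customer["address"], employee["address"])
    return 0.5 * name + 0.5 * address


def flag_pairs(customers, employees, threshold=0.8):
    """Score every customer/employee combination and keep those at or above the threshold."""
    flagged = []
    for c, e in product(customers, employees):
        s = score_pair(c, e)
        if s >= threshold:
            flagged.append((c["id"], e["id"], s))
    # Highest-scoring pairs first.
    return sorted(flagged, key=lambda t: t[2], reverse=True)


customers = [
    {"id": "c1", "name": "Jon Smith", "address": "12 Main St"},
    {"id": "c2", "name": "Alice Wong", "address": "99 Oak Ave"},
]
employees = [
    {"id": "e1", "name": "John Smith", "address": "12 Main Street"},
]

flagged = flag_pairs(customers, employees)
for customer_id, employee_id, score in flagged:
    print(customer_id, employee_id, round(score, 3))
```

The nested comparison grows with the product of the two table sizes, which is why the demo distributes the work rather than running it in one process.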

MemSQL makes this possible through the MemSQL Spark Connector, which reduces network overhead, and through its native geospatial capabilities. John finds the top 10 million flagged customer and employee pairs across 5 trillion possible combinations in only three minutes. Finally, John uses MemSQL Pipelines and TensorFlow to write a machine learning Python script that accurately identifies thousands of handwritten digits after training the model in seconds.
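
The 5 trillion figure follows directly from the table sizes, and keeping only the best-scoring pairs is a standard top-k selection. The sketch below checks the arithmetic and shows top-k over a stream of scored pairs using Python's `heapq.nlargest`. The random scores and the `fake_scores` helper are placeholders, not data from the demo, and the throughput number is only an upper bound that assumes every combination is scored within the three minutes.

```python
import heapq
import random

EMPLOYEES = 100_000
CUSTOMERS = 50_000_000

# Every employee paired with every customer.
total_pairs = EMPLOYEES * CUSTOMERS

# Upper bound on scoring rate if all pairs were scored in three minutes.
pairs_per_second = total_pairs / (3 * 60)


def top_k(scored_pairs, k):
    """Return the k highest-scoring (customer, employee, score) tuples, best first."""
    return heapq.nlargest(k, scored_pairs, key=lambda p: p[2])


def fake_scores(n, seed=0):
    """Yield n placeholder scored pairs (random scores, purely illustrative)."""
    rng = random.Random(seed)
    for i in range(n):
        yield (f"c{i}", f"e{i % 7}", rng.random())


best = top_k(fake_scores(10_000), k=10)
print(f"{total_pairs:,} possible pairs")
print(f"about {pairs_per_second:.2e} pairs per second at most")
print(len(best), "pairs kept")
```

`heapq.nlargest` consumes the stream once and holds only `k` items at a time, so the full set of scored pairs never has to fit in memory.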

About the speaker: John Bowler is a Software Engineer at MemSQL with a background in machine learning, algorithms, and distributed data warehouses. He is a graduate of MIT and previously interned at SpaceX, where he helped write control algorithms for the SuperDraco rocket engine.

Topics:
big data, machine learning, memsql, apache spark

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
