# How to Master the Basics and Transform your Dataset

### A fully documented walkthrough of coding machine learning in Spark 2.0 in Java, with full source code and annotated descriptions.

You might be familiar with various number puzzles on LinkedIn. Although some might complain about how they disrupt their LinkedIn news feed (e.g. “This isn’t Facebook!”), the puzzles are designed to trigger your intelligence or challenge your neurons.

Let’s look at the puzzle in the featured image of this post.

5, 15, 25, 35, 45, 55…

What comes after 55? Instantly, you want to say 65, right? But do you know how you got there? Did you add 10 to the last one?

What if the numbers were 5, 13, 27, 39, 41, and 55… Would you still say 65? You can still add 10 to 55, but where does this 10 come from?

## Linear Regression and Machine Learning Theory

If you have browsed the opening chapters of Fundamentals of Deep Learning by Nikhil Buduma, you’ve probably noticed the algorithmic complexities that are involved in machine learning. But puzzles like our sample above can be distilled to the bare basics to help you understand these algorithms.

The mathematical concept behind our sample puzzle is **linear regression**, a statistical technique that relies on linear algebra. It is one of the very first exercises you go through when you learn about **machine learning**. If you take a piece of graph paper and start plotting, you will get something like the following diagram:

The 7th element in the series can be 65 and you get a sense that the 8th might be 75.

As with any new concept, you acquire some new vocabulary along the way. The elements on our x-axis (1, 2, 3, … 8) are called **features**, while the values (5, 15, …) are called **labels**.

On my second series, I get the following graph:

The idea is to draw a straight line that is the closest to all points. The line is then expressed as an equation:

*y* = β1 *x* + β0

β1 (the **regression parameter**, that is, the slope of the line) and β0 (the **intercept**) can be (easily) calculated if you like linear algebra, or you can use tools to do it. In our first example, the equation is simply y = 10x − 5. In our second example, the equation is y = 9.8857x − 4.6. So there **is a difference**. When x is 7, we get 65 with the first equation, but 9.8857 × 7 − 4.6 ≈ 64.6 with the second. Close enough?
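If you would rather let a tool do the calculation, the closed-form least-squares formulas fit in a few lines of plain Java. This is a minimal sketch, not the code from the GitHub repository: the class name `LeastSquares` and its methods are made up for illustration, and it assumes the features are simply the positions 1 to 6 in each series.

```java
final class LeastSquares {

    /**
     * Fits y = slope * x + intercept by ordinary least squares.
     * Returns {slope (β1), intercept (β0)}.
     */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }

        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {slope, intercept};
    }

    /** Evaluates the fitted line at x. */
    static double predict(double[] line, double x) {
        return line[0] * x + line[1];
    }

    public static void main(String[] args) {
        double[] features = {1, 2, 3, 4, 5, 6};
        double[][] series = {
            {5, 15, 25, 35, 45, 55}, // first puzzle
            {5, 13, 27, 39, 41, 55}  // second puzzle
        };
        for (double[] labels : series) {
            double[] line = fit(features, labels);
            System.out.printf("slope=%.4f intercept=%.4f prediction for x=7: %.1f%n",
                    line[0], line[1], predict(line, 7));
        }
    }
}
```

Running `main` should reproduce the two lines quoted above: a slope of 10 and an intercept of -5 for the first series, and roughly 9.8857 and -4.6 for the second.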

So far so good? Okay. I loved math during my high school and college years, but I must admit that it is not my biggest passion anymore. Let’s code!

## A Coding Exercise in Machine Learning

I will use Java and Apache Spark ML 2.0.0. Java is probably the most used development language in enterprises and Spark is a wonderful analytics package. ML is the Machine Learning library – yes, they lacked inspiration the day they had to find a name. There are two ways to get the code: download it from GitHub, or type/copy it all from here.

You can download all the code examples from GitHub: https://github.com/jgperrin/net.jgp.labs.spark. There will be a few dependencies.

There are quite a few imports, even for a small example. I left them here as I do not want you to be confused by similar names in different packages (Vector is one example).

Then we have a basic main() that will instantiate the class and start() it.

Spark 2.0.0 enforces the use of a Spark session. In prior versions, it was a little confusing because you might have needed several context and configuration objects (for example SparkConf, JavaSparkContext, and SQLContext).

We need a UDF (User Defined Function) that transforms our input into a format that can be used by Spark.

Our data is in tuple-data-file.csv. Actually, our first set is in tuple-data-file-set1.csv and the second is in guess-what-file.csv? No, they are in tuple-data-file-set2.csv, but I wanted to check if you were following.

In this situation, we need to force the structure of our data because Spark needs some guidance on the metadata of our data.

In Spark 2.0.0, in a Java context, our beloved dataframe is implemented as a Dataset<Row>.
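The steps above (session, UDF, forced schema, and loading into a `Dataset<Row>`) can be sketched as follows. This is an illustration under stated assumptions rather than the code from the GitHub repository: it assumes the Spark 2.x SQL and ML artifacts are on the classpath, that each CSV line holds a feature followed by a label, and that the data can be written to a temporary file for the demo. The class name `CsvToFeaturesSketch`, the UDF name `toVector`, and the column names `feature` and `label` are all made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

final class CsvToFeaturesSketch {

    /** Loads a two-column CSV and adds a "features" vector column. */
    static Dataset<Row> load(SparkSession spark, String path) {
        // UDF that wraps a single double into a one-element ML vector.
        spark.udf().register("toVector",
                (UDF1<Double, Vector>) value -> Vectors.dense(value),
                new VectorUDT());

        // Force the structure: assumed layout is "feature,label" on each line.
        StructType schema = new StructType(new StructField[] {
            new StructField("feature", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("label", DataTypes.DoubleType, false, Metadata.empty())
        });

        Dataset<Row> df = spark.read().format("csv")
                .schema(schema)
                .option("header", "false")
                .load(path);

        // Build the vector column that Spark ML expects.
        return df.withColumn("features", functions.callUDF("toVector", df.col("feature")));
    }

    /** Writes the first series to a temporary file, loads it, returns the column names. */
    static String[] run() throws Exception {
        Path csv = Files.createTempFile("tuple-data-sketch", ".csv");
        Files.write(csv,
                Arrays.asList("1,5", "2,15", "3,25", "4,35", "5,45", "6,55"),
                StandardCharsets.UTF_8);

        SparkSession spark = SparkSession.builder()
                .appName("CSV to features sketch")
                .master("local[*]")
                .getOrCreate();
        try {
            Dataset<Row> df = load(spark, csv.toString());
            df.show();
            return df.columns();
        } finally {
            spark.stop();
            Files.deleteIfExists(csv);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(Arrays.toString(run()));
    }
}
```

The lambda-based UDF needs Java 8. The resulting dataframe keeps the two raw columns and gains a `features` column of vector type.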

As you can see, we transformed our dataframe to create a label and features. More precisely, each label is now associated with a vector of features.

We are now ready to build our linear regression. We will limit it to 20 iterations.

We fit our linear regression to our dataframe, which gives us a model.

And now we can try it on the 7th element, whose feature is 7. We can create a vector and predict from it.
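The regression steps just described can be sketched as a single small class. Again, this is a hedged illustration and not the repository code: it assumes the Spark 2.x SQL and ML artifacts are on the classpath, it builds the dataframe in memory instead of reading the CSV file, and the class name `LinearRegressionSketch` and its methods are invented for the example. It relies on the default column names `label` and `features` that Spark ML's `LinearRegression` looks for.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

final class LinearRegressionSketch {

    /** Fits a line through (1, labels[0]), (2, labels[1]), ... and predicts the next value. */
    static double predictNext(SparkSession spark, double[] labels) {
        // One row per point: the label and a one-element feature vector.
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            rows.add(RowFactory.create(labels[i], Vectors.dense(i + 1.0)));
        }
        StructType schema = new StructType(new StructField[] {
            new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("features", new VectorUDT(), false, Metadata.empty())
        });
        Dataset<Row> df = spark.createDataFrame(rows, schema);

        // Build the regression, limited to 20 iterations, and fit it to the data.
        LinearRegression lr = new LinearRegression().setMaxIter(20);
        LinearRegressionModel model = lr.fit(df);
        System.out.println("coefficients: " + model.coefficients()
                + ", intercept: " + model.intercept());

        // Assumption: predict(Vector) is callable from Java on this model;
        // calling model.transform(...) on a Dataset is the alternative.
        return model.predict(Vectors.dense(labels.length + 1.0));
    }

    /** Creates a local session, runs the prediction, and stops the session. */
    static double run(double[] labels) {
        SparkSession spark = SparkSession.builder()
                .appName("Linear regression sketch")
                .master("local[*]")
                .getOrCreate();
        try {
            return predictNext(spark, labels);
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println("first series, x=7: " + run(new double[] {5, 15, 25, 35, 45, 55}));
        System.out.println("second series, x=7: " + run(new double[] {5, 13, 27, 39, 41, 55}));
    }
}
```

With no regularization configured, the predictions should land close to the values computed by hand earlier, about 65 for the first series and about 64.6 for the second.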

In the code on GitHub, you’ll have a lot more statistical information displayed. After executing, we should get the following information on the first dataset:

And the following information on the second data set:

This is a very basic example with a very limited dataset. Now that you have worked through it, you can proudly check the box for acquiring “knowledge of machine learning”.

There are many other technologies and data science methods that you can explore to get a better understanding of how data is changing the way machines ‘learn’. For additional thoughts, check out our video series Big Data Think Tank or topics in Data Science.

Published at DZone with permission of Jean Georges Perrin, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
