Machine Learning for Dummies
Ever wanted to learn some Machine Learning, but don't know where to start? Or maybe feeling too stupid for this? This text is for dummies just like you!
Becoming a Level 2 Dummy
I first came across a real application of Machine Learning at work. We were supposed to prepare an application that would recognize fraud in the Zooplus shop. After months of trying different solutions (external providers, additional if statements in the code, fire-fighting scripts and such), we concluded that Machine Learning was the best tool for the job. Since then, we have been trying to convince everyone around to invest in our education and pursue the Machine Learning path, but without any spectacular successes. Yet I had a chance to make my first step by playing a bit with Amazon's Machine Learning capabilities, so I consider myself a level 2 dummy. In this text, I'll try to show you, the level 1 dummies, how to make that first step and get a feeling of what Machine Learning really is.
What is Machine Learning?
There's probably a ton of definitions of Machine Learning around the Internet, but hey, we're level 1 dummies; we want something simple, something of dummy level! Let's try to work this out together.
The word Machine in the term probably means the computer. Well, we could think of robots, drones, and stuff, but they're steered by computers anyway, right? So it's about "Computer Learning".
Now, what can Learning really mean? The computer doesn't have a brain! No neurons to activate, no paths to create. All it can do is store some data and perform some simple operations on it. But we know that it's something connected with data, namely Big Data (at least that's where it shows up on DZone, right?). So we have "Computer doing stuff with Big Data".
What's doing stuff? Well, as a level 2 dummy, I can tell you something about it (although I bet real practitioners will see it as heresy). It's statistics using some advanced algorithms that level 1 and 2 dummies don't want to know about.
I think it's enough to form our final definition of Machine Learning for this text: Computer doing statistics on Big Data. Cool, huh?
What Can We Do With It?
I get it, I get it. So much text and you still don't know what the heck you can do with this whole Machine Learning stuff. Again, as a level 2 dummy, I had a chance to learn something 'bout it.
There are 2 kinds of Machine Learning: supervised and unsupervised.
I really wanted to put some analogy with supervising kids there, but it just doesn't click for me. Who invented that name?!
Supervised Machine Learning is when you tell the computer what it's supposed to look for and hand it past examples where the answer is already known. Remember the fraud recognition case from my work? That's supervised learning. I tell the computer: I want to know if this customer is a fraudster! And the computer does its advanced magic and spits out an answer: Yes master, he is! or No master, he's a moron, but a fair one. In general, supervised ML is used for so-called classification problems. You give the computer a lot of data and it classifies: Will America vote for Mr. Trump again? Will this guy get cancer? Will you keep reading this long, supposedly-funny text?
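If "the computer learns from examples with known answers" still sounds like magic, here is a tiny sketch of the idea in Python. It is not what Amazon does under the hood; it's a deliberately naive nearest-centroid classifier, and the customers, the two features (order value and account age in days) and the labels are all made up for illustration.

```python
from math import dist


def train(examples):
    """Average the feature vectors of each label. Returns {label: centroid}."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Pick the label whose centroid is closest to the new record."""
    return min(model, key=lambda label: dist(model[label], features))


# Made-up training data: (order value, account age in days) -> known answer.
examples = [
    ((900, 1), "fraud"), ((750, 3), "fraud"), ((1200, 2), "fraud"),
    ((40, 400), "fair"), ((60, 250), "fair"), ((25, 700), "fair"),
]
model = train(examples)
print(predict(model, (1000, 5)))   # a big order from a brand-new account
print(predict(model, (30, 365)))   # a small order from a year-old account
```

The "supervised" part is the second element of every example: somebody already knew the answer and wrote it down, and the model only generalizes from those answers.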
Unsupervised ML is when you have no idea what you're looking for. You're clueless. You tell the computer: Here are scalaloads of data! Find something interesting. And it runs even more advanced algorithms than in supervised learning.
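For contrast, here is what "find something interesting" can look like in its most primitive form: splitting numbers into two groups without anyone saying what the groups mean. This is a stripped-down, one-dimensional take on k-means clustering with two clusters, written for this post as an illustration; the order values are invented.

```python
def two_means(values, rounds=10):
    """Split numbers into a low and a high group around two moving centers."""
    low_center, high_center = min(values), max(values)
    low_group, high_group = list(values), []
    for _ in range(rounds):
        # Assign every value to the closer of the two centers.
        low_group = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high_group = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        if not high_group:  # all values identical, nothing to split
            break
        # Move each center to the average of its group.
        low_center = sum(low_group) / len(low_group)
        high_center = sum(high_group) / len(high_group)
    return low_group, high_group


# Made-up order values with no labels attached.
orders = [12, 15, 11, 14, 310, 295, 305]
small, large = two_means(orders)
print(small)
print(large)
```

Notice that nobody told the code which orders are "suspicious". It only reports that there seem to be two kinds of orders; deciding what that means is still your job.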
Since we're not clueless - we know exactly what we want (and we're less interested in the "even more advanced algorithms"), we're going to focus on supervised ML for the rest of this post.
Introducing Amazon ML
Not so long ago, it was really hard for a dummy like you and me to get a touch of Machine Learning. It was for the big brain nerds, who keep thinking about numbers and think Scala and Python are good programming languages. Luckily Amazon, the guys who got so much into selling that they started to sell their own infrastructure, have introduced a great tool for us: Amazon Machine Learning.
Creating a Data Source
We're over 600 words of text, so we'd better get straight to work. Open your AWS Management Console and find the "Machine Learning" button. Click it! You might see some entry screen that will offer you a tutorial or something. Just skip it. You don't need a dummy tutorial when you're in the middle of one! You should now see something like this:
So, the first step towards Computer doing statistics on Big Data is to provide the actual Big Data! Download the file using the link below and put it in an S3 bucket:
(Yeah, we're using the data provided by the AWS Docs Tutorial. It's just that this tutorial is much better!)
Once you have this, you can get back to the Machine Learning screen and choose "Create new..." and then "Datasource". You should see something like this:
Insert the S3 location and choose a Datasource name. The name doesn't matter (we're going to delete it anyway), so feel free to insult anyone you want with it. Once you're done with it, press "Verify" and then "Continue".
You should now see a Schema screen like this:
As you can see, Amazon tries to make sense out of this data by dividing it into some data types. Since it's their damn tutorial data, everything should have gone smoothly. You just need to click "Yes" to the question regarding column names and, if everything went fine, the column on the last page named "y" should be of type "Binary". If that's the case, press "Continue"; otherwise, I don't know - I'm just a level 2 dummy.
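To get a feeling for what "making sense out of this data" means, here is a rough sketch of schema guessing. Amazon ML really does use the data types Binary, Categorical, Numeric and Text, but the guessing rules below are my own simplification, not Amazon's actual logic, and the sample rows (including the "age" and "job" columns) are invented.

```python
import csv
import io


def guess_type(values):
    """Guess a column type from its string values (simplified, assumed rules)."""
    if set(values) <= {"0", "1"}:
        return "BINARY"
    try:
        for value in values:
            float(value)
        return "NUMERIC"
    except ValueError:
        return "CATEGORICAL"


def guess_schema(csv_text):
    """Treat the first row as column names and guess a type per column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return {name: guess_type([row[i] for row in data]) for i, name in enumerate(header)}


# Invented sample in the spirit of the tutorial file: the last column is "y".
sample = "age,job,y\n44,technician,0\n53,blue-collar,1\n28,management,0\n"
schema = guess_schema(sample)
print(schema)
```

The first row being used as column names is exactly what you confirm when you click "Yes" in the console.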
On the third page, Amazon finally asks us what we really want to get out of this magic. That's what's called "Target". Select the last column as in the screenshot below:
As you can see, Amazon recognized this as a binary classification problem, which means we're the supervisors now! Press "Continue".
Our data does not contain an identifier, so just press "Review" and then "Create Datasource". It will take a while until it's created. Once it's completed you should see something like this:
We're done with the Datasource! We have our Big Data in the system!
Creating an ML Model
A thousand words in and we're ready for the best part. We will create the "does statistics" part. The ML Model is the brain of our super cool Machine Learning solution. It's the mythical creature, created by Amazon based on our Big Data and settings, that will predict the value of column "y" for provided data. Let's get going!
Go back to the Machine Learning dashboard and, once again, choose "Create new..." and then "ML Model". Pick your newly created Datasource. You should see something like this:
Press "Continue" and then follow with "Review" and "Create ML Model". We don't want to change any advanced settings. Remember, we're level 1 and 2 dummies; we just want to see something working.
After a bit of time and F5'ing, we should see the screen of success (below). Our ML Model is created!
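If you'd rather not click, the same two steps (Datasource, then ML Model) can in principle be scripted. The sketch below only builds the request arguments and hands them to whatever client you pass in. The method and parameter names follow my reading of the boto3 "machinelearning" client, so double-check them against the docs; the IDs, names and S3 paths are hypothetical. Unlike the console, the API wants a schema file, which is assumed here to sit in S3 next to the data.

```python
# Hypothetical identifiers; pick your own.
DATA_SOURCE_ID = "ds-dummy"
ML_MODEL_ID = "ml-dummy"


def build_requests(data_s3_uri, schema_s3_uri):
    """Return (method name, keyword arguments) pairs in the order to call them."""
    data_source = {
        "DataSourceId": DATA_SOURCE_ID,
        "DataSourceName": "Dummy datasource",
        "DataSpec": {
            "DataLocationS3": data_s3_uri,
            "DataSchemaLocationS3": schema_s3_uri,
        },
        # Statistics are needed when the datasource is used for training.
        "ComputeStatistics": True,
    }
    ml_model = {
        "MLModelId": ML_MODEL_ID,
        "MLModelName": "Dummy model",
        "MLModelType": "BINARY",  # our target "y" is a yes/no column
        "TrainingDataSourceId": DATA_SOURCE_ID,
    }
    return [
        ("create_data_source_from_s3", data_source),
        ("create_ml_model", ml_model),
    ]


def submit(client, requests):
    """Call each named method on the client with its keyword arguments."""
    return [getattr(client, name)(**kwargs) for name, kwargs in requests]
```

With real credentials you would pass something like boto3.client("machinelearning") as the client, together with your own S3 paths. Both calls only start the work, so you'd still be waiting and refreshing, just from a script.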
Creating a Prediction
It would be a shame if we created the fantastic brain of our solution and did not predict anything. Choose "Try real-time predictions" from the left side of the ML Model success screen. Click the "Paste a record" button and paste in the following row:
That row is in the same format as our Big Data file, but it lacks the final column - "y". This is what our "does statistics" magic ML Model will predict. Once you're ready for a shock, press "Create prediction".
Yes, yes, yes! It worked! It predicted! If you've done everything I told you correctly, the right side of your prediction screen should look like this:
The "Predicted label" is the result of our prediction - a stunning 0! That's it!
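Where does that 0 come from? A binary model doesn't really answer yes or no. It produces a score between 0 and 1, and the label is decided by comparing the score with a cut-off, which in Amazon ML defaults to 0.5. A minimal sketch of that last step, with an invented score and my own assumption about what happens exactly at the cut-off:

```python
def label_from_score(score, threshold=0.5):
    """Turn a score into a binary label (assumed rule: at or above the cut-off means "1")."""
    return "1" if score >= threshold else "0"


# Made-up score, standing in for what the model returned for our record.
score = 0.03
print(label_from_score(score))        # prints 0
print(label_from_score(score, 0.01))  # prints 1
```

So the same record can flip from 0 to 1 if you move the cut-off. That is a knob worth knowing about once you start caring about false alarms versus missed cases.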
Make sure to delete the data from the S3 bucket, so that you're not charged for storage. You could also delete the Machine Learning stuff from your account, but it's up to you since it costs nothing.
We started by making a lousy definition of Machine Learning. Then, we learned what the difference is between supervised and unsupervised Machine Learning. Finally, we rushed through the interface of Amazon Machine Learning to create a simple prediction. You might be wondering right now: What did we just predict? What was the data that we put in there? What if it didn't work? For now, it doesn't matter. It was just an example. What matters now, my level 2 dummies, is: What do you want to predict? What data do you have that you could make use of? What can you do to make it work? I'll leave you some resources below and... good luck on the road to level 3!