
Machine Learning Is Easy

If you're a beginner who wants to learn why machine learning is needed at all and why it's gaining popularity, this article will answer your questions.

In this article, we will talk about machine learning in general and its interaction with data warehouses. If you are a beginner who doesn't know where to start and you want to know why machine learning is needed at all and why it has recently been gaining popularity, I'll try to answer your questions. We will use Python 3, since it is a fairly simple language in which to learn machine learning.

Who Is This Article For?

Anyone who enjoys digging into history in search of new facts, or who has at least once wondered how machine learning actually works, will find an answer here. An experienced machine learning practitioner will most likely find nothing new, since the programming part leaves much to be desired: it is deliberately simplified so that beginners can follow it. Still, reading about the origins of machine learning and its development as a whole will not hurt anyone.

The Numbers

Every year, both companies and enthusiasts have a growing need to study large volumes of data. At large companies such as Yandex and Google, data analysis tools such as the R programming language and Python libraries are increasingly being used (in this article, I give examples written in Python 3).

[Image: Gordon Moore]

According to Moore's law (that is Moore himself in the picture above), the number of transistors on an integrated circuit doubles every 24 months. This means that the performance of our computers grows every year, so previously inaccessible boundaries of knowledge keep being "shifted to the right," opening up space for studying large data. This is primarily thanks to the emergence of "big data science," which became possible mainly through the application of machine learning algorithms that were described long ago but could only be tested half a century later. Who knows? Maybe in a few years, we will be able to describe the various forms of fluid motion with absolute accuracy, for example.

So... Data Analysis Is Easy?

Yes. Alongside the special importance of big data for all of mankind, studying it on your own and applying the resulting "answer" is relatively simple (from one enthusiast to others). There is a huge number of resources for solving classification problems today; setting most of them aside, you can use the tools of the scikit-learn library (sklearn). Let's create our first learning machine:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X, y)  # X: feature matrix, y: class labels

So we have created a simple machine capable of predicting (or classifying) the class of an object from its features.
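
As a minimal sketch of what fit and predict do, the snippet below trains the classifier on a tiny, made-up dataset. The two clusters of points, the labels 0 and 1, and the random_state values are assumptions chosen only for illustration; they do not come from the article's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 20 points near (0, 0) labelled 0
# and 20 points near (10, 10) labelled 1.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(10.0, 0.5, size=(20, 2)),
])
y = np.array([0] * 20 + [1] * 20)

# Train a small forest on the features X and the classes y.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# Classify two new points, one close to each cluster.
predictions = clf.predict([[0.2, -0.1], [9.8, 10.3]])
print(predictions)
```

Because the two clusters are far apart, each new point should be assigned the label of the cluster it sits next to.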

What follows requires some knowledge of Python syntax and its capabilities. As usual, we start by importing the libraries we need:

import numpy as np
from pandas import read_csv as read
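
For illustration, here is a hedged sketch of how the read alias imported above loads CSV data into a table. The CSV content, the column names, and the in-memory StringIO buffer are all hypothetical stand-ins, since the article does not name its data file; for a real file you would pass its path instead.

```python
from io import StringIO
from pandas import read_csv as read

# Hypothetical CSV content: the first column is the class,
# the remaining columns are features.
csv_text = StringIO(
    "label,feature_a,feature_b\n"
    "1,0.5,3.2\n"
    "2,1.7,2.9\n"
    "1,0.4,3.5\n"
)

data = read(csv_text)  # for a file on disk, pass its path instead
print(data.shape)      # (number of rows, number of columns)
print(data.head())     # the first rows, shown as a table
```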

It is often convenient to load the available data into a table so that it is easier to inspect and work with, which is what pandas is for. Moreover, most of the datasets that users upload to the popular service Kaggle are in CSV format, which read_csv reads directly. Let's move on to the main part of the article: solving a classification problem. In order:

  • Create a training sample.
  • Train the machine on randomly selected features and their corresponding classes.
  • Calculate the quality of the trained machine.

Let's look at the implementation (each excerpt from the code is a separate cell in the notebook):

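# data is assumed to be a DataFrame loaded with read(); the article does not name the file
X = data.values[:, 1:14]  # features: columns 1 to 13
y = data.values[:, 0]     # classes: column 0, as a 1-D array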

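# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split as train
X_train, X_test, y_train, y_test = train(X, y, test_size=0.6)  # 60% of the rows go to the test sample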

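from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)  # 100 trees, use all CPU cores
clf.fit(X_train, y_train)   # train on the training sample
clf.score(X_test, y_test)   # mean accuracy on the test sample, from 0 to 1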

First we create the arrays, where X holds the features (columns 1 to 13) and y holds the classes (column 0). Then, to split the original data into training and test samples, we use the convenient function train_test_split, implemented in scikit-learn. With the samples ready, we import RandomForestClassifier from sklearn's ensemble module. This class contains all the methods needed to train and test the machine. We assign an instance of RandomForestClassifier to the variable clf (short for classifier), then call its fit() method to train the machine, where X_train holds the features of the objects whose categories are in y_train. Now you can use the built-in score metric to measure the accuracy of the categories predicted for X_test against their true values in y_test. This metric returns a value from 0 to 1, where 1 means that 100% of the predictions are correct.
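
If you do not have a CSV file at hand, the whole workflow can be tried on a dataset bundled with scikit-learn. The sketch below is written under stated assumptions: load_wine is used only as a stand-in because it also has 13 feature columns (it is not confirmed to be the data used in this article), and random_state is added so that repeated runs give the same split. It also checks that score is simply the share of correct predictions on the test sample.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: the bundled wine dataset stands in for the article's CSV file.
X, y = load_wine(return_X_y=True)  # X: 178 rows x 13 features, y: 3 classes

# Same split proportion as in the article: 60% of the rows go to the test sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)

# score() is the mean accuracy: the fraction of test rows classified correctly.
score = clf.score(X_test, y_test)
manual = accuracy_score(y_test, clf.predict(X_test))
print(score, manual)
```

The two printed numbers should match, since score for a classifier computes the accuracy of predict on the data it is given.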

The Last Word

I hope this article has helped you, at least a little, to get started with simple machine learning in Python. This knowledge will be enough to continue with a more intensive course on big data and machine learning. The main thing is to move gradually from the simple to the in-depth.
