Machine Learning Is Easy
If you're a beginner who wants to learn why machine learning is needed at all and why it's gaining popularity, this article will answer your questions.
In this article, we will talk about machine learning in general and its interaction with data warehouses. If you are a beginner who doesn't know where to start, and you want to know why machine learning is needed at all and why it has recently been gaining popularity, I'll try to answer your questions. We will use Python 3, as it is a fairly simple tool for learning machine learning.
Who Is This Article For?
This article is for anyone who enjoys digging into history in search of new facts, and for everyone who has at least once asked, "How does all this machine learning work?" Experienced practitioners will most likely find nothing new here, since the programming part is deliberately simplified so that beginners can follow it. Still, a look at where machine learning came from and how it has developed will not hurt anyone.
Every year, both companies and enthusiasts have a growing need to study large data sets. Large companies such as Yandex and Google increasingly use data analysis tools such as the R programming language or libraries for Python (the examples in this article are written in Python 3).
According to Moore's Law, the number of transistors on an integrated circuit doubles roughly every 24 months. As a result, the performance of our computers keeps growing, and the boundaries of previously inaccessible knowledge are again "shifted to the right." This opens up room for studying large data sets and has given rise to "big data science," which became possible largely because machine learning algorithms described long ago could finally be tested in practice, in some cases only half a century later. Who knows? Maybe in a few years, we will be able to describe the various forms of fluid motion with absolute accuracy, for example.
So... Data Analysis Is Easy?
Yes. Besides being important for all of us, big data is relatively simple to study on your own, and the resulting "answer" is simple to apply (from one enthusiast to others). There is a huge number of resources for solving classification problems today; setting most of them aside, you can use the tools of the scikit-learn library (imported as sklearn). Let's create our first learning machine:
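from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X, y)  # X: one row of features per sample, y: the class of each sample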
We have now created a simple machine capable of predicting (classifying) the class of an object from its attributes.
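As a minimal sketch of what such a machine does, the example below trains the same kind of classifier on a tiny made-up data set and asks it to classify two new samples. The feature values, the labels, and the `n_estimators=10` and `random_state=0` settings are assumptions chosen for illustration; they are not data from this article.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: two features per sample, two well-separated classes.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# random_state is fixed only so that repeated runs behave the same way.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# Ask the trained machine to classify two new, unseen samples.
new_samples = np.array([[0.1, 0.1], [5.0, 5.0]])
predictions = clf.predict(new_samples)
print(predictions)
```

The first new sample sits next to the class 0 points and the second next to the class 1 points, so the classifier should assign them to those classes.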
Further use requires the reader to have some knowledge of the syntax of Python and its capabilities. As usual, we import the necessary libraries for work:
import numpy as np
from pandas import read_csv as read
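As a rough sketch of how the `read` alias imported above might be used, the example below loads a small CSV table and prints a few views of it. The CSV contents and the column names `label`, `feature_a`, and `feature_b` are hypothetical; an in-memory string stands in for a real file so that the example runs on its own.

```python
from io import StringIO
from pandas import read_csv as read

# Hypothetical CSV contents standing in for a downloaded data file.
csv_text = StringIO(
    "label,feature_a,feature_b\n"
    "1,0.5,10\n"
    "2,0.7,12\n"
    "1,0.4,9\n"
)

data = read(csv_text)   # a file path can be passed here instead
print(data.head())      # the first rows, to eyeball the table
print(data.shape)       # (number of rows, number of columns)
print(data.describe())  # per-column summary statistics
```

Looking at the table this way before training makes it easier to see which column holds the classes and which columns hold the features.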
Sometimes it is convenient to "visualize" the available data so that it is easier to work with. Moreover, most of the data sets on the popular service Kaggle are provided by users in CSV format. Let's move on to the main part of the article: solving a classification problem. In order:
- Create a training sample.
- Train the machine on randomly selected features and their corresponding classes.
- Calculate the quality of the trained machine.
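The first step above, creating a training sample, can be sketched on made-up data to show what the split produces. The arrays below and the `random_state=0` setting are assumptions for illustration; `test_size=0.6` means that 60% of the rows are held out for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 3 features each, one class label per sample.
X = np.arange(30).reshape(10, 3)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Shuffle the rows and hold out 60% of them as the test sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 10 rows, this leaves 4 rows for training and 6 for testing; features and labels are split together, so each row keeps its own class.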
Let's look at the implementation (each excerpt from the code is a separate cell in the notebook):
from sklearn.model_selection import train_test_split as train
from sklearn.ensemble import RandomForestClassifier

# `data` is assumed to be a DataFrame already loaded with read(...)
X = data.values[:, 1:14]  # features: columns 1 to 13
y = data.values[:, 0]     # classes: column 0, as a 1-D array

X_train, X_test, y_train, y_test = train(X, y, test_size=0.6)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
We create two arrays, where X holds the features (columns 1 to 13) and y holds the classes (column 0). Then, to split the original data into training and test samples, we use the convenient function train_test_split, implemented in scikit-learn (in current versions it is imported from sklearn.model_selection). With the samples ready, we import RandomForestClassifier from sklearn's ensemble module. This class contains all the methods needed for training and testing the machine. We assign an instance of RandomForestClassifier to the variable clf (classifier), then call its fit() method to train the machine, where X_train holds the features of the samples whose categories are in y_train. Now you can use the built-in score metric to measure how well the categories predicted for X_test match the true categories in y_test. This metric reports accuracy as a value from 0 to 1, where 1 corresponds to 100%. Done!
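To make the meaning of the score concrete, here is a small sketch that computes accuracy by hand on a held-out test split and compares it with the built-in `score` method. The article does not name its data set, so scikit-learn's bundled wine data set (13 features and a class label) is used here purely as a stand-in, and `random_state=0` is an assumption added for repeatability.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption): the wine data set shipped with scikit-learn.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Accuracy by hand: the fraction of test samples whose predicted class is correct.
manual_accuracy = np.mean(clf.predict(X_test) == y_test)

# The built-in metric computes the same quantity.
builtin_accuracy = clf.score(X_test, y_test)

print(manual_accuracy, builtin_accuracy)
```

Both numbers are the same value between 0 and 1. Evaluating on the test split rather than on the training split matters, because a random forest can score close to 1 on data it has already seen.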
The Last Word
I hope this article has helped you, at least a little, to get started with simple machine learning in Python. This knowledge will be enough to move on to a more intensive course in big data and machine learning. The main thing is to move from the simple to the in-depth gradually.