
Intro to Machine Learning for Developers

This post narrows machine learning tool selection down to one: scikit-learn.


Welcome to the world of machine learning with scikit-learn. Machine learning can be overwhelming at times, partly because of the large number of tools available. This post simplifies tool selection down to one: scikit-learn.

In this series, you will learn how to construct an end-to-end machine learning pipeline using some of the most popular algorithms that are widely used in industry and professional competitions, such as Kaggle.

However, in this introductory post, we will go through the following topics:

  • A brief introduction to machine learning
  • What is scikit-learn?
  • Installing scikit-learn
  • Algorithms that you will learn to implement with scikit-learn in this series
  • Example to build your first regression model

Now, let's begin this fun journey into the world of machine learning with scikit-learn!

A Brief Introduction to Machine Learning

Machine learning has generated quite the buzz, from Elon Musk fearing the role of unregulated artificial intelligence in society to Mark Zuckerberg taking the opposite view.

So, what exactly is machine learning? Simply put, it is a set of methods that can detect patterns in data and use those patterns to make future predictions. Machine learning has found immense value in a wide range of industries, from finance to healthcare. This translates into growing demand for people with machine learning skills.

Here is a quick overview of the Google trend for machine learning.

Broadly speaking, machine learning can be categorized into three main types:
  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

Supervised Learning

Supervised learning is a form of machine learning in which our data comes with a set of labels or a numeric target variable. These labels or categories usually belong to a single attribute, commonly known as the target variable. For instance, each row of your data could belong to the category Healthy or Not Healthy.

Given a set of features such as weight, blood sugar levels, and age, we can use a supervised machine learning algorithm to predict whether the person is healthy or not. In the following simple mathematical expression, S is the supervised learning algorithm, X is the set of input features, such as weight and age, and Y is the target variable with the labels Healthy or Not Healthy:

$$Y = S(X)$$

Although supervised machine learning is the most common type of machine learning implemented with scikit-learn and in industry, most datasets do not come with predefined labels. Unsupervised learning algorithms are first used to cluster unlabeled data into distinct groups, to which we can then assign labels. This is discussed in detail in the Unsupervised Learning section.
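
To make the Healthy / Not Healthy example concrete, here is a minimal sketch of supervised learning in scikit-learn. The six rows of data are made up purely for illustration, and k-Nearest Neighbors is just one possible choice for the algorithm S.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up feature matrix X: [weight (kg), blood sugar (mg/dL), age (years)]
X = np.array([
    [62, 85, 25],
    [70, 90, 31],
    [68, 88, 40],
    [95, 160, 55],
    [102, 175, 61],
    [88, 150, 48],
])
# Made-up target Y: 1 = Healthy, 0 = Not Healthy
Y = np.array([1, 1, 1, 0, 0, 0])

# S is the supervised learning algorithm; fitting it learns a mapping X -> Y
S = KNeighborsClassifier(n_neighbors=3)
S.fit(X, Y)

# Predict the label of a new, unseen person
new_person = np.array([[65, 87, 29]])
print(S.predict(new_person))
```

The key point is that both X and Y are passed to fit; the labels are what make the problem supervised.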

Supervised Learning Algorithms

Supervised learning algorithms can be used to solve both classification and regression problems. In this series, you will learn how to implement some of the most popular ones: algorithms that are widely used in industry and research and that have helped solve problems across a wide range of domains. The following are some of these supervised learning algorithms:

  • Linear regression: This supervised learning algorithm is used to predict continuous numeric outcomes such as house prices, stock prices, and temperature, to name a few.
  • Logistic regression: The logistic regression algorithm is a popular classification algorithm that is used especially in the credit industry to predict loan defaults.
  • k-Nearest Neighbors: The k-NN algorithm is a classification algorithm that is used to classify data into two or more categories, and is widely used to classify houses into expensive and affordable categories based on price, area, bedrooms, and a whole range of other features.
  • Support vector machines: The SVM algorithm is a popular classification algorithm that is used in image and face detection, along with applications such as handwriting recognition.
  • Tree-based algorithms: Tree-based algorithms such as decision trees, random forests, and boosted trees are used to solve both classification and regression problems.
  • Naive Bayes: The Naive Bayes classifier is a machine learning algorithm that applies Bayes' theorem, under the assumption that features are independent of each other, to solve classification problems.
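
One practical consequence of using scikit-learn is that the classifiers listed above share the same fit and score interface. The following sketch illustrates this on synthetic data generated with make_classification; the data, the hyperparameters, and the choice to score on the training set are all assumptions made for brevity. Linear regression is left out because it is a regressor rather than a classifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, generated only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Support vector machine": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

train_accuracy = {}
for name, model in models.items():
    model.fit(X, y)                           # same call for every estimator
    train_accuracy[name] = model.score(X, y)  # mean accuracy on the training data
    print(f"{name}: {train_accuracy[name]:.2f}")
```

Training accuracy is optimistic; a held-out test set would give a fairer comparison.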

Unsupervised Learning

Unsupervised learning is a form of machine learning in which the algorithm tries to find patterns in data that does not have an outcome or target variable. In other words, the data does not come with pre-existing labels. The algorithm will therefore typically use a metric such as distance to group data points together depending on how close they are to each other.

As discussed in the previous section, most of the data that you will encounter in the real world will not come with a set of predefined labels and, as such, will only have a set of input features without a target attribute. In the following simple mathematical expression, U is the unsupervised learning algorithm, while X is a set of input features, such as weight and age:

$$U(X)$$

Given this data, our objective is to create groups that could potentially be labeled as Healthy or Not Healthy. The unsupervised learning algorithm will use a metric such as distance to identify how close a set of points are to each other and how far apart two such groups are.

Unsupervised Learning Algorithms

Unsupervised machine learning algorithms are typically used to cluster points of data based on distance. The unsupervised learning algorithm that you will learn is as follows:

  • k-means: The k-means algorithm is a popular algorithm that is typically used to segment customers into unique categories based on a variety of features, such as their spending habits. This algorithm is also used to segment houses into categories based on their features, such as price and area.
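
As a rough sketch of the customer segmentation use case, the snippet below clusters six made-up customers into two groups with scikit-learn's KMeans class. The feature values and the choice of two clusters are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up customer data: [annual spend (thousands of dollars), visits per month]
X = np.array([
    [2.0, 1], [2.5, 2], [3.0, 1],      # low spenders
    [20.0, 8], [22.0, 9], [21.0, 10],  # high spenders
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # no target variable is passed, only X

print(labels)
print(kmeans.cluster_centers_)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a label.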

Reinforcement Learning

Reinforcement learning is an area of machine learning concerned with taking suitable actions to maximize reward in a particular situation. It is employed by various software systems and machines to find the best possible behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning in that supervised training data comes with an answer key, so the model is trained on the correct answers, whereas in reinforcement learning there is no answer key and the agent itself decides what to do to perform the given task. In the absence of a training dataset, it must learn from its own experience.
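
The idea of learning from experience rather than from an answer key can be sketched with a tiny epsilon-greedy agent written in plain NumPy. This is not a scikit-learn feature; the three-action environment, its reward values, and the exploration rate are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: three actions with hidden average rewards
true_means = np.array([0.2, 0.5, 0.8])

estimates = np.zeros(3)   # the agent's running estimate of each action's reward
counts = np.zeros(3)      # how often each action has been tried
epsilon = 0.1             # fraction of steps spent exploring at random

for step in range(1000):
    if rng.random() < epsilon:
        action = int(rng.integers(3))        # explore a random action
    else:
        action = int(np.argmax(estimates))   # exploit the best estimate so far
    # The agent only observes a noisy reward, never the "correct" action
    reward = true_means[action] + 0.1 * rng.normal()
    counts[action] += 1
    # Incremental average of the rewards seen for this action
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates.round(2))
print("best action:", int(np.argmax(estimates)))
```

The agent is never told which action is best; it discovers this by trying actions and tracking the rewards it receives.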

Prerequisites for Machine Learning:

  1. How to Set Up Jupyter Notebook perfectly for Data Analysis
  2. Pandas in Python for Data Analysis with Example (Step-by-Step guide)
  3. Data Visualization

How We Are Going to Do It — Scikit-Learn

Scikit-learn is a free, open source library that helps you tackle supervised and unsupervised machine learning problems. It is written largely in Python and builds on some of the most popular libraries that Python has to offer, namely NumPy and SciPy. Scikit-learn is popular mainly because most of the world's most widely used machine learning algorithms can be implemented quickly in a plug-and-play fashion once you know what the core pipeline looks like. Another reason is that popular classification algorithms, such as logistic regression and support vector machines, are written in Cython, which gives them C-like performance and makes scikit-learn efficient to use.

Scikit-learn is designed to tackle problems pertaining to supervised and unsupervised learning only and does not support reinforcement learning at present.

Installing the Scikit-Learn Package

There are two ways in which you can install scikit-learn on your personal device:

  • By using the pip method
  • By using the Anaconda method

The pip method can be used from the macOS/Linux Terminal or Windows PowerShell, while the Anaconda method works from the Anaconda prompt. The commands for each method are listed below.

The pip Method

pip3 install numpy
pip3 install scipy
pip3 install scikit-learn
pip3 install -U scikit-learn

The Anaconda Method

conda install numpy
conda install scipy
conda install scikit-learn
conda update scikit-learn
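
Whichever method you use, a quick way to confirm that the installation worked is to import the packages and print their versions. This is a minimal check, not part of the original instructions.

```python
import numpy
import scipy
import sklearn

# Print the installed versions to confirm that each package imports correctly
for package in (numpy, scipy, sklearn):
    print(package.__name__, package.__version__)
```

If any of the imports fails, the corresponding package was not installed into the Python environment you are running.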

So far, this lesson has given a brief introduction to machine learning for those of you who are just beginning your journey. You have learned how scikit-learn fits into the context of machine learning and how to install the necessary software.

Now, we'll put this into practice and do some data exploration and analysis.

The dataset we'll look at in this section is the so-called Boston housing dataset.

Loading the Data Into Jupyter Using a Pandas DataFrame

Often, data is stored in tables, which means it can be saved as a comma-separated values (CSV) file. This format and many others can be read into Python as a DataFrame object using the Pandas library. Other common formats include tab-separated values (TSV), SQL tables, and JSON data structures, and Pandas has support for all of these. In this example, however, we are not going to load the data this way because the dataset is available directly through scikit-learn.
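
For reference, reading tabular text into a DataFrame typically looks like the sketch below. The CSV content is a made-up, in-memory stand-in for a file on disk; with a real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a file on disk (values are made up)
csv_text = """RM,LSTAT,MEDV
6.0,5.5,25.0
5.5,10.0,20.0
7.0,3.5,35.0
"""

# pd.read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

# Tab-separated data uses the same function with a different separator
df_tsv = pd.read_csv(io.StringIO(csv_text.replace(",", "\t")), sep="\t")

print(df_csv.shape)
print(df_csv.equals(df_tsv))
```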

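The Boston housing dataset can be accessed from the sklearn.datasets module using the load_boston function. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the following code requires an older release.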

from sklearn import datasets
boston = datasets.load_boston()
type(boston)

print(boston['DESCR'])

import pandas as pd

# Load the data as a pandas DataFrame
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])

# Check the first 5 rows of the DataFrame
df.head()

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33

In machine learning, the variable that is being modeled is called the target variable; it's what you are trying to predict given the features. For this dataset, the suggested target is MEDV, the median house value in 1,000s of dollars.

# Add the target as a temporary column of the DataFrame
df['MEDV'] = boston['target']

# Create a copy of the target values
y = df['MEDV'].copy()

# Delete the newly created column
del df['MEDV']

# Concatenate the target column to the front of the existing DataFrame
df = pd.concat((y, df), axis=1)
df.head(2)

   MEDV     CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO      B  LSTAT
0  24.0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.9   4.98
1  21.6  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.9   9.14

Here, we introduce a temporary variable y to hold a copy of the target column before removing it from the DataFrame. We then use the Pandas concatenation function to combine it with the remaining DataFrame along axis 1, which joins columns (as opposed to axis 0, which stacks rows).
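
The difference between the two axes is easiest to see on a toy example. The frames below contain made-up values and exist only to show what the axis argument of pd.concat does.

```python
import pandas as pd

# Toy objects with made-up values, used only to show the axis argument
features = pd.DataFrame({"RM": [6.0, 5.5], "LSTAT": [5.5, 10.0]})
target = pd.Series([25.0, 20.0], name="MEDV")

# axis=1 places the objects side by side (adds columns)
side_by_side = pd.concat((target, features), axis=1)

# axis=0 stacks the objects on top of each other (adds rows)
stacked = pd.concat((features, features), axis=0)

print(side_by_side.columns.tolist())
print(side_by_side.shape, stacked.shape)
```

Because the target is listed first in the tuple, it becomes the first column of the combined frame.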

print(df.shape)

df.isnull().sum()
---------------------
(506, 14)

MEDV       0
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64

For this dataset, we see there are no NaNs, which means we have no immediate work to do in cleaning the data and can move on.
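
Had the check turned up missing values, two common remedies are dropping the incomplete rows or filling the gaps. The sketch below uses a made-up frame with two NaNs; filling with the column mean is one assumption among several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and two missing entries
toy = pd.DataFrame({
    "RM": [6.5, np.nan, 7.2, 6.0],
    "LSTAT": [5.0, 9.1, np.nan, 12.0],
})

missing_per_column = toy.isnull().sum()   # same check as used on the dataset

dropped = toy.dropna()                    # option 1: discard incomplete rows
filled = toy.fillna(toy.mean())           # option 2: fill with the column mean

print(missing_per_column.to_dict())
print(dropped.shape, int(filled.isnull().sum().sum()))
```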

To simplify the analysis, the final thing we'll do before exploration is remove some of the columns. We won't bother looking at these and instead focus on the remainder in more detail.

Remove some columns by running the cell that contains the following code:

for col in ['ZN', 'NOX', 'RAD', 'PTRATIO', 'B']:
    del df[col]
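
An alternative to deleting columns one at a time in a loop is the DataFrame.drop method, sketched here on a made-up single-row frame. Unlike del, it returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset
toy = pd.DataFrame({"MEDV": [24.0], "ZN": [18.0], "NOX": [0.5], "RM": [6.5]})

# drop returns a new DataFrame and leaves the original untouched
trimmed = toy.drop(columns=["ZN", "NOX"])

print(trimmed.columns.tolist())
print(toy.columns.tolist())
```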

Data Exploration

Since this is an entirely new dataset that we've never seen before, the first goal here is to understand the data. We've already seen the textual description of the data, which is important for qualitative understanding. We'll now compute a quantitative description.

df.describe().T
       count        mean         std        min         25%        50%         75%       max
MEDV   506.0   22.532806    9.197104    5.00000   17.025000   21.20000   25.000000   50.0000
CRIM   506.0    3.613524    8.601545    0.00632    0.082045    0.25651    3.677083   88.9762
INDUS  506.0   11.136779    6.860353    0.46000    5.190000    9.69000   18.100000   27.7400
CHAS   506.0    0.069170    0.253994    0.00000    0.000000    0.00000    0.000000    1.0000
RM     506.0    6.284634    0.702617    3.56100    5.885500    6.20850    6.623500    8.7800
AGE    506.0   68.574901   28.148861    2.90000   45.025000   77.50000   94.075000  100.0000
DIS    506.0    3.795043    2.105710    1.12960    2.100175    3.20745    5.188425   12.1265
TAX    506.0  408.237154  168.537116  187.00000  279.000000  330.00000  666.000000  711.0000
LSTAT  506.0   12.653063    7.141062    1.73000    6.950000   11.36000   16.955000   37.9700

This computes various properties, including the mean, standard deviation, minimum, and maximum, for each column. The table gives a high-level idea of how everything is distributed. Note that we have taken the transpose of the result by adding .T to the output; this swaps the rows and columns.

cols = ['RM', 'AGE', 'TAX', 'LSTAT', 'MEDV'] 

df[cols].corr()

import matplotlib.pyplot as plt 

import seaborn as sns 
%matplotlib inline 
ax = sns.heatmap(df[cols].corr(), 
                 cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))
ax.xaxis.tick_top() # move labels to the top

We call sns.heatmap and pass the pairwise correlation matrix as input. We use a custom color palette here to override the Seaborn default.

This resulting table shows the correlation score between each set of values. Large positive scores indicate a strong positive (that is, in the same direction) correlation. As expected, we see maximum values of 1 on the diagonal.

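The Pearson coefficient is defined as the covariance between two variables, divided by the product of their standard deviations:

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$

The covariance, in turn, is defined as follows:

$$\mathrm{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})$$

Here, $n$ is the number of samples, $x_i$ and $y_i$ are the individual samples being summed over, and $\bar{X}$ and $\bar{Y}$ are the means of each set.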
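
The definition can be checked numerically. The sketch below computes the coefficient by hand on made-up samples and compares it with NumPy's built-in np.corrcoef. It assumes a 1/n normalization for both the covariance and the standard deviations; any consistent choice cancels out in the ratio.

```python
import numpy as np

# Made-up samples, used only to exercise the formulas
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
# Covariance: average product of the deviations from each mean
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n
# np.std uses the same 1/n normalization by default
pearson = cov_xy / (x.std() * y.std())

print(round(pearson, 4))
print(np.isclose(pearson, np.corrcoef(x, y)[0, 1]))
```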

Linear Model With Scikit-Learn

The Boston housing dataset presents us with a regression problem, in which we predict a continuous target variable given a set of features. In particular, we'll be predicting the median house value (MEDV). We'll train models that take only one feature as input to make this prediction. This way, the models will be conceptually simple to understand and we can focus more on the technical details of the scikit-learn API.

We'll import the LinearRegression class and combine it with PolynomialFeatures to fit a third-order polynomial regression model that predicts the median house value (MEDV) from the LSTAT values. We are hoping to build a model with a lower mean-squared error (MSE) than a straight-line fit would give. Run the following:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Target (MEDV) and a single feature (LSTAT) as a 2-D column array
y = df['MEDV'].values
x = df['LSTAT'].values.reshape(-1, 1)

# Expand LSTAT into polynomial terms up to the third degree
poly = PolynomialFeatures(degree=3)
x_poly = poly.fit_transform(x)

# Fit an ordinary least-squares model on the polynomial features
clf = LinearRegression()
clf.fit(x_poly, y)

# Predictions and residuals on the training data
y_pred = clf.predict(x_poly)
resid_MEDV = y - y_pred

error = mean_squared_error(y, y_pred)
print('mse = {:.2f}'.format(error))

fig, ax = plt.subplots()
# Plot the samples
ax.scatter(x.flatten(), y, alpha=0.6)
# Plot the polynomial model over an evenly spaced grid of LSTAT values
x_ = np.linspace(2, 38, 50).reshape(-1, 1)
y_ = clf.predict(poly.transform(x_))
ax.plot(x_, y_, color='red', alpha=0.8)
ax.set_xlabel('LSTAT'); ax.set_ylabel('MEDV');
--------------------------------------
mse = 28.88
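
To see why a polynomial model can achieve a lower training MSE than a straight line, the following sketch fits both to synthetic curved data. The data is generated from a made-up exponential curve as a stand-in for the LSTAT and MEDV relationship, not taken from the real dataset, and make_pipeline is used here simply as a convenient way to chain the two steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data, a stand-in for LSTAT vs MEDV (not the real dataset)
rng = np.random.default_rng(0)
x = rng.uniform(2, 38, size=200).reshape(-1, 1)
y = 50 * np.exp(-0.08 * x.ravel()) + rng.normal(scale=2.0, size=200)

# Straight-line fit and third-order polynomial fit on the same data
linear = LinearRegression().fit(x, y)
cubic = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(x, y)

mse_linear = mean_squared_error(y, linear.predict(x))
mse_cubic = mean_squared_error(y, cubic.predict(x))

print(f"linear mse = {mse_linear:.2f}")
print(f"cubic mse  = {mse_cubic:.2f}")
```

On training data, the cubic fit can never do worse than the linear one because it contains the straight line as a special case; whether it generalizes better has to be checked on held-out data.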

This completes the guide to writing your first machine learning model. Here, we used visual aids, such as scatter plots, to deepen our understanding of the data. We also performed simple predictive modeling. In the next part, we will look at what MSE and RMSE are and work with other models to improve accuracy. Hope you liked it! Let me know your thoughts in the comments section.


Topics:
big data, python, machine learning, ml pipeline, artificial intelligence, supervised learning, unsupervised learning, reinforcement learning, scikit-learn

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
