# Intro to Machine Learning for Developers

### This post will simplify this process of machine learning tool selection down to one — scikit-learn.

Join the DZone community and get the full member experience.

Join For FreeWelcome to the world of machine learning with scikit-learn. Machine learning can be overwhelming at times, and this is partly due to a large number of tools that are available on the market. This post will simplify this process of tool selection down to one — scikit-learn.

In this series, you will learn how to construct an end-to-end machine learning pipeline using some of the most popular algorithms that are widely used in industry and professional competitions, such as Kaggle.

However, in this introductory post, we will go through the following topics:

- A brief introduction to machine learning
- What is scikit-learn?
- Installing scikit-learn
- Algorithms that you will learn to implement scikit-learn in series.
- Example to build your first regression model

Now, let's begin this fun journey into the world of machine learning with **scikit-learn!**

**A Brief Introduction to Machine Learning**

Machine learning has generated quite the buzz — from Elon Musk fearing the role of unregulated artificial intelligence in society to Mark Zuckerberg having a view that contradicts Musk.

Here is a quick overview of the Google trend for machine learning.

Broadly speaking, machine learning can be categorized into three main types:

- Supervised learning
- Unsupervised learning
- Reinforcement learning

## Supervised Learning

*Healthy*or

*Not Healthy*.

*Healthy*or

*Not Healthy*: Although supervised machine learning is the most common type of machine learning that is implemented with scikit-learn and in the industry, most datasets typically do not come with predefined labels. Unsupervised learning algorithms are first used to cluster data without labels into distinct groups to which we can then assign labels. This is discussed in detail in the following section.

### Supervised Learning Algorithms

Supervised learning algorithms can be used to solve both classification and regression problems. You will learn how to implement some of the most popular supervised machine learning algorithms. Popular supervised machine learning algorithms are the ones that are widely used in industry and research and have helped us solve a wide range of problems across a wide range of domains. The following are some of the supervised learning algorithms:

**Linear regression:**This supervised learning algorithm is used to predict continuous numeric outcomes such as house prices, stock prices, and temperature, to name a few.**Logistic regression:**The logistic learning algorithm is a popular classification algorithm that is especially used in the credit industry in order to predict loan defaults.**k-Nearest Neighbors:**The k-NN algorithm is a classification algorithm that is used to classify data into two or more categories, and is widely used to classify houses into expensive and affordable categories based on price, area, bedrooms, and a whole range of other features.**Support vector machines:**The SVM algorithm is a popular classification algorithm that is used in image and face detection, along with applications such as handwriting recognition.**Tree-Based algorithms:**Tree-based algorithms such as decision trees, Random Forests, and Boosted trees are used to solve both classification and regression problems.**Naive Bayes:**The Naive Bayes classifier is a machine learning algorithm that uses the mathematical model of probability to solve classification problems.

## Unsupervised Learning

Unsupervised learning is a form of machine learning in which the algorithm tries to detect/find patterns in data that do not have an outcome/target variable. In other words, we do not have data that comes with pre-existing labels. Thus, the algorithm will typically use a metric such as distance to group data together depending on how close they are to each other. As discussed in the previous section, most of the data that you will encounter in the real world will not come with a set of predefined labels and, as such, will only have a set of input features without a target attribute. In the following simple mathematical expression, U is the unsupervised learning algorithm, while X is a set of input features, such as weight and age: Given this data, our objective is to create groups that could potentially be labeled as *Healthy* or *Not Healthy*. The unsupervised learning algorithm will use a metric such as distance in order to identify how close a set of points are to each other and how far apart two such groups are.

### Unsupervised Learning Algorithms

Unsupervised machine learning algorithms are typically used to cluster points of data based on distance. The unsupervised learning algorithm that you will learn is as follows:

- k-means: The k-means algorithm is a popular algorithm that is typically used to segment customers into unique categories based on a variety of features, such as their spending habits. This algorithm is also used to segment houses into categories based on their features, such as price and area.

## Reinforcement Learning

Reinforcement learning is an area of Machine Learning. Reinforcement. It is about taking suitable action to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behavior or path it should take in a specific situation. Reinforcement learning differs from the supervised learning in a way that in supervised learning the training data has the answer key with it so the model is trained with the correct answer itself whereas in reinforcement learning, there is no answer but the reinforcement agent decides what to do to perform the given task. In the absence of training dataset, it is bound to learn from its experience.

**Prerequisites for the Machine Learning:**

- How to Setup Jupyter Notebook perfectly for Data Analysis
- Pandas in Python for Data Analysis with Example (Step-by-Step guide)
- Data Visualization

**How We Are Going to Do It — Scikit-Learn**

Scikit-learn is a free and open source software that helps you tackle supervised and unsupervised machine learning problems. The software is built entirely in Python and utilizes some of the most popular libraries that Python has to offer, namely NumPy and SciPy. The main reason why scikit-learn is very popular stems from the fact that most of the world's most popular machine learning algorithms can be implemented quite quickly in a plug and play format once you know what the core pipeline is like. Another reason is that popular algorithms for classification such as logistic regression and support vector machines are written in Cython. Cython is used to give these algorithms *C-like* performance and thus makes the use of scikit-learn quite efficient in the process.

Scikit-learn is designed to tackle problems pertaining to supervised and unsupervised learning only and does not support reinforcement learning at present.

**Installing the Scikit-Learn Package**

There are two ways in which you can install scikit-learn on your personal device:

- By using the pip method
- By using the Anaconda method

The pip method can be implemented on the macOS/Linux Terminal or the Windows PowerShell, while the Anaconda method will work with the Anaconda prompt. Choosing between these two methods of installation is pretty straightforward:

### The pip Method

```
pip3 install NumPy
pip3 install SciPy
pip3 install scikit-learn
pip3 install -U scikit-learn
```

**The Anaconda Method**

```
conda install NumPy
conda install SciPy
conda install scikit-learn
conda install -U scikit-learn
```

So far, this lesson has focused on the brief introduction into what machine learning is for those of you who are just beginning your journey into the world of machine learning. You have learned about how scikit-learn fits into the context of machine learning and how you can go about installing the necessary software.

Now, we'll put this into practice and do some data exploration and analysis.

The dataset we'll look at in this section is the so-called **Boston housing dataset.**

**Loading the Data Into Jupyter Using a Pandas DataFrame**

Often times, data is stored in tables, which means it can be saved as a comma-separated variable (CSV) file. This format and many others can be read into Python as a DataFrame object, using the Pandas library. Other common formats include tab-separated variable (TSV), SQL tables, and JSON data structures. Indeed, Pandas has support for all of these. In this example, however, we are not going to load the data this way because the dataset is available directly through scikit-learn.

The Boston housing dataset can be accessed from the module **sklearn.datasets **using the method. **load_boston**

```
from sklearn import datasets
boston = datasets.load_boston()
type(boston)
print(boston['DESCR'])
```

```
import pandas as pd
## Loading the data as Dataframe in pandas
df = pd.DataFrame(data=boston['data'], columns = boston['feature_names'])
#Checking our top 5 rows of the dataframe
df.head()
```

CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |

1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |

2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |

3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |

4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |

In machine learning, the variable that is being modeled is called the target variable; it's what you are trying to predict given the features. For this dataset, the suggested target is MEDV, the median house value in 1,000s of dollars.

```
## Adding Target temp Column to our dataframe
df['MEDV'] = boston['target']
## Creating copy of the target Value
y = df['MEDV'].copy()
##Deleting the Newly created column
del df['MEDV']
## Concat the target columns to our existing dataframe
df = pd.concat((y, df), axis=1)
```

MEDV | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |

1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |

Here, we introduce a dummy variable y to hold a copy of the target column before removing it from the DataFrame. We then use the Pandas concatenation function to combine it with the remaining DataFrame along the 1st axis (as opposed to the 0th axis, which combines rows).

```
print(df.shape)
df.isnull().sum()
---------------------
(506, 14)
MEDV 0
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
dtype: int64
```

For this dataset, we see there are no NaNs, which means we have no immediate work to do in cleaning the data and can move on.

To simplify the analysis, the final thing we'll do before exploration is remove some of the columns. We won't bother looking at these and instead focus on the remainder in more detail.

Remove some columns by running the cell that contains the following code:

```
for col in ['ZN', 'NOX', 'RAD', 'PTRATIO', 'B']:
del df[col]
```

**Data Exploration**

Since this is an entirely new dataset that we've never seen before, the first goal here is to understand the data. We've already seen the textual description of the data, which is important for qualitative understanding. We'll now compute a quantitative description.

`df.describe().T`

count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|

MEDV | 506.0 | 22.532806 | 9.197104 | 5.00000 | 17.025000 | 21.20000 | 25.000000 | 50.0000 |

CRIM | 506.0 | 3.613524 | 8.601545 | 0.00632 | 0.082045 | 0.25651 | 3.677083 | 88.9762 |

INDUS | 506.0 | 11.136779 | 6.860353 | 0.46000 | 5.190000 | 9.69000 | 18.100000 | 27.7400 |

CHAS | 506.0 | 0.069170 | 0.253994 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 1.0000 |

RM | 506.0 | 6.284634 | 0.702617 | 3.56100 | 5.885500 | 6.20850 | 6.623500 | 8.7800 |

AGE | 506.0 | 68.574901 | 28.148861 | 2.90000 | 45.025000 | 77.50000 | 94.075000 | 100.0000 |

DIS | 506.0 | 3.795043 | 2.105710 | 1.12960 | 2.100175 | 3.20745 | 5.188425 | 12.1265 |

TAX | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 | 711.0000 |

LSTAT | 506.0 | 12.653063 | 7.141062 | 1.73000 | 6.950000 | 11.36000 | 16.955000 | 37.9700 |

This computes various properties including the mean, standard deviation, minimum, and maximum for each column. This table gives a high-level idea of how everything is distributed. Note that we have taken the transform of the result by adding a .T to the output; this swaps the rows and columns.

```
cols = ['RM', 'AGE', 'TAX', 'LSTAT', 'MEDV']
df[cols].corr()
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
ax = sns.heatmap(df[cols].corr(),
cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))
ax.xaxis.tick_top() # move labels to the top
```

We call sns.heatmap and pass the pairwise correlation matrix as input. We use a custom color palette here to override the Seaborn default.

This resulting table shows the correlation score between each set of values. Large positive scores indicate a strong positive (that is, in the same direction) correlation. As expected, we see maximum values of 1 on the diagonal.

Pearson coefficient is defined as the covariance between two variables, divided by the product of their standard deviations:

The covariance, in turn, is defined as follows:

Here, n is the number of samples, xi and yi are the individual samples being summed over, and Xbar and Ybar are the means of each set.

**Linear Model With Scikit-Learn**

We can see that it presents us with a regression problem where we predict a continuous target variable given a set of features. In particular, we'll be predicting the median house value (MEDV). We'll train models that take only one feature as input to make this prediction. This way, the models will be conceptually simple to understand and we can focus more on the technical details of the scikit-learn API.

We'll import the `LinearRegression`

class and build our linear classification model the same way as before when we calculated the MSE. Run the following:

Use scikit-learn to fit a polynomial regression model to predict the median house value (MEDV), given the LSTAT values. We are hoping to build a model that has a lower mean-squared error (MSE)

```
y = df['MEDV'].values
x = df['LSTAT'].values.reshape(-1,1)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
x_poly = poly.fit_transform(x)
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(x_poly, y)
y_pred = clf.predict(x_poly)
resid_MEDV = y - y_pred
from sklearn.metrics import mean_squared_error
error = mean_squared_error(y, y_pred)
print('mse = {:.2f}'.format(error))
fig, ax = plt.subplots()
# Plot the samples
ax.scatter(x.flatten(), y, alpha=0.6)
# Plot the polynomial model
x_ = np.linspace(2, 38, 50).reshape(-1, 1)
x_poly = poly.fit_transform(x_)
y_ = clf.predict(x_poly)
ax.plot(x_, y_, color='red', alpha=0.8)
ax.set_xlabel('LSTAT'); ax.set_ylabel('MEDV');
```

```
--------------------------------------
mse = 28.88
```

This completes the extensive guide to understanding how to write your first machine learning model. Here, we used visual assists, such as scatter plots, to deepen our understanding of the data. We also performed simple predictive modeling. In the next part, we will look into what is MSE/RMSE and work with other models to enhance accuracy. Hope you liked it! Let me know your thoughts in the comments section.

Published at DZone with permission of Ram Dunge. See the original article here.

Opinions expressed by DZone contributors are their own.

Comments