# Linear Regression Model

### Linear regression can be very useful in many business situations. This article walks you through how to create a linear regression model.


Linear regression is a machine learning technique used to establish a relationship between a scalar response and one or more explanatory variables. The scalar response is called the target or dependent variable, while the explanatory variables are also known as independent variables. When more than one independent variable is used in the modeling technique, we call it multiple linear regression.

Independent variables are known as explanatory variables because they explain the factors that control the dependent variable, along with the degree of impact. This impact can be quantified using 'parameter estimates' or 'coefficients'.

Coefficients are tested for statistical significance using confidence intervals built around them, which also supports model robustness. Elasticities are based on the coefficients and describe the extent to which a given factor explains the dependent variable. A negative coefficient indicates an inverse relationship with the dependent variable, while a positive coefficient indicates a positive influence.

### Application of Linear Regression

- Insights on consumer behavior
- Understanding business factors influencing profitability
- Estimates or forecasts
- Marketing effectiveness, pricing, and promotion strategies
- Assess and minimize risk in portfolios
- Understand important factors leading to the customer default

A very broad overview of the important steps needed to build a linear regression model could be:

- Feasibility of application of regression technique on the data set
- Preparing the data to make it ready for regression
- Build the regression model and test its accuracy
- Save the model for future prediction
- Deploy and maintain/monitor the model

### Techniques to Build a Linear Regression Model

We can build a linear regression model using either of the techniques below:

- Gradient Descent
- Ordinary Least Square(OLS)

Gradient descent (GD) is the process of minimizing a function by following the gradients of the cost function. Understanding the form of the cost function and its derivative is important in order to know the direction to move in. In machine learning parlance, a closely related variant known as stochastic gradient descent (SGD) is commonly used, and it is responsible for minimizing the error of the model on the training data.

The model makes a prediction for each training instance shown to it; the resulting error is calculated and used to update the parameters, so that the error of the next prediction is reduced.
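That per-instance update loop can be sketched from scratch. The sketch below is illustrative only: it assumes a single feature, a fixed learning rate, and synthetic data generated from y = 2x + 1 plus noise (none of these values come from the article's data set):

```python
import numpy as np

# Synthetic data for illustration: y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, 200)
y = 2.0 * X + 1.0 + rng.normal(0, 0.5, 200)

w, b = 0.0, 0.0   # slope and intercept, initialized at zero
lr = 0.01         # fixed learning rate

for epoch in range(50):
    for xi, yi in zip(X, y):
        error = (w * xi + b) - yi   # prediction error for this training instance
        w -= lr * error * xi        # gradient step for the slope
        b -= lr * error             # gradient step for the intercept

# After training, w and b should land near the true slope 2 and intercept 1
print(round(w, 2), round(b, 2))
```

Each instance nudges the parameters against the gradient of its own squared error, which is what makes the method "stochastic": the update direction varies sample to sample but averages out to the true gradient.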

The ordinary least squares method is a technique for estimating the unknown parameters in a linear regression model. It aims to minimize the sum of the squared differences between the observed and the predicted values.
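This minimization has a closed-form solution via the normal equations, beta = (X'X)^-1 X'y. A minimal NumPy sketch on synthetic data (the true coefficients 3.0 and -1.5 and intercept 0.5 are illustrative assumptions, not from the article's data):

```python
import numpy as np

# Synthetic data: y = 0.5 + 3.0*x1 - 1.5*x2 plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 + rng.normal(0, 0.1, 100)

# Prepend a column of ones so the intercept is estimated alongside the slopes
Xb = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem; lstsq is the numerically stable route
# (equivalent to the normal equations for full-rank X)
beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(beta.round(2))  # approximately [intercept, slope1, slope2]
```

With low noise and 100 observations, the recovered parameters should sit very close to the generating values, which is exactly what sklearn's `LinearRegression` computes internally.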

The applicability of the OLS technique rests on certain assumptions, so it is good practice to check them before building the linear regression model. The assumptions of OLS are listed below:

- Linear relationship - the relationship between the independent and dependent variables is linear
- Multivariate normality - all variables are multivariate normal
- No or little multicollinearity - there is little or no multicollinearity in the data; multicollinearity is the phenomenon in which the independent variables are highly correlated with each other
- No autocorrelation - there is little or no autocorrelation in the data; autocorrelation occurs when the residuals are not independent of each other
- Homoscedasticity - the residuals exhibit homoscedasticity, meaning the variance of the error term is the same across all values of the independent variables
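Two of these assumptions can be spot-checked numerically before modeling. The sketch below, on synthetic data, flags high pairwise correlation between predictors (a simple proxy for multicollinearity) and computes a Durbin-Watson style statistic on the residuals, where values near 2 suggest little autocorrelation. The 0.8 threshold and the data are illustrative choices, not prescriptions:

```python
import numpy as np

# Synthetic, well-behaved data: independent predictors, independent noise
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.3, 200)

# Multicollinearity check: flag any predictor pair with |correlation| > 0.8
corr = np.corrcoef(X, rowvar=False)
high = np.abs(corr - np.eye(3)) > 0.8
print("High pairwise correlation:", bool(high.any()))

# Autocorrelation check: Durbin-Watson statistic on the OLS residuals
Xb = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
resid = y - Xb @ beta
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print("Durbin-Watson:", round(dw, 2))  # near 2 for uncorrelated residuals
```

On real data, a failed check would prompt remediation (dropping or combining correlated predictors, modeling the serial structure) before trusting OLS estimates.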

### Build a Linear Regression Model from Scratch

Let's build a linear regression model from scratch on a publicly available data set, using both the OLS and SGD techniques in Python.

There are many ways and libraries to build a linear regression model in Python, but we will mainly use scikit-learn, pandas, and NumPy.

Some screenshots from the data wrangling, model building, and model saving activities are shared below.

#### 1. Importing the necessary Python libraries

```python
# Enabling print for all lines
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
```

```python
# Importing the necessary libraries
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline
import seaborn as sns
```

#### 2. Loading the Data Set

**Boston Housing Data**

**Description:**

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small, with only 506 cases.

The data was originally published by Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.

**Dataset Naming:**

The name for this dataset is simply boston. MEDV is the variable we are trying to predict.

**Miscellaneous Details:**

- Origin - the origin of the Boston housing data is natural
- Usage - this dataset may be used for assessment
- Number of Cases - the dataset contains a total of 506 cases
- Order - the order of the cases is mysterious
- Variables - there are 14 attributes in each case of the dataset; they are:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per 10,000 dollars
- PTRATIO - pupil teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner occupied homes in 1000s

```python
# Loading the data set from scikit-learn
from sklearn.datasets import load_boston
boston = load_boston()
type(boston)
```

Output:

sklearn.utils.Bunch

```python
# Converting the built-in data (independent variables) into a data frame
df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Adding the target variable as a column
df['target'] = pd.Series(boston.target)
```

#### 3. Having a look at the different variables from the data set

Input:

```python
# Having a look at the first few observations and basic data properties
df.head()
df.shape
df.info()
```

Output: (screenshot omitted)

#### 4. Insights from the data set

Input:

```python
# Checking the statistical properties of the variables
df.describe().T
```

Output: (screenshot omitted)

Input:

```python
# Checking for column-wise missing values
df.isnull().sum()
```

Output: (screenshot omitted)

Input:

```python
# Building the box plot
rcParams['figure.figsize'] = 20,5
df.boxplot(color=dict(boxes='r', whiskers='r', medians='r', caps='r'))
# sns.set(rc={'figure.figsize':(6,6)})
# sns.set_style("whitegrid")
# sns.boxplot(data=df, orient="h", palette="Set2")
```

Output: (screenshot omitted)

Input:

```python
# Finding and plotting the correlation for the independent variables
sns.set(rc={'figure.figsize':(14,5)})
ind_var = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
# df[ind_var].corr()
sns.heatmap(df[ind_var].corr(), cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15));
```

Output: (screenshot omitted)

Input:

```python
# Building a pair plot to understand the relationship between independent and dependent variables
sns.pairplot(df, x_vars = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM'], y_vars = ['target']);
sns.pairplot(df, x_vars = ['AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], y_vars = ['target']);
```

Output: (screenshot omitted)

From the above pair plots, we can conclude:

- RM and MEDV have a shape resembling a normal distribution
- AGE is skewed to the left and LSTAT is skewed to the right
- TAX has a large concentration of values around 700

#### 5. Building the model using scikit learn

Input:

```python
# Separating out the independent variables and the target
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
```

Input:

```python
# Splitting the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=113)

# Checking the shape of the four parts of the data after the split
X_train.shape, X_test.shape
y_train.shape, y_test.shape
```

Output:

((404, 13), (102, 13))

((404,), (102,))

Input:

```python
# Building the linear regression model using sklearn
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

# Fitting the train data to the model
model = regressor.fit(X_train, y_train)
```

#### 6. Prediction using the model and scoring the model

Input:

```python
# Making predictions on the test data using the above model
y_pred = model.predict(X_test)

# Let's check the model's score (R-squared) on the training data
model.score(X_train, y_train)

# Let's check the model's score (R-squared) on the test data
model.score(X_test, y_test)
```

Output:

0.7345219949558541

0.7505954479592696

Input:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, explained_variance_score
import math

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
mv = explained_variance_score(y_test, y_pred)

print("Mean Absolute Error is", round(mae,1))
print("Root Mean Squared Error is", round(rmse,1))
print("Variance explained by model", round(mv*100,1), "%")
```

Output:

Mean Absolute Error is 3.4

Root Mean Squared Error is 5.0

Variance explained by model 75.1 %

Input:

```python
# Linear regression coefficients and intercept; we can build the regression equation from these parameters
model.coef_
model.intercept_
```

Output:

array([-1.14331280e-01, 3.30399664e-02, 2.19911151e-02, 1.93047806e+00, -1.53459876e+01, 4.11678898e+00, -5.20475977e-03, -1.26111638e+00, 3.52665352e-01, -1.37375084e-02, -1.01521476e+00, 9.98692962e-03, -4.90950094e-01])

33.54862614612348
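These parameters fully determine the fitted model: a prediction is simply the intercept plus the dot product of the coefficients with a feature row. The sketch below verifies this on small synthetic data (the data, seed, and coefficient values are illustrative, not from the Boston set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small noiseless fit, just to show how coef_ and intercept_ rebuild a prediction
rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -0.5]) + 4.0

model = LinearRegression().fit(X, y)

# The regression equation: y_hat = intercept + sum(coef[i] * x[i])
row = X[0]
manual = model.intercept_ + np.dot(model.coef_, row)
match = np.isclose(manual, model.predict(row.reshape(1, -1))[0])
print(match)  # True
```

Writing the equation out this way is also how the coefficients get interpreted in business terms: each coefficient is the change in the prediction per unit change in its feature, holding the others fixed.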

#### 7. Saving the model locally

Input:

```python
# Saving the model
import pickle
filename = 'boston_regression_model.sav'
pickle.dump(model, open(filename, 'wb'))

# Loading the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
result = round(loaded_model.score(X_test, y_test)*100, 2)
print("Model accuracy is", result, "%")
```

Output:

Model accuracy is 75.06 %

### Conclusion

To summarize, linear regression can be very useful in many business situations. However, it has limited applicability in certain scenarios, as it works only when the dependent variable is continuous in nature.
