Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Trying to Fill in Missing Values

DZone's Guide to

Trying to Fill in Missing Values

We try an experiment that may or may not be successful trying to fill in some of the missing values while replicating Poonam Ligade’s exploratory analysis.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

I’ve been playing around with the data in Kaggle’s House Prices: Advanced Regression Techniques and while replicating Poonam Ligade’s exploratory analysis, I wanted to see if I could create a model to fill in some of the missing values.

Poonam wrote the following code to identify which columns in the dataset had the most missing values:

import pandas as pd
train = pd.read_csv('train.csv')
null_columns=train.columns[train.isnull().any()]
 
>>> print(train[null_columns].isnull().sum())
LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

The one that I’m most interested in is LotFrontage, which describes "linear feet of street connected to property." There are a few other columns related to lots so I thought I might be able to use them to fill in the missing LotFrontage values.

We can write the following code to find a selection of the rows missing a LotFrontage value:

cols = [col for col in train.columns if col.startswith("Lot")]
missing_frontage = train[cols][train["LotFrontage"].isnull()]
 
>>> print(missing_frontage.head())
    LotFrontage  LotArea LotShape LotConfig
7           NaN    10382      IR1    Corner
12          NaN    12968      IR2    Inside
14          NaN    10920      IR1    Corner
16          NaN    11241      IR1   CulDSac
24          NaN     8246      IR1    Inside

I want to use scikit-learn's linear regression model, which only works with numeric values, so we need to convert our categorical variables into numeric equivalents. We can use the pandas get_dummies function for this.

Let’s try it out on the LotShape column:

sub_train = train[train.LotFrontage.notnull()]
dummies = pd.get_dummies(sub_train[cols].LotShape)
 
>>> print(dummies.head())
   IR1  IR2  IR3  Reg
0    0    0    0    1
1    0    0    0    1
2    1    0    0    0
3    1    0    0    0
4    1    0    0    0

Cool, that looks good. We can do the same with LotConfig and then we need to add these new columns onto the original DataFrame. We can use pandas concat function to do this.

import numpy as np
 
data = pd.concat([
        sub_train[cols],
        pd.get_dummies(sub_train[cols].LotShape),
        pd.get_dummies(sub_train[cols].LotConfig)
    ], axis=1).select_dtypes(include=[np.number])
 
>>> print(data.head())
   LotFrontage  LotArea  IR1  IR2  IR3  Reg  Corner  CulDSac  FR2  FR3  Inside
0         65.0     8450    0    0    0    1       0        0    0    0       1
1         80.0     9600    0    0    0    1       0        0    1    0       0
2         68.0    11250    1    0    0    0       0        0    0    0       1
3         60.0     9550    1    0    0    0       1        0    0    0       0
4         84.0    14260    1    0    0    0       0        0    1    0       0

We can now split data into train and test sets and create a model.

from sklearn import linear_model
from sklearn.model_selection import train_test_split
 
X = data.drop(["LotFrontage"], axis=1)
y = data.LotFrontage
 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
 
lr = linear_model.LinearRegression()
 
model = lr.fit(X_train, y_train)

Now it’s time to give it a try on the test set:

>>> print("R^2 is: \n", model.score(X_test, y_test))
R^2 is: 
 -0.84137438493

Hmm, that didn’t work too well. An R^2 score of less than 0 suggests that we’d be better off just predicting the average LotFrontage regardless of any of the other features. We can confirm that with the following code:

from sklearn.metrics import r2_score
 
>>> print(r2_score(y_test, np.repeat(y_test.mean(), len(y_test))))
0.0

Whereas if we had all of the values correct, we’d get a score of 1:

>>> print(r2_score(y_test, y_test))
1.0

In summary, not a very successful experiment. Poonam derives a value for LotFrontage based on the square root of  LotArea, so perhaps that’s the best we can do here.

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
tutorial ,kaggle ,big data ,pandas ,data analytics

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}