Trying to Fill in Missing Values
An experiment, which may or may not be successful, in filling in some of the missing values while replicating Poonam Ligade’s exploratory analysis.
I’ve been playing around with the data from Kaggle’s House Prices: Advanced Regression Techniques competition, and while replicating Poonam Ligade’s exploratory analysis, I wanted to see if I could create a model to fill in some of the missing values.
Poonam wrote the following code to identify which columns in the dataset had the most missing values:
import pandas as pd

train = pd.read_csv('train.csv')
null_columns = train.columns[train.isnull().any()]

>>> print(train[null_columns].isnull().sum())
LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64
The one that I’m most interested in is LotFrontage, which describes "linear feet of street connected to property." There are a few other columns related to lots, so I thought I might be able to use them to fill in the missing LotFrontage values.
We can write the following code to find a selection of the rows that are missing a LotFrontage value:
cols = [col for col in train.columns if col.startswith("Lot")]
missing_frontage = train[cols][train["LotFrontage"].isnull()]

>>> print(missing_frontage.head())
    LotFrontage  LotArea LotShape LotConfig
7           NaN    10382      IR1    Corner
12          NaN    12968      IR2    Inside
14          NaN    10920      IR1    Corner
16          NaN    11241      IR1   CulDSac
24          NaN     8246      IR1    Inside
I want to use scikit-learn's linear regression model, which only works with numeric values, so we need to convert our categorical variables into numeric equivalents. We can use the pandas
get_dummies function for this.
Let’s try it out on the LotShape column:
sub_train = train[train.LotFrontage.notnull()]
dummies = pd.get_dummies(sub_train[cols].LotShape)

>>> print(dummies.head())
   IR1  IR2  IR3  Reg
0    0    0    0    1
1    0    0    0    1
2    1    0    0    0
3    1    0    0    0
4    1    0    0    0
Cool, that looks good. We can do the same with LotConfig, and then we need to add these new columns onto the original DataFrame. We can use the pandas concat function to do this.
import numpy as np

data = pd.concat([
    sub_train[cols],
    pd.get_dummies(sub_train[cols].LotShape),
    pd.get_dummies(sub_train[cols].LotConfig)
], axis=1).select_dtypes(include=[np.number])

>>> print(data.head())
   LotFrontage  LotArea  IR1  IR2  IR3  Reg  Corner  CulDSac  FR2  FR3  Inside
0         65.0     8450    0    0    0    1       0        0    0    0       1
1         80.0     9600    0    0    0    1       0        0    1    0       0
2         68.0    11250    1    0    0    0       0        0    0    0       1
3         60.0     9550    1    0    0    0       1        0    0    0       0
4         84.0    14260    1    0    0    0       0        0    1    0       0
We can now split the data into train and test sets and create a model.
from sklearn import linear_model
from sklearn.model_selection import train_test_split

X = data.drop(["LotFrontage"], axis=1)
y = data.LotFrontage

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

lr = linear_model.LinearRegression()
model = lr.fit(X_train, y_train)
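As an aside, if the model turns out to be decent, actually filling in the gaps would look something like the sketch below. This isn't from the original post: missing and X_missing are names I've introduced, and the reindex call is there to align the dummy columns with the training features.

# Hypothetical follow-up: build the same features for the rows where
# LotFrontage is missing, then predict it with the fitted model.
missing = train[train.LotFrontage.isnull()]

X_missing = pd.concat([
    missing[cols],
    pd.get_dummies(missing[cols].LotShape),
    pd.get_dummies(missing[cols].LotConfig)
], axis=1).select_dtypes(include=[np.number])

# Drop the (all-NaN) target column and align the dummy columns with the
# training features, filling in any categories that didn't occur with 0.
X_missing = X_missing.drop(["LotFrontage"], axis=1).reindex(columns=X.columns, fill_value=0)

predictions = model.predict(X_missing)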
Now it’s time to give it a try on the test set:
>>> print("R^2 is: \n", model.score(X_test, y_test)) R^2 is: -0.84137438493
Hmm, that didn’t work too well. An R^2 score of less than 0 suggests that we’d be better off just predicting the average
LotFrontage regardless of any of the other features. We can confirm that with the following code:
from sklearn.metrics import r2_score

>>> print(r2_score(y_test, np.repeat(y_test.mean(), len(y_test))))
0.0
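Equivalently, scikit-learn ships a DummyRegressor that encapsulates this mean-predicting baseline. It's not in the original post, but it makes a handy cross-check:

from sklearn.dummy import DummyRegressor

# A mean-predicting baseline: fit on the training target, then score on the
# test set. R^2 lands at (or just below) 0, matching the r2_score check above.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
print(baseline.score(X_test, y_test))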
Whereas if we had all of the values correct, we’d get a score of 1:
>>> print(r2_score(y_test, y_test))
1.0
In summary, not a very successful experiment. Poonam derives a value for
LotFrontage based on the square root of
LotArea, so perhaps that’s the best we can do here.
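For reference, a minimal sketch of that fallback, assuming a direct square-root mapping (this is my reading of Poonam's approach rather than code from her notebook):

# Assumed fallback: approximate LotFrontage as sqrt(LotArea) and use it
# to fill in the missing values in place.
train["LotFrontage"] = train["LotFrontage"].fillna(np.sqrt(train["LotArea"]))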
Published at DZone with permission of Mark Needham, DZone MVB.