Cleaning Data 101: Imputing NULLs
NULL values can definitely drive you crazy. Learn how to get rid of the NULLs in your data sets and get that data clean!
Join the DZone community and get the full member experience.
Join For FreeEven though it seems like a bit of a grind, cleaning your data can be the most creative part of the job.
If you’re doing any sort of machine learning with your data, NULL values in your set are going to drive you mental.
So, my pretties, let’s start at the beginning and impute the empty data from a set (for those of you who are new to big data, imputing is just a fancy way of saying ‘replace’).
I’ve been using the Titanic data, which is a fairly popular learning set, you can find it here: Titanic Data.
I’ve already imported the csv file:
titanic=pd.read_csv("titanic_train.csv")
The first thing to do is create a nice little heat map to see where the NULLS are:
sns.heatmap(titanic.isnull(),yticklabels=False,cbar=False,cmap='cubehelix')
The 'cmap' value in the above command will determine the color pallet used in your heat map. Feel free to search for other options if you don't like the white on black.
We see in the above plot that there are several NULL values (in white). The first ones we'll tackle will be in the Age column.
There are lots of ways to impute these values. I've decided to find the average age of each of the 3 possible Cabin values, and apply this average to each of the missing values determined by which Cabin the missing passenger was traveling in. The function is a bit sloppy, as was my description, but here's what I concocted:
def impute_age(cols):
Age = cols[0]
Pclass = cols[1]
ageAv = titanic.groupby('Pclass', as_index=False)['Age'].mean()
if pd.isnull(Age):
if Pclass == 1:
return ageAv.loc[0][1]
elif Pclass == 2:
return ageAv.loc[1][1]
else:
return ageAv.loc[2][1]
else:
return Age
And to apply it to the Age values in your data use this:
titanic['Age'] = titanic[['Age','Pclass']].apply(impute_age,axis=1)
Now your heatmap should look like this:
The next column to tackle is the Cabin value. Since there are tonnes of NULL values in this, and since we don't really need it anyway, let's just drop the whole thing:
titanic.drop('Cabin',axis=1,inplace=True)
Your plot should now look like this:
That little one remaining guy we'll just scrap too:
titanic.dropna(inplace=True)
And voila:
No more NULL values! And you've still got a solid set of data to use for more exciting things to come.
Published at DZone with permission of Matt Hughes. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments