{{announcement.body}}
{{announcement.title}}

Cleaning Data 101: Imputing NULLs

DZone 's Guide to

Cleaning Data 101: Imputing NULLs

NULL values can definitely drive you crazy. Learn how to get rid of the NULLs in your data sets and get that data clean!

· Big Data Zone ·
Free Resource

Data Science Learning

Even though it seems like a bit of a grind, cleaning your data can be the most creative part of the job.

If you’re doing any sort of machine learning with your data, NULL values in your set are going to drive you mental.

So, my pretties, let’s start at the beginning and impute the empty data from a set (for those of you who are new to big data, imputing is just a fancy way of saying ‘replace’).

I’ve been using the Titanic data, which is a fairly popular learning set, you can find it here: Titanic Data.

I’ve already imported the csv file:

titanic=pd.read_csv("titanic_train.csv")

The first thing to do is create a nice little heat map to see where the NULLS are:

sns.heatmap(titanic.isnull(),yticklabels=False,cbar=False,cmap='cubehelix')


Image title

The 'cmap' value in the above command will determine the color pallet used in your heat map.  Feel free to search for other options if you don't like the white on black.

We see in the above plot that there are several NULL values (in white). The first ones we'll tackle will be in the Age column.

There are lots of ways to impute these values. I've decided to find the average age of each of the 3 possible Cabin values, and apply this average to each of the missing values determined by which Cabin the missing passenger was traveling in. The function is a bit sloppy, as was my description, but here's what I concocted:

def impute_age(cols):
  Age = cols[0]
  Pclass = cols[1]
  ageAv = titanic.groupby('Pclass', as_index=False)['Age'].mean()
  if pd.isnull(Age):
    if Pclass == 1:
      return ageAv.loc[0][1]
    elif Pclass == 2:
      return ageAv.loc[1][1]
    else:
      return ageAv.loc[2][1]
    else:
      return Age

And to apply it to the Age values in your data use this:

titanic['Age'] = titanic[['Age','Pclass']].apply(impute_age,axis=1)

Now your heatmap should look like this:

Image title

The next column to tackle is the Cabin value.  Since there are tonnes of NULL values in this, and since we don't really need it anyway, let's just drop the whole thing:

titanic.drop('Cabin',axis=1,inplace=True)

Your plot should now look like this:

Image title

That little one remaining guy we'll just scrap too:

titanic.dropna(inplace=True)

And voila:

Image title

No more NULL values!  And you've still got a solid set of data to use for more exciting things to come.

mattdata.com

Topics:
big data, data cleaning, data science, impute data

Published at DZone with permission of Matt Hughes . See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}