Cleaning Data 101: Imputing NULLs

NULL values can definitely drive you crazy. Learn how to get rid of the NULLs in your data sets and get that data clean!

by Matt Hughes · Aug. 20, 18 · Tutorial

Even though it seems like a bit of a grind, cleaning your data can be the most creative part of the job.

If you’re doing any sort of machine learning with your data, NULL values in your set are going to drive you mental.

So, my pretties, let’s start at the beginning and impute the empty data from a set (for those of you who are new to big data, imputing is just a fancy way of saying ‘replace the missing values with sensible stand-ins’).

I’ve been using the Titanic data, which is a fairly popular learning set. You can find it here: Titanic Data.

I’ve already imported the CSV file with pandas:

import pandas as pd
titanic = pd.read_csv("titanic_train.csv")

The first thing to do is create a nice little heat map with seaborn to see where the NULLs are:

import seaborn as sns
sns.heatmap(titanic.isnull(), yticklabels=False, cbar=False, cmap='cubehelix')


[Image: heat map of the NULL values in the Titanic data]

The 'cmap' value in the above command determines the color palette used in your heat map. Feel free to search for other options if you don't like the white on black.

We see in the above plot that there are several NULL values (in white). The first ones we'll tackle will be in the Age column.
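
If you'd rather have numbers to go with the picture, something like the following should work. It is only a sketch: the tiny `sample` frame is a made-up stand-in for the Titanic data so the snippet runs on its own, and `null_counts` is just a name I picked.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic data (assumption: not the real file)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan],
    "Pclass": [3, 1, 1, 3],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# isnull() gives a True/False mask; summing it counts the NULLs per column
null_counts = sample.isnull().sum()
print(null_counts)
```

On the real data you would call the same thing on `titanic` instead of `sample`.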

There are lots of ways to impute these values. I've decided to find the average age for each of the 3 possible Pclass (passenger class) values, and give each passenger with a missing age the average for the class they were traveling in. The function is a bit sloppy, as was my description, but here's what I concocted:

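def impute_age(cols):
  # cols is a single row holding that passenger's 'Age' and 'Pclass'
  Age = cols['Age']
  Pclass = cols['Pclass']
  # Average age per passenger class, indexed by the Pclass value (1, 2 or 3)
  ageAv = titanic.groupby('Pclass')['Age'].mean()
  if pd.isnull(Age):
    # Missing age: use the average for this passenger's class
    return ageAv.loc[Pclass]
  else:
    # Age is already known: leave it alone
    return Age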

And to apply it to the Age values in your data use this:

titanic['Age'] = titanic[['Age','Pclass']].apply(impute_age,axis=1)
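
For what it's worth, the same idea can probably be written without a row-by-row function by leaning on pandas' `groupby(...).transform`. This is a sketch of an alternative, not the approach used above; the `sample` frame and the name `class_mean_age` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic data (assumption: not the real file)
sample = pd.DataFrame({
    "Age": [30.0, np.nan, 40.0, 20.0, np.nan, 10.0],
    "Pclass": [1, 1, 1, 3, 3, 3],
})

# Mean age of each row's passenger class, lined up row by row with the frame
# (NULL ages are skipped when the mean is calculated)
class_mean_age = sample.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing ages; ages that are already known stay untouched
sample["Age"] = sample["Age"].fillna(class_mean_age)
print(sample)
```

The averages are worked out once per class rather than once per row, which should matter more as the data set grows.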

Now your heatmap should look like this:

[Image: heat map after imputing the Age column]

The next column to tackle is Cabin. Since there are tonnes of NULL values in it, and since we don't really need it anyway, let's just drop the whole thing:

titanic.drop('Cabin',axis=1,inplace=True)
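
If you want something less hand-wavy than "tonnes", one option is to look at the fraction of NULLs in each column and drop anything over a cutoff. A minimal sketch, assuming a made-up `sample` frame and an arbitrary cutoff of one half (the `threshold` value is my own pick, not a rule):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic data (assumption: not the real file)
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

# Fraction of NULLs per column (the mean of a True/False mask)
null_fraction = sample.isnull().mean()

# Arbitrary cutoff for illustration: drop columns that are more than half empty
threshold = 0.5
too_sparse = null_fraction[null_fraction > threshold].index.tolist()
cleaned = sample.drop(columns=too_sparse)
print(too_sparse)
```

Unlike `inplace=True`, this returns a new frame and leaves the original alone, which makes it easier to back out if you change your mind.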

Your plot should now look like this:

[Image: heat map after dropping the Cabin column]

That one little remaining NULL we'll just scrap too, by dropping any row that still contains one:

titanic.dropna(inplace=True)
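
Since `dropna` throws away whole rows, it can be worth checking how many you lose. A rough sketch, again on a made-up `sample` frame (the column names are only there for illustration):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic data (assumption: not the real file)
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

rows_before = len(sample)
cleaned = sample.dropna()  # drops every row that has at least one NULL
rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")

# Total number of NULLs left anywhere in the cleaned frame
remaining_nulls = int(cleaned.isnull().sum().sum())
print(remaining_nulls)
```

If the number of dropped rows looks large, imputing that column instead may be the better call.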

And voila:

[Image: heat map with no NULL values remaining]

No more NULL values!  And you've still got a solid set of data to use for more exciting things to come.

mattdata.com

Published at DZone with permission of Matt Hughes. See the original article here.

Opinions expressed by DZone contributors are their own.
