Imputing Missing Data Using Sklearn SimpleImputer

In this post, learn how to use Python's Sklearn SimpleImputer for imputing/replacing numerical and categorical missing data using different strategies.

By Ajitesh Kumar · Aug. 18, 20 · Tutorial

In this post, you will learn how to use Python's Sklearn SimpleImputer for imputing (replacing) numerical and categorical missing data using different strategies. A related article posted earlier, Replace missing values with mean, median and mode, discusses the fillna method of the pandas DataFrame. Handling missing values is a key part of data preprocessing, so data scientists and machine learning engineers need to know the different techniques for replacing numerical or categorical missing values with a suitable value chosen by a suitable strategy.

The following topics will be covered in this post:

  • SimpleImputer explained with Python code example
  • SimpleImputer for imputing numerical missing data
  • SimpleImputer for imputing categorical missing data

SimpleImputer Explained With Python Code Example

SimpleImputer is a class in the sklearn.impute package. It is used to impute (replace) numerical or categorical missing data in one or more features with appropriate values such as the following:

  • Mean
  • Median
  • Most frequent (mode)
  • Constant

Each of the above types corresponds to a strategy passed when creating an instance of SimpleImputer. Here is a Python code sample showing the usage of SimpleImputer for replacing numerical missing values with the mean.

First, let's create a sample pandas DataFrame representing the marks, gender, and result of students.

import pandas as pd
import numpy as np

# Each row is [marks, gender, result]; np.nan and None mark the missing values
students = [[85, 'M', 'verygood'],
            [95, 'F', 'excellent'],
            [75, None, 'good'],
            [np.nan, 'M', 'average'],
            [70, 'M', 'good'],
            [np.nan, None, 'verygood'],
            [92, 'F', 'verygood'],
            [98, 'M', 'excellent']]

dfstd = pd.DataFrame(students)
dfstd.columns = ['marks', 'gender', 'result']



Fig 1. Sample data used to illustrate SimpleImputer usage


Two columns (features) have missing values that need to be imputed: one numerical (marks) and one categorical (gender). In the code below, an instance of SimpleImputer is created with the strategy set to "mean". The missing values are represented using NaN. Note the following:

  • The SimpleImputer class is imported from the sklearn.impute package.
  • SimpleImputer is given two arguments here: missing_values and strategy.
  • The fit_transform method is invoked on the SimpleImputer instance to impute the missing values.

from sklearn.impute import SimpleImputer

# Missing values are represented as NaN here, so np.nan is passed as
# missing_values. If missing entries were empty strings instead,
# missing_values would be set to ''.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform expects a 2D array, hence the reshape to a single column
dfstd.marks = imputer.fit_transform(dfstd['marks'].values.reshape(-1,1))[:,0]

dfstd


Here is what the output looks like. Note that the missing values of marks are replaced with the mean of the observed marks, 85.83333.


Fig 2. Numerical missing values imputed with mean using SimpleImputer
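
As a quick sanity check on the 85.83333 figure, the sketch below recomputes the mean of the six observed marks by hand and compares it with the value SimpleImputer learned. It assumes a plain NumPy array in place of the DataFrame, and the variable names are illustrative only. The statistics_ attribute is where a fitted SimpleImputer stores one fill value per column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same marks column as in the sample data, with two missing entries
marks = np.array([85, 95, 75, np.nan, 70, np.nan, 92, 98], dtype=float).reshape(-1, 1)

# Mean of the observed values only: (85 + 95 + 75 + 70 + 92 + 98) / 6 = 515 / 6
manual_mean = np.nanmean(marks)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed = imputer.fit_transform(marks)

# statistics_ holds the fill value learned for each column
learned_mean = imputer.statistics_[0]

print(round(float(manual_mean), 5))
print(round(float(learned_mean), 5))
print(imputed[:, 0])
```

Both printed means should agree, and rows 3 and 5 of the imputed column should carry that same value.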


SimpleImputer for Imputing Numerical Missing Data

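For numerical missing data, the following strategies can be used:

  • Mean (strategy='mean')
  • Median (strategy='median')
  • Most frequent (strategy='most_frequent')
  • Constant (strategy='constant', fill_value=someValue)

The code example below shows the instantiation of SimpleImputer with each of these strategies for imputing numerical missing data.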

# Imputing with the mean value
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Imputing with the median value
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

# Imputing with the most frequent (mode) value
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Imputing with a constant value; the line below replaces the missing
# values with the constant 80
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=80)
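
To see how the four strategies differ in practice, the following sketch applies each one to the marks column and records the value that ends up in a row that was originally missing. It is a minimal illustration, assuming a NumPy array in place of the DataFrame; the configs list and fill_values dictionary are names introduced here, not part of the original example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Marks column from the sample data; rows 3 and 5 are missing
marks = np.array([85, 95, 75, np.nan, 70, np.nan, 92, 98], dtype=float).reshape(-1, 1)

# (strategy, extra keyword arguments) pairs; 80 is the constant used above
configs = [
    ('mean', {}),
    ('median', {}),
    ('most_frequent', {}),
    ('constant', {'fill_value': 80}),
]

fill_values = {}
for strategy, extra in configs:
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy, **extra)
    imputed = imputer.fit_transform(marks)
    # Row 3 was missing in the input, so it now holds the fill value
    fill_values[strategy] = float(imputed[3, 0])

for strategy, value in fill_values.items():
    print(f"{strategy:>14}: {value:.2f}")
```

With this data every mark occurs exactly once, so most_frequent returns the smallest of the tied values (70). That behaviour is one reason the mode tends to suit discrete columns better than continuous ones.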


SimpleImputer for Imputing Categorical Missing Data

For handling categorical missing values, you can use one of the following strategies. Of the two, "most_frequent" is usually the preferred choice.

  • Most frequent (strategy='most_frequent')
  • Constant (strategy='constant', fill_value='someValue')

Here is what the code looks like when imputing missing values with the most_frequent strategy. In the sample data used in this post, the gender column has missing values. Note how they are replaced with 'M', the value that occurs most frequently.

from sklearn.impute import SimpleImputer

# Missing gender entries are stored as None, so None is passed as missing_values
imputer = SimpleImputer(missing_values=None, strategy='most_frequent')
dfstd.gender = imputer.fit_transform(dfstd['gender'].values.reshape(-1,1))[:,0]
dfstd



Fig 3. Categorical missing values imputed with most_frequent using SimpleImputer
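
The choice of 'M' can be verified by counting the observed labels directly. The sketch below does that with collections.Counter and then runs the same most_frequent imputation on a NumPy object array; the array and variable names are assumptions made for this illustration rather than part of the original code.

```python
from collections import Counter

import numpy as np
from sklearn.impute import SimpleImputer

# Gender column from the sample data; None marks the missing entries
gender = np.array(['M', 'F', None, 'M', 'M', None, 'F', 'M'], dtype=object).reshape(-1, 1)

# Count the observed labels by hand, ignoring the missing ones
counts = Counter(value for value in gender[:, 0] if value is not None)

imputer = SimpleImputer(missing_values=None, strategy='most_frequent')
imputed = imputer.fit_transform(gender)

print(counts.most_common())
print(imputed[:, 0])
```

'M' appears four times and 'F' twice among the observed rows, so both missing entries become 'M'.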


Here is what the code looks like when imputing missing values with the constant strategy. Note how the missing values in the gender column are replaced with 'F', which is assigned using the fill_value parameter.

from sklearn.impute import SimpleImputer

# Run this on a freshly created dfstd: if gender was already imputed by an
# earlier step, no missing values remain to be replaced
imputer = SimpleImputer(missing_values=None, strategy='constant', fill_value='F')
dfstd.gender = imputer.fit_transform(dfstd['gender'].values.reshape(-1,1))[:,0]
dfstd



Fig 4. Categorical missing values imputed with constant using SimpleImputer
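
A constant does not have to be one of the existing categories. As a hedged variation on the example above, the sketch below fills the gaps with a separate placeholder label so that imputed rows stay distinguishable from observed ones. The label 'unknown' is a hypothetical choice made for this illustration and is not taken from the original example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Gender column from the sample data; None marks the missing entries
gender = np.array(['M', 'F', None, 'M', 'M', None, 'F', 'M'], dtype=object).reshape(-1, 1)

# Record which positions are missing before imputing
missing_rows = [i for i, value in enumerate(gender[:, 0]) if value is None]

# 'unknown' is a hypothetical placeholder label, not a value from the sample data
imputer = SimpleImputer(missing_values=None, strategy='constant', fill_value='unknown')
imputed = imputer.fit_transform(gender)

print(missing_rows)
print(imputed[:, 0])
```

Whether a placeholder or an existing category is the better fill value depends on how the downstream model treats an extra category.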


Conclusions

Here is a summary of what you learned in this post:

  • You can use the SimpleImputer class from sklearn.impute to impute (replace) missing values for both numerical and categorical features.
  • For numerical missing values, the mean, median, most frequent, and constant strategies can be used.
  • For categorical features, the most frequent and constant strategies can be used.

Published at DZone with permission of Ajitesh Kumar. See the original article here.
