Practical Strategies to Handle Missing Values
No data set is going to be perfect, and some might not even contain all the values you need. Learn how to get around this using statistics.
Join the DZone community and get the full member experience.Join For Free
One of the major challenges in most BI projects is to figure out a way to get clean data. 60 to 80 percent of the total time is spent on cleaning the data before you can make any meaningful sense of it. This is true for both BI and Predictive Analytics projects. To improve the effectiveness of the data cleaning process, the current trend is to migrate from the manual data cleaning to more intelligent machine learning-based processes.
Identify the Type of Missing Values We Are Dealing With
Before we dig into figuring out how to handle missing values, it's critical to figure out the nature of the missing values. There are three possible types, depending on if there exists a relationship between the missing data with the other data in the dataset.
- They may be no pattern to why a column’s data is missing with other columns in your data and is referred to as missing completely at random (MCAR). An example of this could be that survey data is missing because someone could not make it to an appointment. Or an administrator misplaces test results he/she was supposed to enter into the computer. Reasons for missing values are unrelated to the data in the data set.
- There are scenarios where data missing in a column can be explained by data in other columns. This is referred to as this missing at random (MAR). An example of this could be in a school where students who get above the cutoff scores are typically given grades. This means if the grades are missing for some students it can be explained by the column that has the scores that are below the cutoff. Reasons for missing values can be described by data in another column.
- There are scenarios where the value that is missing is related to the value itself. For example, higher-income earneres may not disclose their income. There is a correlation between missing values to the actual income and it's not dependent on other variables in the dataset. This is referred to as not missing at random (MNAR)
In general, deleting missing rows is okay for MCAR. Imputating the data is suitable for both MCAR and MAR. If it's MNAR, one has to model this by introducing additional columns.
Strategies for Dealing With Missing Data
There are several strategies to deal with missing values. To start with one can use algorithms like random forest or KNN which are robust in handling missing values. The first common strategy is to delete the rows with missing values. Typically any row which has a missing value in any cell gets deleted. Sometimes there's a high possibility that many rows will get removed. Because of the loss of information and data, when there are not enough samples, typically this method is not used.
Imputing the missing data can be done in a variety of ways. It could be based solely on information that exists in the column that has missing values. Or it can also be based on other columns present in the data set. Lastly, one can use also classification or regression models to predict the missing value(s). We will discuss those with examples below.
Dealing With Missing Values in Numerical Columns
The first approach is to replace the missing value(s) with one of the following strategies.
- Replace it with a constant value. Typically, this is used in discussion with the domain expert for the data we are dealing with.
- Replace it with the mean or the median. This is a decent approach especially when the data size small, but it does add bias.
- Replace it with values by making use of information from other columns. We will discuss this more with examples below.
In the employee dataset subset above, we have salaries missing in three rows. We also have their state of residence and their years of experience in the data set.
The first approach is to fill the missing values with the mean of the column. Here we are solely using the information from the column which has missing values.
We can do a little better with the help of a domain expert and use information from the other columns from the dataset.
The average salary is different for different states. Use that for filling in the values. Calculate the average salary of people working in Texas and replace the row where the salary is missing for Texas with the average salary for people who work in Texas. Do the same for other states.
What else can we do better? We can make use of the years of experience in addition to the state column.
Calculate the average entry-level salary of people working in Texas and replace the row where the salary is missing for an entry level person in Texas with the average entry-level salary in Texas. Do the same for other states and other levels.
That provides you a good set of ways to impute data, especially when we have a domain expert to guide us. Please note that there are some boundary conditions that need to be handled. For example, there could be a row for a person living in Texas that could have both salary as well as the years of experience missing. There are multiple ways to handle this. A straightforward way is to replace the missing value with the average salary in Texas.
Predicting the Missing Value Using an Algorithm
Another approach is to create a simple regression model. The column to predict here is the salary using other columns in the data set. If there are missing values in the input columns, we have to handle those conditions when creating the predictive model. There are many ways to manage that but a simple way is to pick the features that do not have missing values or take the rows that do not have missing values in any of the cells. One can experiment with different techniques and different algorithms antique the one that gives the best accuracy.
Dealing With Missing Values in Categorical Columns
Dealing with missing values in a categorical column is a lot easier than dealing with the same in numerical columns.
1. Replace it with a constant value or the most popular category. This is a good approach especially when the data size is small — but it does add bias. For example, say we a column for education with values for high school and college degrees. If there are more people with a college degree in the dataset, we can replace the missing value with a college degree.
2. We can tweak this more by making use of information in other columns. If there are more people with high school degrees from Texas in the data set, we can replace the missing values to refelct people who graduated from high school in Texas.
3. One can also create a classification model. The column we want to predict here is education by using the other columns in the data set.
4. The most common and popular approach is to model the missing value in a categorical column as a new category called “Unknown.”
We discussed several strategies to deal with missing values. Depending on the kind of data and the type of problem, the approach applied will be slightly different. If you have access to a domain expert, it'll be good to incorporate that expert advice when filling in the missing values.
It's important to note that whichever input methods are chosen, it's good to run the predictive model to see which method works best from the standpoint of accuracy.
Published at DZone with permission of Sriram Parthasarathy. See the original article here.
Opinions expressed by DZone contributors are their own.