We often encounter data sets in which not all values are available. This can occur for many reasons: typing errors, unanswered survey questions, a malfunctioning machine, and so on. Encountering missing values is typical in data mining tasks, and dealing with them is significant work that can consume much of a data scientist's time. We have various methods to impute the values, but deciding which one to use is a skill. Sometimes it is better to remove the null values directly, and for some tasks it is better to use sophisticated mining techniques to impute the values.
So, two important questions arise: When do you use what? What are the advantages and disadvantages of each method? It's not just the science of various techniques for missing value imputation; choosing the best method is an art altogether. I believe it comes from experience and from doing a comparative analysis to select the imputation method that assigns attributes the least biased values.
When I see missing values in the data, I try to investigate why they are missing in the first place. Were they introduced by some issue in the process, or by some other problem? If we can find the cause of the missing values, then we can fix that, which is better than any imputation. However, sometimes missing values are just missing, and nothing can be done about that. For example, some people don't like to disclose their salaries, and we cannot force them to reveal them. In this scenario, we can either remove those cases or go with missing data imputation. The decision should be based on the analysis strategy that yields the least biased estimate.
In short, we start with the 'IAD rule' below for missing value estimation:
1. Investigate why data is missing.
2. Analyze the distribution of missing data.
3. Decide on the best strategy that yields the least biased estimates.
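As an illustrative sketch, the first two steps can be carried out with pandas. The DataFrame `df` and its columns here are hypothetical, made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: 'salary' was left blank by two respondents.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 29],
    "salary": [50000, np.nan, 62000, np.nan, 48000],
})

# Investigate: which columns have missing values, and how many?
missing_counts = df.isnull().sum()

# Analyze: what percentage of each column is missing?
missing_pct = df.isnull().mean() * 100

print(missing_counts["salary"])  # 2 of 5 salaries are missing
print(missing_pct["salary"])     # 40.0
```

Looking at how much is missing, and in which columns, is what informs the "decide" step that follows.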
You can determine the best strategy to impute missing values from the various available techniques, or innovate one yourself. After all, it is our job to figure out what works best for our data set.
Below are some of the ways we can deal with missing values:
1. Ignore or delete the data - You can simply drop the cases with missing values and analyze only the complete records. This method is known as listwise deletion. Because we discard records, we may lose useful information from the data. Pairwise deletion is another technique: each analysis uses all cases in which the variables of interest are present. The disadvantage is that different analyses may be based on different subsets of cases, so results can differ from one analysis to the next.
2. Use a global constant to fill in the missing value - Sometimes a data set holds "empty" as well as "missing" values. An empty value carries no data but is still a vital ingredient of the analytical exercise. If we think the missing values cannot be differentiated from "empty" ones, then we can treat them as a category distinct from the rest of the data set and fill them with a global constant such as 'NA.'
3. Use domain knowledge to replace the missing value - A domain expert or field specialist can sometimes suggest a proper value for the missing data.
4. Use the attribute mean/median/mode - Replace missing values with the sample mean or median (if numerical) or the mode (if categorical). The disadvantage is that this reduces the variability in the data, which weakens correlation estimates.
5. Use an indicator variable for missing values - Impute the missing values and create a binary indicator variable that denotes whether each value is real or imputed. Results are not biased if a value is missing because of a genuine skip.
6. Use a data mining algorithm to predict the most probable value - We can use algorithms such as linear regression, decision trees, random forests, or the KNN method to predict the most likely value of the missing attribute. We will discuss these methods more in future posts. The disadvantage is that this might overfit the data if we plan to use the same data set for a prediction task with similar algorithms.
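Most of the options above can be sketched with pandas and scikit-learn. This is a minimal illustration on a made-up DataFrame, not a recommendation of any one method; scikit-learn's `KNNImputer` stands in for the model-based approach in point 6, and domain-knowledge replacement (point 3) is omitted because it is not mechanical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data set with missing numeric and categorical values.
df = pd.DataFrame({
    "age": [25.0, 32.0, 47.0, 51.0, 29.0, 38.0],
    "salary": [50000.0, np.nan, 62000.0, 58000.0, np.nan, 55000.0],
    "dept": ["IT", "HR", None, "IT", "HR", "IT"],
})

# 1a. Listwise deletion: drop every row that has any missing value.
listwise = df.dropna()

# 1b. Pairwise use of data: DataFrame.corr() computes each correlation
#     from all cases where both variables are present.
pairwise_corr = df[["age", "salary"]].corr()

# 2. Global constant: flag missing categories with a constant label.
const_filled = df.assign(dept=df["dept"].fillna("NA"))

# 4. Mean/mode imputation: sample mean for numeric, mode for categorical.
mean_filled = df.assign(
    salary=df["salary"].fillna(df["salary"].mean()),
    dept=df["dept"].fillna(df["dept"].mode()[0]),
)

# 5. Indicator variable: remember which salary values were imputed.
indicator = df.assign(
    salary_imputed=df["salary"].isnull(),
    salary=df["salary"].fillna(df["salary"].median()),
)

# 6. Model-based imputation: KNN fills missing numerics from the
#    nearest complete rows.
knn = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(
    knn.fit_transform(df[["age", "salary"]]), columns=["age", "salary"]
)
```

In practice you would compare these candidates on held-out data, as the comparative-analysis approach above suggests, rather than pick one by default.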
We can see from the above that there are different methods to impute data of various types. Broadly, the techniques fall into two categories based on data type: numeric and categorical. However, there can also be very specific techniques for particular analytical exercises, such as time series data. In the next post, we will walk through the different methods to impute missing values, look at some R packages that make these replacements easy, and compare the different imputation methods to find out which one gives us a better result for a prediction exercise.