Cleaning Data 102: Pesky Texty
A tutorial on how to use Python to turn letters into numbers so your machine can read the value and better impute the data.
If you’re going to be doing any analysis or machine learning with your data, it’s very important to make sure that your data is readable by … a machine! Imagine that. This often means getting rid of, or converting (encoding, in smart speak), any data that isn’t in a numerical format.
Computers love numbers. I also love numbers.
But it turns out that computers still hate words.
So, in order to start, it's important to try and turn your pesky letters into numbers.
Luckily Python can help.
Like a lot of data science stuff, you have to make more stuff first before you can make less stuff later.
That’s a sick quote!
Let’s just try dealing with Sex. This is a good attribute to start with because, in this dataset, it can only consist of two values: male or female.
We’ll use the Titanic data again.
When we originally take a look at the Sex column of the data, this is what we have:
0      male
1    female
2    female
3    female
4      male
5      male
6      male
7      male
8    female
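If you want to follow along without the full file, here is a minimal sketch that rebuilds just those nine rows. The small `titanic` frame below is a hypothetical stand-in for the real Titanic data, and `sex_counts` is a name introduced only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data: only the nine Sex values shown above.
titanic = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female"]
})

# Look at the raw text column.
print(titanic["Sex"].head(9))

# Count the distinct values, to confirm there are only two of them.
sex_counts = titanic["Sex"].value_counts()
print(sex_counts)
```

Checking the distinct values first is a cheap way to catch surprises such as stray capitalization or missing entries before you encode anything.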
Essentially, what we want to do is convert the ‘male’ and ‘female’ values into ones and zeros, because that’s what computers like.
They also like taking over the Earth and destroying all life.
But that's also irrelevant.
Pandas is beautiful, and thankfully it gives us a very simple way of doing what we need to do here with the get_dummies function.
What this function does is create a new dataset and split all the possible values of your input data into new columns containing numerical data:
sex = pd.get_dummies(titanic['Sex'],drop_first=True)
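Conceptually, before drop_first removes one of them, the dummy columns for the first three rows look like this (M for male, F for female; pandas itself names the columns 'male' and 'female'):
M | F
1 | 0
0 | 1
0 | 1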
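To see what drop_first actually changes, here is a small sketch that runs get_dummies both ways on the first three values. The names `sex_values`, `both`, and `only_male` are made up for this example, and passing `dtype=int` is my assumption: recent pandas versions return True/False by default, so this asks for ones and zeros explicitly.

```python
import pandas as pd

# The first three Sex values from the column shown earlier.
sex_values = pd.Series(["male", "female", "female"], name="Sex")

# One indicator column per distinct value; columns come out in sorted order.
both = pd.get_dummies(sex_values, dtype=int)

# drop_first=True discards the first column ('female'), leaving only 'male'.
only_male = pd.get_dummies(sex_values, drop_first=True, dtype=int)

print(both)
print(only_male)
```

Note that pandas orders the columns alphabetically, so 'female' is the one that gets dropped and 'male' is the one you keep.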
We can then remove the pre-existing Sex column (for example, with the DataFrame's drop method)...
... and replace it with the new sex column of zeros and ones.
titanic = pd.concat([titanic,sex],axis=1)
If you didn't notice, the drop_first=True argument also dropped one of the two columns in the new dataset you created. You don't need both a male and a female column, since the two values are mutually exclusive (you can't be male AND female. Well... in the Titanic times you couldn't be).
So what you end up with is a new column in your main dataset called 'male' that looks like this:
0    1
1    0
2    0
3    0
4    1
5    1
6    1
7    1
8    0
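Putting the steps together, the whole thing might look something like the sketch below. The tiny `titanic` frame is a hypothetical stand-in built from the nine Sex values shown earlier, the `drop` call is my assumption about how the old column gets removed, and `dtype=int` is added so that newer pandas versions give ones and zeros instead of True/False.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data, using the nine Sex values shown earlier.
titanic = pd.DataFrame({
    "PassengerId": range(1, 10),
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female"],
})

# Encode Sex, keeping only the 'male' indicator column.
sex = pd.get_dummies(titanic["Sex"], drop_first=True, dtype=int)

# Remove the original text column (assumed step, see the note above).
titanic = titanic.drop(columns=["Sex"])

# Attach the numeric column side by side with the rest of the data.
titanic = pd.concat([titanic, sex], axis=1)

print(titanic)
```

Because concat with axis=1 lines rows up by index, this works as long as `sex` and `titanic` still share the same index, which they do here since neither was reordered or filtered in between.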
Done like dinner...