
Python: UnicodeEncodeError Messages in CSV File Extraction


I’ve been trying to write some Python code to extract the players and the team they represented in the Bayern Munich/Barcelona match into a CSV file and had much more difficulty than I expected.

I have some scraping code (which is beyond the scope of this article) that gives me a list of (player, team) pairs that I want to write to disk. The contents of the list are as follows:

$ python extract_players.py
(u'Sergio Busquets', u'Barcelona')
(u'Javier Mascherano', u'Barcelona')
(u'Jordi Alba', u'Barcelona')
(u'Bastian Schweinsteiger', u'FC Bayern M\xfcnchen')
(u'Dani Alves', u'Barcelona')

I started with the following script:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team])

And if I run that I’ll see this error:

$ python extract_players.py
...
Bastian Schweinsteiger FC Bayern München <type 'unicode'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 67, in <module>
    writer.writerow([player, team])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)
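
The position reported in that message can be reproduced without the csv module. This is a minimal sketch, assuming the writer falls back to a plain ASCII encode for unicode values, as the traceback suggests. The names `team` and `failed_at` are illustrative and not part of the original script, and the snippet is written to run on Python 3 as well as Python 2.

```python
# The team value from the list above, written with an escape so the source stays ASCII.
team = u"FC Bayern M\xfcnchen"

try:
    team.encode("ascii")
    failed_at = None
except UnicodeEncodeError as error:
    # error.start is the index of the first character ASCII cannot represent.
    failed_at = error.start

print(failed_at)  # 11, the same position as in the traceback
```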

So it looks like the ‘ü’ in ‘FC Bayern München’ is causing us issues. Let’s try and encode the teams to avoid this:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team.encode("utf-8")])

$ python extract_players.py
...
Thomas Müller FC Bayern München <type 'unicode'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 70, in <module>
    writer.writerow([player, team.encode("utf-8")])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 8: ordinal not in range(128)

Now we’ve got the same issue with the ‘ü’ in Müller so let’s encode the players too:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player.encode("utf-8"), team.encode("utf-8")])

$ python extract_players.py
...
Gerard Piqué Barcelona <type 'str'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 70, in <module>
    writer.writerow([player.encode("utf-8"), team.encode("utf-8")])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
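
This time the error is a decode error, and its byte offset can be checked in isolation. The sketch below assumes the scraped name arrived as UTF-8 bytes. `player_bytes` is a hypothetical stand-in for that value, and the explicit ASCII decode mimics what Python 2 does implicitly when `encode` is called on a `str`.

```python
# Hypothetical stand-in for the scraped value: the UTF-8 bytes of "Gerard Piqué".
player_bytes = u"Gerard Piqu\xe9".encode("utf-8")

try:
    player_bytes.decode("ascii")
    bad_byte, bad_position = None, None
except UnicodeDecodeError as error:
    # error.object holds the bytes being decoded; error.start is the failing offset.
    bad_byte = bytearray(error.object)[error.start]
    bad_position = error.start

# bad_byte is 0xc3 and bad_position is 11, matching the traceback.
```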

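Now we’ve got a problem with Gerard Piqué because that value is already a byte string (type str) rather than unicode, and Python 2 tries to decode it as ASCII before it can encode it. Let’s fix that by decoding it ourselves first: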

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        if isinstance(player, str):
            player = unicode(player, "utf-8")
        print player, team, type(player), type(team)
        writer.writerow([player.encode("utf-8"), team.encode("utf-8")])

Et voilà! All the players are now successfully written to the file.
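
The same check could be applied to both columns with a small helper. This is only a sketch: `to_utf8_bytes` and the sample `rows` are hypothetical names, not part of the original script, and it assumes any byte strings coming from the scraper are UTF-8.

```python
def to_utf8_bytes(value, encoding="utf-8"):
    """Return value as UTF-8 bytes, whether it arrived as bytes or as text."""
    if isinstance(value, bytes):
        # Already bytes (str on Python 2): decode first so bad input fails here.
        value = value.decode(encoding)
    return value.encode("utf-8")


# Illustrative rows mixing a byte string and unicode values.
rows = [
    (b"Gerard Piqu\xc3\xa9", u"Barcelona"),
    (u"Thomas M\xfcller", u"FC Bayern M\xfcnchen"),
]
encoded_rows = [[to_utf8_bytes(player), to_utf8_bytes(team)] for player, team in rows]
```

On Python 2 each entry of `encoded_rows` can be passed to `writer.writerow` as it is. On Python 3 the csv writer expects text rather than bytes, so the helper would not be used this way there.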

An alternative approach is to change the default encoding that Python 2 uses for implicit conversions to ‘UTF-8’, which applies to the whole process rather than just this file, like so:

# encoding=utf8
import csv
import sys

# Python 2 only: site.py removes setdefaultencoding at startup, so reload(sys) restores it.
reload(sys)
sys.setdefaultencoding('utf8')

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team])
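
For comparison, here is a sketch of how the same write might look on Python 3, which the script above does not use. There the encoding is given to `open` and the csv writer accepts text directly. The sample `players` list and the temporary `output_path` are assumptions for illustration; the original writes to data/players.csv.

```python
import csv
import os
import tempfile

# Illustrative sample of the scraped (player, team) pairs.
players = [
    (u"Thomas M\xfcller", u"FC Bayern M\xfcnchen"),
    (u"Gerard Piqu\xe9", u"Barcelona"),
]

# Hypothetical output location inside a fresh temporary directory.
output_path = os.path.join(tempfile.mkdtemp(), "players.csv")

# Python 3: the file object encodes text as UTF-8; newline="" leaves line endings to csv.
with open(output_path, "w", encoding="utf-8", newline="") as csv_file:
    writer = csv.writer(csv_file, delimiter=",")
    writer.writerow(["player", "team"])
    writer.writerows(players)

# Read the file back as raw bytes to confirm what was written.
with open(output_path, "rb") as csv_file:
    raw = csv_file.read()
```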

It took me a while to figure it out but finally the players are ready to go!


Topics:
python, big data, unicodeencodeerror

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.
