Python: UnicodeEncodeError Messages in CSV File Extraction

I’ve been trying to write some Python code to extract the players and the team they represented in the Bayern Munich/Barcelona match into a CSV file and had much more difficulty than I expected.

I have some scraping code (which is beyond the scope of this article) that gives me a list of (player, team) pairs that I want to write to disk. The contents of the list are as follows:

$ python extract_players.py
(u'Sergio Busquets', u'Barcelona')
(u'Javier Mascherano', u'Barcelona')
(u'Jordi Alba', u'Barcelona')
(u'Bastian Schweinsteiger', u'FC Bayern M\xfcnchen')
(u'Dani Alves', u'Barcelona')

I started with the following script:

import csv

# `players` is the list of (player, team) pairs produced by the scraping code
with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team])

And if I run that I’ll see this error:

$ python extract_players.py
...
Bastian Schweinsteiger FC Bayern München <type 'unicode'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 67, in <module>
    writer.writerow([player, team])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)
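
The same failure can be reproduced without the csv module. The following is a minimal sketch that assumes only the team name shown in the output above; the variable names are illustrative and not part of the original script. It should run unchanged on Python 2 and Python 3, and shows that the ASCII codec rejects the character at index 11 while UTF-8 accepts it.

```python
team = u"FC Bayern M\xfcnchen"

try:
    # Explicitly request the ASCII codec, which Python 2 otherwise uses implicitly
    team.encode("ascii")
except UnicodeEncodeError as error:
    # error.start is the index of the first character that could not be encoded
    failing_index = error.start
    failing_char = team[failing_index]

# UTF-8 can represent the same character, using two bytes
utf8_bytes = team.encode("utf-8")
```

The index reported by the exception matches the "position 11" in the traceback, which is where the 'ü' sits in the team name.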

So it looks like the ‘ü’ in ‘FC Bayern München’ is causing us issues. Let’s try and encode the teams to avoid this:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team.encode("utf-8")])

$ python extract_players.py
...
Thomas Müller FC Bayern München <type 'unicode'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 70, in <module>
    writer.writerow([player, team.encode("utf-8")])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 8: ordinal not in range(128)

Now we’ve got the same issue with the ‘ü’ in Müller, so let’s encode the players too:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player.encode("utf-8"), team.encode("utf-8")])

$ python extract_players.py
...
Gerard Piqué Barcelona <type 'str'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 70, in <module>
    writer.writerow([player.encode("utf-8"), team.encode("utf-8")])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
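
This time the error is a decode error, even though the code only calls encode. In Python 2, calling encode on a byte string first decodes it implicitly with the default ASCII codec. The sketch below makes that hidden step explicit. It assumes the player name arrived as UTF-8 bytes, which is consistent with the 0xc3 byte in the traceback; the variable names are illustrative.

```python
# UTF-8 bytes, as a Python 2 `str` would hold them (assumed input)
player = b"Gerard Piqu\xc3\xa9"

try:
    # The step Python 2 performs implicitly before encoding a byte string
    player.decode("ascii")
except UnicodeDecodeError as error:
    failing_index = error.start
    failing_byte = player[failing_index:failing_index + 1]

# Decoding with the codec the bytes were actually written in succeeds
decoded = player.decode("utf-8")
```

The failing byte is the first of the two bytes that UTF-8 uses for 'é', at the same position the traceback reports.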

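Now we’ve got a problem with Gerard Piqué because that value is already a byte string (type str) rather than unicode, so calling encode on it makes Python 2 try to decode it as ASCII first. Let’s fix that by converting it to unicode beforehand: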

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        if isinstance(player, str):
            player = unicode(player, "utf-8")
        print player, team, type(player), type(team)
        writer.writerow([player.encode("utf-8"), team.encode("utf-8")])

Et voila! All the players are now successfully written to the file.
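
The type check only covers the player column, but the same thing could happen to a team name. One way to guard both columns is to push the check into a small helper. This is a sketch rather than the code from my script: `to_utf8_bytes` is a hypothetical name, the two rows are sample values taken from the output above, and it assumes any byte strings coming out of the scraper are UTF-8.

```python
def to_utf8_bytes(value, encoding="utf-8"):
    """Return value as UTF-8 bytes, whether it arrives as bytes or as text."""
    if isinstance(value, bytes):
        # Byte string: decode first so wrongly encoded input fails here, loudly
        value = value.decode(encoding)
    return value.encode("utf-8")


# Sample rows mixing byte strings and unicode, as in the scraped data
rows = [
    (b"Gerard Piqu\xc3\xa9", u"Barcelona"),
    (u"Thomas M\xfcller", u"FC Bayern M\xfcnchen"),
]
encoded_rows = [[to_utf8_bytes(player), to_utf8_bytes(team)] for player, team in rows]
```

In the loop above, each row would then be written with writer.writerow([to_utf8_bytes(player), to_utf8_bytes(team)]).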

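An alternative approach is to change Python’s default encoding to ‘UTF-8’ for the whole process. Note that this relies on reload(sys) to bring back a function that Python 2 deliberately removes at startup, and it changes every implicit str/unicode conversion in the program, so treat it as a shortcut rather than a proper fix: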

# encoding=utf8
import csv
import sys

# reload(sys) restores setdefaultencoding, which is removed at startup
reload(sys)
sys.setdefaultencoding('utf8')

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])

    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team])

It took me a while to figure it out but finally the players are ready to go!
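
For anyone running Python 3 rather than the Python 2 used above, none of these workarounds should be needed, because the csv module works with text and open accepts an encoding argument. The following is a rough sketch under that assumption; the sample rows and the temporary file path are illustrative and stand in for the scraped list and data/players.csv.

```python
import csv
import os
import tempfile

# Illustrative sample of the scraped (player, team) pairs
players = [
    ("Gerard Piqu\u00e9", "Barcelona"),
    ("Thomas M\u00fcller", "FC Bayern M\u00fcnchen"),
]

# Temporary location standing in for data/players.csv
path = os.path.join(tempfile.mkdtemp(), "players.csv")

# newline="" lets the csv module control line endings itself
with open(path, "w", encoding="utf-8", newline="") as handle:
    writer = csv.writer(handle, delimiter=",")
    writer.writerow(["player", "team"])
    writer.writerows(players)

# Read the file back to confirm the names survived the round trip
with open(path, encoding="utf-8", newline="") as handle:
    rows = list(csv.reader(handle))
```

Here the encoding is chosen once, when the file is opened, instead of per value.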

Topics:
python, big data, unicodeencodeerror

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.
