First, your Python application is multi-language friendly, right? I mean, I'm as functionally monolinguistic as most Americans, but I love the diversity of languages we have in the world, and appreciate that people really want to use their desktop and applications in their native language. Fortunately, it's not that hard to write good i18n'd Python code, and there are many tools available for helping volunteers translate your application, such as Pootle and Launchpad translations.
Originally Authored by Barry Warsaw
So there really is no excuse not to i18n your Python application. In fact, GNU Mailman has been i18n'd for many years, and pioneered the supporting code in Python's standard library, namely the gettext module. As part of the Mailman 3 effort, I've also written a higher-level library called flufl.i18n which makes it even easier to i18n your application, even in tricky multi-language contexts such as server programs, where you might need to get a German translation and a French translation in one operation, then turn around and get Japanese, Italian, and English for the next operation.
In any case, my colleague was having a problem in a typically simple command line program. What's common about these types of applications is that you fire them up once, they run then exit, and they only have to deal with one language during the entire execution of the program, specifically the language defined in the user's locale. If you read the gettext module's documentation, you'd be inclined to do this at the very start of your application:
from gettext import gettext as _, textdomain
textdomain(my_program_name)
then, you'd wrap translatable strings in code like this:
print _('Here is something I want to tell you')
What gettext does is look up the source string in a translation catalog, returning the text in the appropriate language, which will then be printed. There are some additional details regarding i18n that I won't go into here. If you're curious, ask in the comments, and I'll try to fill things in.
Anyway, if you do write your program the way the gettext documentation suggests, you'll be in for a heap of trouble, as my colleague soon found out. Just running his program with --help in a French locale, he was getting the dreaded UnicodeEncodeError:
"UnicodeEncodeError: 'ascii' codec can't encode character"
I've also seen reports of such errors when trying to send translated strings to a log file (a practice which I generally discourage, since I think log messages usually shouldn't be translated). In any case, I'm here to tell you why the above "obvious" code is wrong, and what you should do instead.
First, why is that code wrong, and why does it lead to the UnicodeEncodeErrors? What might not be obvious from the Python 2 gettext documentation is that gettext.gettext() always returns 8-bit strings (byte strings, in Python 3 terminology), and these 8-bit strings are encoded with the charset defined in the language's catalog file. Now, it's generally best practice in Python to always deal with human readable text using unicodes, converting any bytes to unicode as early as possible. This is traditionally more problematic in Python 2, where English programs can cheat and use 8-bit strings and usually not crash, since their text fits within ASCII and they only ever print to English locales, which are compatible with ASCII. As soon as you use text that's not ASCII though, you're probably going to run into trouble. By using unicodes everywhere, you can generally avoid such problems, and in fact it will make your life much easier when you eventually switch to Python 3.
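To make the bytes-versus-text distinction concrete, here is a small sketch written for Python 3, where the two types cannot be mixed by accident. The French word and the UTF-8 charset are assumptions for illustration; a real catalog declares its own charset.

```python
# Text as a human reads it (a unicode string): "Voilà".
text = u'Voil\xe0'

# What a UTF-8 catalog would store, and what Python 2's gettext.gettext()
# would hand back: the encoded bytes.
raw = text.encode('utf-8')

# Decoding with the catalog's charset recovers the text.
decoded = raw.decode('utf-8')

# Forcing non-ASCII text through the ASCII codec is the step that fails.
try:
    text.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
```

In Python 2 an encode or decode step like this can happen implicitly, using the ASCII codec, whenever 8-bit strings and unicodes are mixed. That is why the error tends to show up far from the code that caused it.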
So the 8-bit strings that gettext.gettext() hands you will really just hurt you, and to avoid the pain, you'd want to convert them back to unicodes before you print them to stdout or a log file. However, converting to unicodes makes the i18n APIs much less convenient, so no one does it until there's way too much code to fix.
What you really want in Python 2 is something like this:
from gettext import ugettext as _
which you'd think you should be able to do, the "u" prefix meaning "give me unicode". But for reasons I can only describe as based on our misunderstandings of unicode and i18n back in the days when this module was originally written, you can't actually do that, because ugettext() is not exposed as a module-level function. It is available in the class-based API, but that's a more advanced API that again almost no one uses. Sadly, it's too late to fix this in Python 2. The good news is that in Python 3 it is fixed, not by exposing ugettext(), but by changing the most commonly used gettext module APIs to return unicode strings directly, as they always should have done. In Python 3, the obvious code just works:
from gettext import gettext as _
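A quick way to convince yourself of the Python 3 behavior is the sketch below. The domain name 'my_program_name' is a placeholder, and the sketch assumes no catalog is installed for it, in which case the source string comes back unchanged.

```python
import gettext
from gettext import gettext as _

# Placeholder domain; no catalog needs to exist for this check.
gettext.textdomain('my_program_name')

message = _('Here is something I want to tell you')

# On Python 3 the result is text (str), never encoded bytes.
is_text = isinstance(message, str)
```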
What can you do in Python 2 then? Here's what you should use instead of the two lines of code at the beginning of this article:
_ = gettext.translation(my_program_name).ugettext
and now you can wrap all your translatable strings in _('Foo') and it should Just Work. Or you can use the flufl.i18n API, which always uses ugettext and always returns unicode strings.
The one-liner above was in fact the solution my colleague implemented to fix his bug, so I didn't really help him much. I just explained why his fix was the correct one and the original code was buggy. At least now you know too!
(The fact that this works correctly in Python 3 is yet another reason to switch to Python 3!)
Aside: It's really fun to install a desktop in a language you cannot read. Fortunately, French, like other Latin-derived languages, has enough familiarity that I can mostly get by with the standard Ubuntu installer. It also doesn't hurt that I've done about a bazillion installs of Ubuntu Oneiric (due out tomorrow!) so I pretty much know what each screen and prompt means even without going to Google Translate.
Also interesting was that I could never reproduce the crash when ssh'd into the French locale VM. It would only crash for me when I was logged into a terminal on the VM's graphical desktop. The only difference between the two that I could tell was that in the desktop's terminal, locale(1) returned French values (e.g. fr_FR.UTF-8) for everything, but in the ssh console, it returned the French values for everything except the LC_CTYPE environment variable. For the life of me, I could not get LC_CTYPE set to anything other than en_US.UTF-8 in the ssh context, so the reproducible test case would just return the English text, and not crash. This happened even if I explicitly set that environment variable either as a separate export command in the shell, or as a prefix to the normally crashing command. Maybe there's something in ssh that causes this, but I couldn't find it.