{{announcement.body}}
{{announcement.title}}

Projecting Unicode to ASCII

DZone 's Guide to

Projecting Unicode to ASCII

If you should ever need to, say, convert from Unicode to ASCII, here's how you can do so using Python.

· Open Source Zone ·
Free Resource

Sometimes you need to downgrade Unicode text to more restricted ASCII text. For example, while working on my previous post, I was surprised that there didn’t appear to be an asteroid named after Poincaré. There is one, but it was listed as Poincare in my list of asteroid names.

Python Module

I used the Python module unidecode to convert names to ASCII before searching, and that fixed the problem. Here’s a small example showing how the code works.

    import unidecode

    for x in ["Poincaré", "Gödel"]:
        print(x, unidecode.unidecode(x))

This produces:

    Poincaré Poincare
    Gödel Godel

Installing the unidecode module also installs a command line utility by the same name. So you could, for example, pipe text to that utility.

Projections

I titled this post “Projecting Unicode to ASCII” because this code is a projection in the mathematical sense. A projection is a function P such that for all inputs x,

P(P(x) ) = P(x).

That is, applying the function twice does the same thing as applying the function once. The name comes from projection in the colloquial sense, such as projecting a three-dimensional object onto a two-dimensional plane. An equivalent term is to say Pis idempotent. [1]

The unidecode function maps the full range of Unicode characters into the range 0x00 to 0x7F, and if you apply it to a character already in that range, the function leaves it unchanged. So the function is a projection, or you could say the function is idempotent.

Projection is such a simple condition that it hardly seems worth giving it a name. And yet it is extremely useful. A general principle in user interface to design is to make something a projection if the user expects it to be a projection. Users probably don’t have the vocabulary to say “I expected this to be a projection” but they’ll be frustrated if something is almost a projection but not quite.

For example, if software has a button to convert an image from color to grayscale, it would be surprising if (accidentally) clicking button a second time had any effect. It would be unexpected if it returned the original color image, and it would be even more unexpected if it did something else, such as keeping the image in grayscale but lowering the resolution.

Related Posts

[1] The term “idempotent” may be used more generally than “projection,” the latter being more common in linear algebra. Some people may think of a projection as linear idempotent function. We’re not exactly doing linear algebra here, but people do think of portions of Unicode geometrically, speaking of “planes.”

Topics:
open source ,python ,unicode ,ascii ,projections ,code

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}