UTF-8 in HTTP Headers

When working with UTF-8, we can use the encoding described in RFC 6266 for "Content-Disposition" and transliterate it to Latin in other cases.

Stan Surov

Nov. 26, 18 · Tutorial


HTTP/1.1 is a well-known hypertext transfer protocol. HTTP messages are encoded with ISO-8859-1 (which can loosely be regarded as an extended version of ASCII that adds umlauts, diacritics, and other characters of Western European languages). The message body, however, can use a different encoding, declared in the "Content-Type" header. But what should we do if we need non-ASCII characters not in the message body but in a header? Probably the best-known case is setting a filename in the "Content-Disposition" header. It seems like a common task, but its implementation isn't obvious.

TL;DR: Use the encoding described in RFC 6266 for "Content-Disposition" and transliterate values to Latin characters in other cases.

Introduction to Encodings

The article mentions the US-ASCII (usually called just ASCII), ISO-8859-1, and UTF-8 encodings. Here is a short introduction to them. This section is for developers who rarely work with these encodings, or don't use them at all, and have partially forgotten them.

ASCII is a simple encoding with 128 characters, including the Latin alphabet, digits, punctuation marks, and control characters.

7 bits are enough to represent any ASCII character. The word "test" in hex representation looks like this: 0x74 0x65 0x73 0x74. The first bit of every character is always 0, because the encoding has only 128 characters while a byte gives 2^8 = 256 variants.
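As a quick check of the numbers above, here is a minimal Python sketch that prints the bytes of the word "test" and confirms that the most significant bit of each byte is 0. The variable names are only illustrative.

```python
# Encode a short ASCII string and look at its bytes.
data = "test".encode("ascii")

hex_view = " ".join(f"0x{b:02X}" for b in data)
print(hex_view)  # 0x74 0x65 0x73 0x74

# Shifting right by 7 leaves only the most significant bit of each byte.
high_bits = [b >> 7 for b in data]
print(high_bits)  # [0, 0, 0, 0]
```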

ISO-8859-1 is an encoding aimed at Western European languages. It has diacritics, German umlauts, etc.

The encoding has 256 characters, so each one can be represented with 1 byte. The first half (128 characters) fully matches ASCII. Hence, if the first bit is 0, it's an ordinary ASCII character. If it's 1, it is a character specific to ISO-8859-1.

UTF-8 is one of the best-known encodings, alongside ASCII. It is capable of encoding 1,112,064 characters. Each character takes from 1 to 4 bytes (earlier versions of the encoding allowed up to 6 bytes).

The program processing this encoding checks the leading bits of the first octet to determine the character size in bytes. If the octet begins with 0, the character is represented by 1 byte; 110 means 2 bytes, 1110 means 3 bytes, and 11110 means 4 bytes.
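The rule for the first octet can be sketched as a small Python function. The name utf8_sequence_length is hypothetical, and the function only inspects the bit prefix; a real decoder also validates the continuation octets and rejects some lead bytes that this sketch accepts.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the sequence length implied by the prefix of the first octet."""
    if lead_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx: a single-byte (ASCII) character
    if lead_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation octet and cannot start a character.
    raise ValueError(f"0x{lead_byte:02X} is not a valid UTF-8 lead byte")


for ch in ["t", "ü", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, utf8_sequence_length(encoded[0]), len(encoded))
```

For each sample character, the length derived from the first octet matches the actual number of bytes produced by the encoder.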

Just as with ISO-8859-1, the first 128 characters match ASCII. That's why texts that use only ASCII characters look exactly the same in binary representation, regardless of the encoding used: US-ASCII, ISO-8859-1, or UTF-8.
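A short Python sketch of this property: an ASCII-only string produces identical bytes in all three encodings, while a character outside ASCII (here "ü", chosen only as an example) does not.

```python
text = "test"
encodings = ["ascii", "iso-8859-1", "utf-8"]

# Encode the same ASCII-only text with each encoding.
encoded = {name: text.encode(name) for name in encodings}
all_equal = len(set(encoded.values())) == 1
print(all_equal)  # True

# A non-ASCII character is one byte in ISO-8859-1 but two bytes in UTF-8.
umlaut_latin1 = "ü".encode("iso-8859-1")
umlaut_utf8 = "ü".encode("utf-8")
print(umlaut_latin1, umlaut_utf8)  # b'\xfc' b'\xc3\xbc'
```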

UTF-8 in the Message Body

Before we proceed to headers, let’s see how UTF-8 is used in message bodies. For that purpose, we'll use the "Content-Type" header.

If "Content-Type" doesn't declare an encoding, a browser must process the message as if it were in ISO-8859-1. A browser shouldn't try to guess the encoding, and it certainly shouldn't ignore "Content-Type". So, if we transfer UTF-8 messages but do not declare the encoding in the headers, they will be read as if they were encoded with ISO-8859-1.
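To illustrate what "read as ISO-8859-1" means in practice, here is a minimal Python sketch. The sample text is arbitrary; the point is that the same bytes give different strings depending on which encoding the reader assumes. A header such as `Content-Type: text/plain; charset=utf-8` is what tells the reader to pick the second interpretation.

```python
# A UTF-8 encoded body, as a server might send it.
body = "Привет".encode("utf-8")

# What a client sees if it assumes ISO-8859-1: every byte becomes
# its own character, so the text turns into mojibake.
misread = body.decode("iso-8859-1")

# What the client sees if the encoding is declared as UTF-8.
correct = body.decode("utf-8")

print(len(body), len(misread), len(correct))  # 12 12 6
print(correct)  # Привет
```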

Entering a UTF-8 Message in a Header’s Value

With a message body, everything is rather simple. The body always follows the headers, so there are no technical problems here. But what should we do with headers? The specification states directly that the order of headers in a message doesn't matter; i.e., it's not possible to declare the encoding of one header via another.

What if we just write a UTF-8 value into a header? We saw that when this trick is applied to a message body, the value is read as ISO-8859-1, so we might assume that the same will happen to a header. But no, it won't. In fact, in most cases this approach works: it worked on old iPhones, IE11, Firefox, and Google Chrome. Of all the browsers I had at hand when writing this article, the only one that refused to read such a header was Edge.

Such behavior is not described in the specifications. The browser developers probably decided to make life easier for other developers and detect UTF-8 in headers automatically. Generally speaking, it's a simple task: check the first bit of an octet; if it's 0, it's ASCII, and if it's 1, it's probably UTF-8.

Is there any overlap with ISO-8859-1 in that case? Actually, almost none. Let's use a UTF-8 character of 2 octets as an example (Russian letters are represented by two octets). In binary representation, the character looks like this: 110xxxxx 10xxxxxx. In hex representation: [0xC0-0xDF] [0x80-0xBF]. In ISO-8859-1, such pairs of characters can hardly be used to express something sensible. Therefore, there's very little chance that a browser will decode a message incorrectly.
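One way such detection could work is sketched below in Python: try strict UTF-8 first and fall back to ISO-8859-1 if the bytes are not valid UTF-8. This is an assumption about the general idea, not the actual logic of any browser, and the function name guess_decode is hypothetical.

```python
def guess_decode(raw: bytes) -> str:
    """Decode header bytes as UTF-8 if they are valid, else as ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails.
        return raw.decode("iso-8859-1")


print(guess_decode("файл.txt".encode("utf-8")))   # файл.txt
print(guess_decode("café".encode("iso-8859-1")))  # café
print(guess_decode(b"test"))                      # test
```

The ISO-8859-1 sample falls through to the fallback because a lone 0xE9 byte announces a three-octet UTF-8 sequence that never arrives, which is exactly why misdetection is unlikely.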

However, you can run into problems with this method: your web server or framework may simply forbid writing UTF-8 characters into a header. For example, Apache Tomcat writes 0x3F (a question mark) instead of such characters. Of course, this restriction can be circumvented, but if an application slaps you on the wrist and doesn't let you do something, then you probably shouldn't do it.

Whether your framework or server forbids it or lets you write UTF-8 in a header, I don't recommend doing that. It's not a documented solution and can stop working in browsers at any moment.
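The visible effect of such a replacement can be reproduced in Python by encoding a value to ISO-8859-1 with the "replace" error handler. This only imitates the outcome; it is not how Tomcat implements it, and the filename is an arbitrary example.

```python
value = "файл.txt"

# Characters that ISO-8859-1 cannot represent are replaced with "?" (0x3F).
header_bytes = value.encode("iso-8859-1", errors="replace")
print(header_bytes)  # b'????.txt'
```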

Transliteration

As I see it, transliteration is a better solution. Many popular Russian resources don't mind using transliteration in filenames. It's a reliable solution that won't break with a new browser release and doesn't need to be tested separately on each platform. You should, however, think about how to handle the whole range of characters that may occur, which isn't so easy. For example, if an application is aimed at a Russian audience, a filename can still contain the Tatar letters ә and ң, which should be handled somehow, not just replaced with "?".
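A minimal transliteration sketch in Python might look like the following. The mapping table is a hypothetical, simplified one (including the choices for ә and ң) and does not follow any particular transliteration standard; a real application would need a table agreed on for its audience.

```python
# Hypothetical, simplified Cyrillic-to-Latin table (not a formal standard).
TABLE = dict(zip("абвгдезийклмнопрстуфыэ", "abvgdeziyklmnoprstufye"))
TABLE.update({
    "ё": "yo", "ж": "zh", "х": "kh", "ц": "ts", "ч": "ch",
    "ш": "sh", "щ": "shch", "ъ": "", "ь": "", "ю": "yu", "я": "ya",
    "ә": "a", "ң": "n",  # Tatar letters, mapped here by assumption
})


def transliterate(name: str) -> str:
    """Replace known letters via TABLE, keep ASCII, mask everything else."""
    out = []
    for ch in name:
        latin = TABLE.get(ch.lower())
        if latin is not None:
            out.append(latin.capitalize() if ch.isupper() else latin)
        elif ch.isascii():
            out.append(ch)
        else:
            out.append("_")  # unknown character: visible placeholder, not "?"
    return "".join(out)


print(transliterate("Отчёт.txt"))  # Otchyot.txt
print(transliterate("файл.txt"))   # fayl.txt
```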

RFC 2047

As I've already mentioned, Tomcat didn't let me write UTF-8 into a message header. Is this restriction mentioned in the Javadocs for servlets? Yes, it is: the documentation refers to RFC 2047. I tried to encode messages using this format, but the browser didn't get the idea. This encoding method doesn't work for HTTP anymore, but it used to. For example, there is a Firefox ticket about removing support for this encoding.
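For reference, this is roughly what an RFC 2047 "encoded-word" looks like. The sketch uses Python's standard email.header module to produce and parse one; the filename is an arbitrary example, and nothing here implies that browsers accept this format in HTTP headers.

```python
from email.header import Header, decode_header

# Produce an RFC 2047 encoded-word of the form =?charset?encoding?data?=
encoded_word = Header("файл.txt", charset="utf-8").encode()
print(encoded_word)

# Parse it back: decode_header returns (bytes, charset) pairs.
raw, charset = decode_header(encoded_word)[0]
round_trip = raw.decode(charset)
print(round_trip)  # файл.txt
```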

RFC 6266

The ticket mentioned above says that even after support for RFC 2047 is dropped, there is still a way to transfer UTF-8 values in the names of downloaded files: RFC 6266. From my point of view, this is the optimal and correct solution today. Many popular internet resources use it. We at the CUBA Platform also use this RFC to generate "Content-Disposition".

RFC 6266 is a specification describing the use of the "Content-Disposition" header. The encoding itself is described in detail in RFC 8187.

The "filename" parameter contains the name of the file in ASCII, and "filename*" contains it in any other necessary encoding. If both parameters are present, "filename" is ignored by all modern browsers (including IE11 and old Safari versions). The oldest browsers, on the contrary, ignore "filename*".

In this encoding method, the parameter value starts with the name of the encoding, followed by two apostrophes ('') and then the encoded value. Visible ASCII characters permitted by the specification don't require encoding. Other characters are written as their octets in hex representation, with "%" before each octet.
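The scheme described above can be sketched in Python as follows. The function name content_disposition and its parameters are hypothetical. The sketch percent-encodes more characters than RFC 8187 strictly requires (urllib's quote leaves only letters, digits, and "_.-~" unescaped), which is still a valid encoding.

```python
from urllib.parse import quote


def content_disposition(filename: str, ascii_fallback: str) -> str:
    """Build a Content-Disposition value with both filename and filename*."""
    # Keep the plain "filename" parameter strictly ASCII and free of
    # characters that would need escaping inside a quoted string.
    fallback = ascii_fallback.encode("ascii", errors="replace").decode("ascii")
    fallback = fallback.replace("\\", "_").replace('"', "_")

    # Percent-encode the UTF-8 octets of the real name.
    encoded = quote(filename, safe="", encoding="utf-8")

    # filename* = charset, two apostrophes (empty language tag), encoded value.
    return ('attachment; filename="' + fallback + '"; '
            "filename*=UTF-8''" + encoded)


print(content_disposition("файл.txt", "fayl.txt"))
# attachment; filename="fayl.txt"; filename*=UTF-8''%D1%84%D0%B0%D0%B9%D0%BB.txt
```

Sending both parameters lets modern browsers pick "filename*" while the oldest ones fall back to the ASCII "filename".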

What Should Be Done With Other Headers?

The encoding described in RFC 8187 isn't generic. Of course, you can add a parameter with the * suffix to any header, and it would probably work in some browsers, but the specification says not to do so.

Currently, wherever UTF-8 is supported in a header, the specification for that header refers directly to the relevant RFC. Apart from "Content-Disposition", this encoding is used, for example, in Web Linking and Digest Access Authentication.

Keep in mind that the standards in this area are constantly changing. The use of the above-mentioned encoding in HTTP was proposed only in 2010, and its use in "Content-Disposition" itself was standardized in 2011. Although these standards are only at the "Proposed Standard" stage, they are supported everywhere. It is quite possible that new standards will appear that let us work with various encodings in headers more consistently. So we just need to follow the news about HTTP standards and their level of support in browsers.


Published at DZone with permission of Stan Surov. See the original article here.

