DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Last call! Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Leetcode: Improving String Performance in Swift
  • RION - A Fast, Compact, Versatile Data Format

Trending

  • Why High-Performance AI/ML Is Essential in Modern Cybersecurity
  • Unmasking Entity-Based Data Masking: Best Practices 2025
  • Docker Base Images Demystified: A Practical Guide
  • AI-Based Threat Detection in Cloud Security

UTF-8 in HTTP Headers

When working with UTF-8, we can use the encoding described in RFC 6266 for "Content-Disposition" and transliterate it to Latin in other cases.

By 
Stan Surov user avatar
Stan Surov
·
Nov. 26, 18 · Tutorial
Likes (5)
Comment
Save
Tweet
Share
64.6K Views

Join the DZone community and get the full member experience.

Join For Free

HTTP 1.1 is a well-known hypertext protocol for data transfer. HTTP messages are encoded with ISO-8859-1 (which can be nominally considered as an enhanced ASCII version, containing umlauts, diacritic and other characters of West European languages). At the same time, the message body can use another encoding assigned in "Content-Type" header. But what shall we do if we need to assign non-ASCII characters not in the message bodies, but in the header? Probably the most well-known case is setting a filename in the "Content-Disposition" header. It seems like a common task, but its implementation isn’t obviou

TL;DR: Use the encoding described in RFC 6266 for "Content-Disposition" and transliterate it to Latin in other cases.

Introduction to Encodings

The article mentions US-ASCII (usually named just ASCII), ISO-8859-1, and UTF-8 encodings. Here is a small introduction to these encodings. This paragraph is for developers who rarely work with these encodings or doesn’t use them at all, and have partially forgotten them.

ASCII is a simple encoding with 128 characters, including Latin alphabet, digits, punctuation marks, and utility characters.

7 bits is enough to represent any ASCII character. The word "test" in HEX representation would look like: 0x74 0x65 0x73 0x74. The first bit of any character would always be 0, as the encoding has 128 characters, and a bite gives 2^8 = 256 variants.

ISO-8859-1 is an encoding aimed at West European languages. It has diacritics, German umlauts, etc.

The encoding has 256 characters, thus, can be represented with 1 byte. The first half (128 characters) fully matches ASCII. Hence, if the first bit = 0, it’s a usualy an ASCII character. If it’s 1, we recognize a specific ISO-8859-1 character.

UTF-8 is one of the most famous encodings alongside with ASCII. It is capable of encoding 1,112,064 characters. Each character size is varied from 1 to 4 bites (previously the values could be up to 6 bites).

The program processing this encoding checks the first bit and estimates the character size in bytes. If an octet begins with 0, the character is represented by 1 byte. 110 - 2 bytes, 1110 - 3 bytes, 11110 - 4 bytes.

Just like in case with ISO-8859-1, the first 128 will match ASCII. That’s why texts using ASCII characters will look absolutely the same in binary representation, regardless of the encoding used: US-ASCII, ISO-8859-1 or UTF-8.

UTF-8 in the Message Body

Before we proceed to headers, let’s see how UTF-8 is used in message bodies. For that purpose, we'll use the "Content-Type" header.

If "Content-Type" isn’t assigned, a browser must process messages as if they were in ISO-8859-1. A browser shouldn't try to guess the encoding, and it certainly shouldn’t ignore "Content-Type." So, if we transfer UTF-8 messages, but do not assign encoding in the headers, they will be read as if they were encoded with ISO-8859-1.

Entering a UTF-8 Message in a Header’s Value

In case of a message body, everything’s rather simple. A message body always follows a header, so there’re no technical problems here. But what shall we do with headers? The specification claims directly that the order of headers in the messages doesn’t matter; i.e., it’s not possible to assign an encoding for one header via another.

What if we just write a UTF-8 value into a header? We saw that with such a trick, applied to a message body, the value will be read as ISO-8859-1. Therefore, we can assume that the same will happen to a header. But no, it won’t. In fact, for most, maybe even for all, cases such a solution will work out. Such cases include old iPhones, IE11, Firefox, and Google Chrome. The only browser refusing to read such a header, of all the browsers I had when writing this article, was Edge.

Such behavior is not described in specifications. The browsers' developers probably decided to go easy on other developers and detect the UTF-8 encoding of messages automatically. Generally speaking, it’s a simple task. Check the first bit: if it’s 0, then it’s ASCII, if it's 1, then it’s probably UTF-8.

Is there anything in common with ISO-8859-1, in that case? Actually, almost none. Let’s use a UTF-8 character of 2 octets as an example (Russian letters are represented by two octets). In binary representation, the character will look the following: 110xxxxx 10xxxxxx. In HEX representation: [0xC0-0x6F] [0x80-0xBF]. In ISO-8859-1 those symbols can hardly be used to express something sensible. Therefore, there’s very little chance that a browser will decode a message wrong.

However, you can face some problems when trying to use this method: your web server or framework can simply forbid writing UTF-8 characters into a header. For example, Apache Tomcat enters 0x3F (question mark) instead of UTF-8 symbols. Of course, this restriction can be circumvented, but if an application slaps you on the wrist and doesn’t let you do something, then you probably shouldn’t do it.

No matter if your framework or server forbids or lets you write UTF-8 messages in a header, I don’t recommend doing that. It’s not a documented solution and can stop working in browsers at any moment.

Transliteration

As I see it, transliteration is a better solution. Many popular Russian resources don’t mind using transliteration in filenames. It’s a guaranteed solution which wouldn’t break with new a browser version release and doesn't need to be tested separately on each platform. Though, you should certainly think of a way of transferring all the possible ranges of characters, which isn't so easy. For example, if an application is aimed at a Russian audience, the filename can contain Tatar letters ә and ң, which should be somehow handled, not just replaced with "?".

RFC 2047

As I’ve already mentioned, Tomcat didn’t let me enter UTF-8 in a message header. Is this feature mentioned in Javadocs for servlets? Yes, it is:

It mentions RFC 2047. I tried to encode messages using this format — the browser didn’t get the idea. This encoding method doesn’t work for HTTP anymore, but it used to. For example, here's a ticket on deleting this encoding support from Firefox.

RFC 6266

The ticket mentioned above says, that even after the support of RFC 2047 stops, there’s still a way to transfer UTF-8 values in downloaded files names: RFC 6266. From my point of view, today, this is the optimal and correct decision. Many popular internet resources use it. We here at the CUBA Platform also use this RFC for "Content-Disposition" generation.

RFC 6266 is a specification describing the use of “Content-Disposition” header. The encoding itself is closely described in RFC 8187.

Parameter “filename” contains the name of the file in ASCII, “filename*” in any other necessary encoding. If both attributes are defined, “filename” is ignored in all modern browsers (including IE11 and old Safari versions). The oldest browsers, on the contrary, ignore “filename*”.

In this encoding method, first, you assign the encoding in a parameter, after "" comes the encoded value. Observed characters of ASCII don’t require encoding. Other characters are just written in hex representation with "%" before each octet.

What Should Be Done With Other Headers?

Encoding described in RFC 8187 isn’t generic. Of course, you can enter a parameter with the * prefix in a header, and it would probably work for some browsers, but specification dictates not to do so.

Currently, in each case, where UTF-8 is supported in headers, there’s a direct mention of the relevant RFC. Apart from "Content-Disposition," this encoding is used, for example, in Web Linking and Digest Access Authentication.

It should be taken into account that the standards of this area are constantly changing. The usage of above-mentioned encoding in HTTP was offered only in 2010. The usage of this encoding in the very "Content-Disposition" was committed in the standard of 2011. Despite the fact that these standards are only on "Proposed Standard" stage, they are supported everywhere. It is quite possible that new standards are going to appear, the standards that will allow us to work with various encodings in headers more consistently. So, we only need to follow the news about the HTTP standards and their support level on the browser side.

UTF-8

Published at DZone with permission of Stan Surov. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Leetcode: Improving String Performance in Swift
  • RION - A Fast, Compact, Versatile Data Format

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!