DZone
Web Dev Zone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
  • Refcardz
  • Trend Reports
  • Webinars
  • Zones
  • |
    • Agile
    • AI
    • Big Data
    • Cloud
    • Database
    • DevOps
    • Integration
    • IoT
    • Java
    • Microservices
    • Open Source
    • Performance
    • Security
    • Web Dev
DZone > Web Dev Zone > Forget ISO-8859-1

Forget ISO-8859-1

Let’s forget ASCII and ISO-8859-1. It’s not even okay to call them ''legacy'' after 24 years of UTF-8. After 24 years they should’ve died.

Bozhidar Bozhanov user avatar by
Bozhidar Bozhanov
·
Jan. 19, 17 · Web Dev Zone · Opinion
Like (17)
Save
Tweet
5.29K Views

Join the DZone community and get the full member experience.

Join For Free

UTF-8 was first presented in 1993. One would assume that 24 years is enough to time for it to become ubiquitous, especially given that the Internet is global. ASCII doesn’t even cover French letters, not to mention Cyrillic or Devanagari (the Hindi script). That’s why ASCII was replaced by ISO-8859-1, which kind of covers most of western languages’ orthographies.

88.3% of the websites use UTF-8. That’s not enough, but let’s assume these 11.7% do not accept any input and are just English-language static websites. The problem of the still-pending adoption of UTF-8 is how entrenched ASCII/ISO-8859-1 is. I’ll try to give a few examples:

  • UTF-8 isn’t the default encoding in many core Java classes. FileReader, for example. It’s similar for other languages and runtimes. The default encoding of these Java classes is the JVM default, which is most often ISO-8859-1. It is allegedly taken from the OS, but I don’t remember configuring any encoding on my OS. Just locale, which is substantially different.
  • Many frameworks, tools, and containers don’t use UTF-8 by default (and don’t try to remedy the JVM not using UTF-8 by default). Tomcat’s default URL encoding I think is still ISO-8859-1. Eclipse doesn’t make the files UTF-8 by default (on my machine it’s somethings even windows-1251 (Cyrillic), which is horrendous). And so on. I’ve asked for having UTF-8 as default in the past, and I repeat my call
  • Regex examples and tutorials always give you the [a-zA-Z0-9]+ regex to “validate alphanumeric input”. It is built-in in many validation frameworks. And it is so utterly wrong. This is a regex that must never appear anywhere in your code, unless you have a pretty good explanation. Yet, the example is ubiquitous. Instead, the right regex is [\p{L}0-9]+. Using the wrong regex means you won’t be able to accept any special character. Which is something you practically never want. Unless, probably, due to the next problem.
  • Browsers have issues with UTF-8 URLs. Why? It’s complicated. And it almost works when it’s not part of the domain name. Almost, because when you copy the URL, it gets screwed (pardon me – encoded).
  • Microsoft Excel doesn’t work properly with UTF-8 in CSV. I was baffled to realize the UTF-8 CSVs become garbage. Well, not if you have a BOM (byte order mark), but come on, it’s [the current year].

As Jon Skeet rightly points out – we have issues with the most basic data types – strings, numbers, and dates. This is partly because the real world is complex. And partly because we software engineers tend to oversimplify it. This is what we’ve done with ASCII and other latin-only encodings. But let’s forget ASCII and ISO-8859-1. It’s not even okay to call them “legacy” after 24 years of UTF-8. After 24 years they should’ve died.

Let’s not give regex examples that don’t work with UTF-8, let’s not assume any default different than UTF-8 is a good idea and let’s sort the URL mess.

Maybe I sound dogmatic. Maybe I exaggerate because my native script is non-latin. But if we want our software to be global (and we want that, in order to have a bigger market), then we have to sort our basic encoding issues. Having UTF-8 as a standard is not enough. Let’s forget ISO-8859-1.

UTF-8

Published at DZone with permission of Bozhidar Bozhanov, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Top 7 Features in Jakarta EE 10 Release
  • When Writing Code Isn't Enough: Citizen Development and the Developer Experience
  • 12 Modern CSS Techniques For Older CSS Problems
  • Is DataOps the Future of the Modern Data Stack?

Comments

Web Dev Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • MVB Program
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends:

DZone.com is powered by 

AnswerHub logo