An Insight Into Unicode, UTF-8, and Their Usage


In this article, we look at Unicode and UTF-8 and how they are used internationally.


Computers have come a long way since their creation. Today we are surrounded by devices that are operated and programmed through some kind of programming language. The text these programs handle is encoded using an international standard that assigns a value to each letter, digit, or symbol, so that the same text can be used across different programs and platforms.

Computer hardware and software both work with a defined list of characters, and each character is identified by a number. If you want to understand how this numbers game works, this is the article for you.

Unicode: Unicode was proposed in the late 1980s. The idea was to assign a unique number, called a code point, to every letter of every language, which requires far more than 256 slots. At the time of writing it defined about 110,000 characters, and later versions of the standard have added more. Its first 128 code points are identical to ASCII. Code points 128-255 cover currency symbols, accented characters, and other common signs. From 256 onward there are more accented characters. At code point 880 the Greek letters begin, followed by Cyrillic, Hebrew, Arabic, and so on. The benefit is that there is no ambiguity, because each letter is denoted by its own unique number.
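To make the numbering concrete, here is a minimal Python sketch that looks up the code points of a few sample characters with the built-in `ord()`. The helper name `code_point_table` and the choice of sample characters are illustrative assumptions, not part of any standard API.

```python
# Sample characters, written as escapes so the source file stays pure ASCII.
samples = [
    "A",       # Latin capital A (ASCII range)
    "\u00e9",  # e with acute accent (128-255 range)
    "\u0100",  # A with macron (first code point past 255)
    "\u0370",  # first code point of the Greek block
    "\u042f",  # Cyrillic capital Ya
    "\u05d0",  # Hebrew letter Alef
    "\u0628",  # Arabic letter Beh
]


def code_point_table(chars):
    """Return (character, decimal code point, U+XXXX notation) tuples."""
    return [(ch, ord(ch), f"U+{ord(ch):04X}") for ch in chars]


for ch, number, notation in code_point_table(samples):
    print(ch, number, notation)
```

Running it shows the ranges described above: `A` is 65, the accented `e` is 233, and the Greek block starts at 880.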

Unicode is not a code page, and its code points do not fit into 8 bits or even 16 bits. Although only 110,116 code points were assigned at the time of writing, Unicode can define up to 1,114,112 code points (numbered 0 through 1,114,111), which takes 21 bits. Today's computers are far more advanced than the computers of the 1970s with their 8-bit microprocessors. Modern computers have 64-bit processors, so it has become easy to move beyond 8-bit characters to 32-bit characters.
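The 21-bit figure can be checked with a little arithmetic. The sketch below assumes the highest code point is `0x10FFFF` (1,114,111 in decimal); the constant and variable names are illustrative.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode allows

# Code points are numbered from 0, so the total count is one more than the maximum.
code_space_size = MAX_CODE_POINT + 1

# Number of bits needed to store the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

print(code_space_size, bits_needed)

# Python's chr() accepts the whole range and rejects anything beyond it.
last_character = chr(MAX_CODE_POINT)
try:
    chr(MAX_CODE_POINT + 1)
    beyond_range_rejected = False
except ValueError:
    beyond_range_rejected = True
```

Since 2^20 is 1,048,576 and 2^21 is 2,097,152, 20 bits are not quite enough and 21 bits are.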

Unicode characters can also be used in a network's SSID, although this is not guaranteed to work with all home wireless routers and hardware.

Much of today's software is written in C and C++, which provide a wide character type called wchar_t (32 bits wide on many platforms, though only 16 bits on Windows). Modern web browsers use Unicode internally. A 32-bit character can take more than 4 billion distinct values, far more than Unicode needs. According to Inceva, a Bangkok-based website design agency, most modern browsers are built to deal with Unicode internally, but developers still have to move data from a web server to the browser and back again, and therefore need to save it in a file or a database somewhere.

UTF-8: If browsers can handle Unicode internally with wide characters, why is UTF-8 needed? The problem lies in sending, receiving, reading, and writing characters. A great deal of older software and many protocols still read, write, send, and receive 8-bit characters, and storing or sending English text as 32-bit characters would quadruple the space or bandwidth required. This is the need UTF-8 fills. It is a clever scheme: byte values 0-127 are treated as ASCII, values 192-247 act as shift keys that start a multi-byte character, and values 128-191 are the bytes being shifted, which continue that character.
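The following Python sketch illustrates both points: the fourfold size difference for English text, and the three byte ranges. The function `classify_byte` is a hypothetical helper written for this example and follows the ranges given in the text. As a caveat, the current UTF-8 specification (RFC 3629) permits only 194-244 as lead bytes, so this is a simplification rather than a validator.

```python
def classify_byte(value):
    """Label one byte of UTF-8 data using the ranges described in the text."""
    if value <= 127:
        return "ascii"         # 0-127: a complete ASCII character
    if value <= 191:
        return "continuation"  # 128-191: continues a multi-byte character
    if value <= 247:
        return "lead"          # 192-247: starts a multi-byte character
    return "invalid"           # 248-255: never appears in UTF-8


english = "Hello, world"
utf8_size = len(english.encode("utf-8"))       # one byte per ASCII character
utf32_size = len(english.encode("utf-32-le"))  # four bytes per character, no byte order mark

print(utf8_size, utf32_size)

# The euro sign (U+20AC) takes three bytes: one lead byte and two continuation bytes.
euro_labels = [classify_byte(b) for b in "\u20ac".encode("utf-8")]
print(euro_labels)
```

The explicit `utf-32-le` codec is used so that Python does not prepend a byte order mark, which would otherwise add four bytes to the 32-bit total.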

UTF-8 is a multi-byte, variable-width encoding that is backward compatible with ASCII. This has made it one of the most popular character encodings on the web.
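A short Python sketch, using only the built-in `str.encode`, shows what variable width and ASCII compatibility mean in practice. The sample characters and variable names are arbitrary choices for illustration.

```python
# One character from each UTF-8 length class.
characters = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the first 65,536 code points
]

# Number of bytes each character occupies once encoded as UTF-8.
widths = [len(ch.encode("utf-8")) for ch in characters]

for ch, width in zip(characters, widths):
    hex_bytes = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}", width, hex_bytes)

# Backward compatibility: pure ASCII text produces identical bytes in both encodings.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```

Because ASCII text encodes to the same bytes either way, older 8-bit software that only understands ASCII can still read the ASCII portion of a UTF-8 file unchanged.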


