Stranger Things About Java Characters
The mystery of the comment error and other stories...
Introduction
Did you know that the following is a valid Java statement?
\u0069\u006E\u0074 \u0069 \u003D \u0038\u003B
You can copy and paste it into the main method of any class and compile it. If you then add the following statement right after it
System.out.println(i);
and run that class, the code will print the number 8!
And did you know that this comment instead produces a syntax error at compile time?
/*
* The file will be generated inside the C:\users\claudio folder
*/
Yet, comments shouldn't produce syntax errors. In fact, programmers often comment out pieces of code just to make the compiler ignore them... so what's going on?
To find out, let's spend a few minutes reviewing a bit of basic Java: the primitive type char.
Primitive Character Data Type
As everyone knows, the char type is one of the eight primitive Java types. It allows us to store characters, one at a time. Below is a simple example where a character value is assigned to a char variable:
char aCharacter = 'a';
Actually, this data type is not used frequently, because in most cases programmers need character sequences and therefore prefer strings. Each character literal must be enclosed between two single quotes, not to be confused with the double quotes used for string literals. A string declaration follows:
String s = "Java melius semper quam latinam linguam est";
There are three ways to assign a literal value to a char variable, and all three require enclosing the value between single quotes:
- use a single printable character on the keyboard (for example '&').
- use the Unicode format with hexadecimal notation (for example '\u0061', which is equivalent to the decimal number 97 and identifies the 'a' character).
- use a special escape character (for example '\n', which indicates the line feed character).
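The three forms listed above can be put side by side in a small sketch. The class name CharLiteralForms and the method summary() are hypothetical names chosen only for this illustration.

```java
class CharLiteralForms {
    // Returns a short summary so the three literals can be inspected together.
    static String summary() {
        char printable = '&';      // a printable keyboard character
        char unicode = '\u0061';   // Unicode format: hexadecimal 61, decimal 97, the letter a
        char escape = '\n';        // escape character for the line feed (decimal 10)
        return printable + " " + unicode + " " + (int) unicode + " " + (int) escape;
    }

    public static void main(String[] args) {
        System.out.println(summary()); // prints: & a 97 10
    }
}
```

Casting a char to int exposes the number behind the character, which is why the Unicode literal and the plain letter turn out to be the same value.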
Let's add some details in the next three sections.
Printable Keyboard Characters
We can assign any character found on our keyboard to a char variable, provided that our system settings support the required character and that the character is printable (for example, the Delete and Enter keys do not produce printable characters). In any case, the literal assignable to a char primitive type is always enclosed between two single quotes. Here are some examples:
char aUppercase = 'A';
char minus = '-';
char at = '@';
The char data type is stored in 2 bytes (16 bits) and is unsigned: its values range from 0 to 65,535. In fact, there is a 'mapping' that associates a certain character with each number. This mapping (or encoding) is defined by the Unicode standard (further described in the next section).
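As a rough illustration of the numeric nature of char, the following sketch reads the size and range from constants of the standard java.lang.Character class. The class name CharAsNumber and the helper codeOf are assumptions made for this example.

```java
class CharAsNumber {
    // A char widens to int without a cast; the result is its numeric code.
    static int codeOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('A'));                 // 65
        System.out.println(codeOf(Character.MIN_VALUE)); // 0
        System.out.println(codeOf(Character.MAX_VALUE)); // 65535
        System.out.println(Character.SIZE);              // 16 (bits)
        System.out.println(Character.BYTES);             // 2
    }
}
```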
Unicode Format (Hexadecimal Notation)
We said that the char primitive type is stored in 16 bits and can define as many as 65,536 different characters. The Unicode standard deals with standardizing all the characters (and also symbols, emojis, ideograms, etc.) that exist on this planet. Its first 256 codes coincide with the 8-bit ISO 8859-1 (Latin-1) standard, whose first 128 codes in turn coincide with the oldest standard, ASCII (an acronym for American Standard Code for Information Interchange). UTF-8 and UTF-16 are not predecessors of Unicode but ways of encoding it; a Java char holds a single UTF-16 code unit.
We can directly assign a Unicode value to a char in hexadecimal format using 4 digits, which uniquely identify a given character, preceded by the prefix \u (always lower case). For example:
char phiCharacter = '\u03A6'; // Capital Greek letter Φ
char nonIdentifiedUnicodeCharacter = '\uABC8';
In this case, we are talking about a literal in Unicode format (or a literal in hexadecimal format). In fact, 4 hexadecimal digits cover exactly 65,536 characters.
Actually, Java 15 supports Unicode version 13.0, which contains many more characters than 65,536. The Unicode standard has evolved a lot and now allows us to represent potentially over a million characters, although only 143,859 numbers have been assigned to a character so far, and the standard is constantly evolving. To handle Unicode values that are outside the 16-bit range of a char type, we usually use classes like String and Character, but since this is a very rare case and not relevant to the purpose of this article, we will not cover it.
Special Escape Characters
In a char type, it is also possible to store special escape characters, that is, sequences of characters that cause particular behaviors when printed:
- \b is equivalent to a backspace, a deletion to the left (equivalent to the Backspace key).
- \n is equivalent to a line feed (equivalent to the Enter key).
- \\ equals only one \ (just because the \ character is used for escape characters).
- \t is equivalent to a horizontal tab (equivalent to the TAB key).
- \' is equivalent to a single quote (a single quote delimits the literal of a character).
- \" is equivalent to a double quote (a double quote delimits the literal of a string).
- \r represents a carriage return (a special character that moves the cursor to the beginning of the line).
- \f represents a form feed (a disused special character that moves the cursor to the next page of the document).
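Each escape character in the list above stands for exactly one char value. The sketch below, with the hypothetical class name EscapeCodes, collects their numeric codes so they can be compared with the decimal numbers quoted later in the article.

```java
import java.util.Arrays;

class EscapeCodes {
    // Each char literal widens to int, yielding its Unicode number.
    static int[] codes() {
        return new int[] {'\b', '\t', '\n', '\f', '\r', '\"', '\'', '\\'};
    }

    public static void main(String[] args) {
        // prints: [8, 9, 10, 12, 13, 34, 39, 92]
        System.out.println(Arrays.toString(codes()));
    }
}
```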
Note that assigning the literal '"' to a character is perfectly legal, so the following statement:
System.out.println('"');
which is equivalent to the following code:
char doubleQuotes = '"';
System.out.println(doubleQuotes);
is correct and will print the double-quote character:
"
If we tried not to use the escape character for a single-quote, for example, with the following statement:
System.out.println(''');
we would get the following compile-time errors, since the compiler will not be able to distinguish the character delimiters:
error: empty character literal
System.out.println(''');
^
error: unclosed character literal
System.out.println(''');
^
2 errors
Since string literals are delimited by double quotes, the situation there is reversed. In fact, it is possible to write single quotes within a string without escaping them:
System.out.println("'IQ'");
which will print:
'IQ'
On the other hand, we must use the \" escape character to use double quotes within a string. So, the following statement:
System.out.println(""IQ"");
will cause the following compilation errors:
error: ')' expected
System.out.println(""IQ"");
^
error: ';' expected
System.out.println(""IQ"");
^
2 errors
Instead, the following instruction is correct:
System.out.println("\"IQ\"");
and will print:
"IQ"
Write Java Code With the Unicode Format
The Unicode literal format can also be used to replace any line of our code. In fact, the compiler first translates each Unicode escape into the corresponding character and only then evaluates the syntax. For example, we could rewrite the following statement:
int i = 8;
in the following way:
\u0069\u006E\u0074 \u0069 \u003D \u0038\u003B
In fact, if we add the following statement after the previous line:
System.out.println("i = " + i);
it will print:
i = 8
Undoubtedly, this is not a useful way to write our code. But it can be useful to know this feature, as it allows us to understand some mistakes that (rarely) happen.
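For readers who want to try this without editing an existing class, here is a minimal complete sketch. The class name UnicodeStatement and the method value() are hypothetical; the escaped line is the same statement shown above.

```java
class UnicodeStatement {
    static int value() {
        // The next line is translated by the compiler into: int i = 8;
        \u0069\u006E\u0074 \u0069 \u003D \u0038\u003B
        return i;
    }

    public static void main(String[] args) {
        System.out.println("i = " + value()); // prints: i = 8
    }
}
```

The variable i is never declared in readable form, yet the method compiles, because the declaration exists by the time the compiler looks at the syntax.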
Unicode Format for Escape Characters
The fact that the compiler translates Unicode escapes before it evaluates the code has some consequences and justifies the existence of escape characters. For example, let's consider the line feed character, which can be represented with the escape character \n. In the Unicode encoding, the line feed is associated with the decimal number 10 (which corresponds to the hexadecimal number A). But if we try to define it using the Unicode format:
char lineFeed = '\u000A';
we will get the following compile-time error:
error: illegal line end in character literal
char lineFeed = '\u000A';
^
1 error
In fact, the compiler transforms the previous code into the following before evaluating it:
char lineFeed = '
';
The Unicode format has been transformed into an actual new line character, and a character literal split across two lines is not valid syntax for the Java compiler.
Likewise, the single quote character ', which corresponds to the decimal number 39 (equivalent to the hexadecimal number 27) and which we can represent with the escape character \', cannot be represented with the Unicode format:
char singleQuote = '\u0027';
Also, in this case, the compiler will transform the previous code in this way:
char singleQuote = ''';
which will give rise to the following compile-time errors:
error: empty character literal
char singleQuote = '\u0027';
^
error: unclosed character literal
char singleQuote = '\u0027';
^
2 errors
The first error occurs because the first pair of quotes does not contain a character, while the second error indicates that the third single quote opens a character literal that is never closed.
There are also problems with the carriage return character, represented by the hexadecimal number D (corresponding to the decimal number 13) and already representable with the escape character \r. In fact, if we write:
char carriageReturn = '\u000d';
we will get the following compile-time error:
error: illegal line end in character literal
char carriageReturn = '\u000d';
^
1 error
In fact, the compiler has transformed the number in Unicode format into a carriage return, which Java treats as a line terminator just like the line feed. The character literal is therefore broken across two lines, exactly as in the line feed example.
As for the character \, represented by the decimal number 92 (corresponding to the hexadecimal number 5C) and by the escape character \\, if we write:
char backSlash = '\u005C';
we will get the following compile-time error:
error: unclosed character literal
char backSlash = '\u005C';
^
1 error
This is because the previous code will have been transformed into the following:
char backSlash = '\';
and therefore the \' pair of characters is treated as the escape character for a single quote, so the literal is missing its closing single quote.
On the other hand, if we consider the character ", represented by the hexadecimal number 22 (corresponding to the decimal number 34) and by the escape character \", and we write:
char quotationMark = '\u0022';
there will be no problem. But if we use this character within a string:
String quotationMarkString = "\u0022";
we will get the following compile-time error:
error: unclosed string literal
String quotationMarkString = "\u0022";
^
1 error
since the previous code will have been transformed into the following:
String quotationMarkString = """;
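When the hexadecimal number really has to appear in the source, one possible workaround (a suggestion added here, not something the original section prescribes) is to use an integer literal instead of the Unicode format, since integer literals are not touched by the early translation step. The sketch below uses the hypothetical class name SafeCharCodes and checks that each value matches the corresponding escape character.

```java
class SafeCharCodes {
    // An int constant that fits in 16 bits can be assigned to a char directly.
    static final char LINE_FEED = 0x000A;
    static final char CARRIAGE_RETURN = 0x000D;
    static final char SINGLE_QUOTE = 0x0027;
    static final char BACKSLASH = 0x005C;
    // For strings, build the value from the numeric code instead of the Unicode format.
    static final String QUOTATION_MARK = String.valueOf((char) 0x0022);

    static boolean matchesEscapes() {
        return LINE_FEED == '\n'
                && CARRIAGE_RETURN == '\r'
                && SINGLE_QUOTE == '\''
                && BACKSLASH == '\\'
                && QUOTATION_MARK.equals("\"");
    }

    public static void main(String[] args) {
        System.out.println(matchesEscapes()); // prints: true
    }
}
```

In everyday code the escape characters themselves remain the most readable choice; the numeric form is only useful when the code point is what matters.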
The Mystery of the Comment Error
An even stranger situation arises when Unicode formats such as the carriage return or the line feed appear inside single-line comments. For example, despite being commented out, both of the following statements give rise to compile-time errors!
// char lineFeed = '\u000A';
// char carriageReturn = '\u000d';
This is because the compiler always translates the hexadecimal formats into the line feed and carriage return characters, and a line terminator ends a single-line comment. The characters that follow it (here the closing quote and the semicolon) end up outside the comment and are compiled as code!
To solve the situation, use the multi-line comment notation, for example:
/* char lineFeed = '\u000A';
char carriageReturn = '\u000d'; */
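The same mechanism can also change behavior silently instead of producing an error. In the following sketch (the class name CommentSurprise is hypothetical), the text after the Unicode format in the single-line comment happens to be a valid statement, so it is compiled and executed.

```java
class CommentSurprise {
    static int value() {
        int i = 1;
        // this comment is cut short by the escape that follows \u000A i = 2;
        return i;
    }

    public static void main(String[] args) {
        System.out.println(value()); // prints: 2
    }
}
```

After translation, the comment ends at the line feed and the assignment sits on a line of its own, which is why the method returns 2 rather than 1.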
Another mistake that can cost a programmer a lot of time occurs when the sequence \u is used in a comment. For example, with the following comment, we will get a compile-time error:
/*
* The file will be generated inside the C:\users\claudio folder
*/
If the compiler does not find a valid sequence of 4 hexadecimal digits after \u, it will report the following error:
error: illegal unicode escape
* The file will be generated inside the C:\users\claudio folder
^
1 error
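One way to keep such a comment (a suggestion, not part of the original section) is to avoid a backslash immediately before the letter u, for example by writing the path with forward slashes. The sketch below uses the hypothetical class name SafeComments; it also shows that in a string literal the doubled backslash is safe, because a backslash that is itself escaped does not start a Unicode format.

```java
class SafeComments {
    /*
     * The file will be generated inside the C:/users/claudio folder
     * (forward slashes keep the backslash away from the letter u).
     */
    static String path() {
        // The doubled backslash is the escape character for a single backslash.
        return "C:\\users\\claudio";
    }

    public static void main(String[] args) {
        System.out.println(path()); // prints: C:\users\claudio
    }
}
```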
Conclusions
In this article, we have seen that the use of the char type in Java hides some truly surprising special cases. In particular, we have seen that it is possible to write Java code using the Unicode format. This is because the compiler first transforms the Unicode format into a character and then evaluates the syntax. This implies that programmers can find syntax errors where they would never expect them, especially inside comments.
Author Note: This article is a short excerpt from section 3.3.5 Primitive Character Data Type of Volume 1 from my book Java for Aliens. For more information, please visit www.javaforaliens.com (you can download the section 3.3.5 from the Samples area).
Published at DZone with permission of Claudio De Sio Cesari. See the original article here.