String to Unicode Converter Utility

Java utility to convert strings to Unicode sequences and back, solve encoding issues, debug text problems, and ensure cross-system text compatibility.

Michael Gantman

Feb. 05, 26 · Analysis

This article is for Java developers. It describes a Java utility that converts strings to Unicode sequences and back. Many websites and online services offer this kind of text conversion; this utility lets you do the conversion in Java code. It converts any String into a String containing the Unicode sequence that represents the characters of the original String.

The utility can also do the reverse conversion: it converts a Unicode sequence String back into a textual String. For example, the String "Hello World" is converted into "\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064".

What Is This Utility Needed For?

There are several use cases where such a utility would be needed.

Debugging: I have used it multiple times to diagnose and debug very tough encoding problems, where text appears garbled and you need to understand the root cause or find out what the text is supposed to be.
Interoperability: Your application may need to send text to systems that do not support certain encodings, or whose supported encodings are not known. A Unicode sequence consists only of ASCII characters, so sending the text in this form avoids encoding issues, and the codes are universal. That makes it a good fit for interoperability between different systems.
Property files: Sometimes your application reads properties from files in which some text values are in languages that use non-Latin scripts (such as Hebrew, Arabic, Chinese, Japanese, Russian, and many others), and those files need to be distributed to different systems. In that case, you might want to convert the non-Latin text values into Unicode sequences to avoid the danger of a file being saved in an incorrect encoding. The application that reads the properties then needs to convert them back into the original text.
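For the property-file case, it may help to know that java.util.Properties decodes \uXXXX escapes on its own while loading, so no explicit decoding step is needed on that path; other readers would still need one. The sketch below is only an illustration: the class name EscapedPropertiesSketch and the key "greeting" are hypothetical, and the file content is simulated with an in-memory string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration class; not part of MgntUtils.
final class EscapedPropertiesSketch {

    // Loads a simulated, ASCII-only properties file and returns the decoded value.
    static String readGreeting() {
        // What the file would contain on disk: key, '=', and four six-character escapes.
        String fileContent = "greeting=\\u05e9\\u05dc\\u05d5\\u05dd\n";
        Properties props = new Properties();
        try {
            // Properties.load turns each escape back into the character it stands for.
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }

    public static void main(String[] args) {
        // Prints the Hebrew word stored in the file as escapes.
        System.out.println(readGreeting());
    }
}
```

Because the file itself contains only ASCII characters, it survives being saved or transferred in almost any encoding.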

How to Use the Utility

The utility is part of an open-source Java library called MgntUtils, which is available on Maven Central and GitHub (including source code and Javadoc). Here is a direct link to Javadoc. The implementation is in the class StringUnicodeEncoderDecoder. Once the MgntUtils library is included in your project, usage may look like this:

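    Java

    import com.mgnt.utils.StringUnicodeEncoderDecoder;

    String testStr = "Hello World";
    // Convert every character of the text into its Unicode sequence
    String encodedStr = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(testStr);
    System.out.println(encodedStr);
    // Convert the Unicode sequence back into the original text
    String restoredStr = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(encodedStr);
    System.out.println(restoredStr);
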

The output of this code snippet would be:

    Plain Text

    \u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
    Hello World
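To make the round trip concrete, here is a minimal plain-Java sketch of the same idea. It is not the MgntUtils implementation: the class UnicodeSequenceSketch and its encode and decode methods are hypothetical names used only for illustration, and decode assumes its input consists solely of six-character \uXXXX escapes.

```java
// Hypothetical stand-in for illustration; not the MgntUtils implementation.
final class UnicodeSequenceSketch {

    // Writes every Java char (UTF-16 code unit) as backslash, 'u', and four hex digits.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    // Reads the input in six-character steps and turns each escape back into a char.
    static String decode(String sequence) {
        if (sequence.length() % 6 != 0) {
            throw new IllegalArgumentException("Length must be a multiple of 6");
        }
        StringBuilder sb = new StringBuilder(sequence.length() / 6);
        for (int i = 0; i < sequence.length(); i += 6) {
            if (!sequence.startsWith("\\u", i)) {
                throw new IllegalArgumentException("Escape expected at index " + i);
            }
            sb.append((char) Integer.parseInt(sequence.substring(i + 2, i + 6), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("Hello World");
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```

For "Hello World", this sketch prints the same two lines as the library example above.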

A Little More About Encoding Issues

Most encoding issues arise when non-Latin scripts are used, or when text mixes Latin and non-Latin scripts. When a text (or part of it) is displayed as question marks or as garbled characters, the first question is whether this is a display error related to encoding or whether the data itself is corrupted or lost. In that case, it makes sense to convert the problematic String to a Unicode sequence (and sometimes to convert it back, to see whether that fixes the problem).
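The difference between a display error and real data loss can be reproduced in a few lines. The sketch below is an illustration under stated assumptions: GarbledTextSketch and its methods are hypothetical names, and encode is a plain-Java stand-in for the library call, not the MgntUtils code. It damages the single character U+00E9 (e with an acute accent) in two different ways and prints the Unicode sequence of each result.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; encode is a stand-in, not the MgntUtils method.
final class GarbledTextSketch {

    // Writes every Java char as backslash, 'u', and four hex digits.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    // Display-level damage: UTF-8 bytes are decoded with the wrong charset, but no byte is lost.
    static String misread(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Real loss: US-ASCII cannot represent the character, so getBytes substitutes '?'.
    static String lossy(String original) {
        return new String(original.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII);
    }

    // Reverses the wrong decoding step; only possible because the bytes survived.
    static String repair(String misread) {
        return new String(misread.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9";
        System.out.println(encode(misread(original)));
        System.out.println(encode(lossy(original)));
        System.out.println(repair(misread(original)).equals(original));
    }
}
```

The first line printed is \u00c3\u00a9: two characters that carry the original UTF-8 bytes, so the data is intact and only decoded incorrectly. The second line is \u003f, a literal question mark, which means the original character is gone and cannot be recovered from this String.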

In a Unicode sequence, it is easy to see whether the data is actually there behind the garbled display. If the issue is only an incorrect display and the data itself is not lost or corrupted, the \uXXXX codes also tell you what the text was supposed to be. The first two hex digits often indicate the script or Unicode block (for example, 04 for Cyrillic or 06 for Arabic), although not every block lines up with those two digits, and large scripts such as Chinese span many values. The last two digits then identify the specific character within that range.
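Instead of reading the script off the hex digits by eye, the lookup can be left to the JDK. The sketch below uses the standard Character.UnicodeScript.of and Character.getName methods; the class ScriptLookupSketch and its helper methods are hypothetical names for illustration. It handles one six-character escape at a time, so characters that Java stores as a surrogate pair would need extra handling that is not shown here.

```java
// Hypothetical illustration class built on standard JDK lookups.
final class ScriptLookupSketch {

    // Parses one six-character escape (backslash, 'u', four hex digits) into its numeric value.
    static int codeUnitOf(String escape) {
        return Integer.parseInt(escape.substring(2), 16);
    }

    // Returns the Unicode script the escaped character belongs to.
    static Character.UnicodeScript scriptOf(String escape) {
        return Character.UnicodeScript.of(codeUnitOf(escape));
    }

    // Returns the official Unicode name of the escaped character.
    static String nameOf(String escape) {
        return Character.getName(codeUnitOf(escape));
    }

    public static void main(String[] args) {
        String[] escapes = {"\\u0048", "\\u05e9", "\\u0416"};
        for (String escape : escapes) {
            System.out.println(escape + " -> " + scriptOf(escape) + ", " + nameOf(escape));
        }
    }
}
```

For the three sample escapes, the reported scripts are LATIN, HEBREW, and CYRILLIC.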

Limitations and Trade-Offs

Long Strings Problem

Note that when a String is converted to a Unicode sequence, each character is replaced with 6 characters. For example, converting the String "H" results in the String "\u0048". So, if a long String is converted, the result is 6 times longer.

In addition, the primary value of this conversion is that it lets you map each character to its Unicode code. When you work with Strings of up to 10 to 20 characters or so, everything is great. But think about converting a 500-character text and then having to find the Unicode mapping for the 300th character by eye. This is just not practical. The exception is when the converted text is passed to another system rather than analyzed by a human. Even in that case, be aware of the significant increase in String size.
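If every character really does become exactly 6 characters, as described above, finding the code for a given position does not have to be done by eye: the escape for the character at zero-based index i starts at offset 6 * i. The sketch below illustrates that arithmetic. EscapeIndexSketch, encode, and escapeAt are hypothetical names, and encode is a plain-Java stand-in rather than the MgntUtils method.

```java
// Hypothetical illustration class; encode is a stand-in, not the MgntUtils method.
final class EscapeIndexSketch {

    // Writes every Java char as backslash, 'u', and four hex digits.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    // Returns the escape for the character at the given zero-based index of the original text.
    static String escapeAt(String encoded, int charIndex) {
        int start = charIndex * 6;
        return encoded.substring(start, start + 6);
    }

    public static void main(String[] args) {
        String text = "Hello World";
        String encoded = encode(text);
        // The encoded form is six times as long as the original.
        System.out.println(text.length() + " -> " + encoded.length());
        // Index 6 is the 'W' in "Hello World".
        System.out.println(escapeAt(encoded, 6));
    }
}
```

For "Hello World", this prints 11 -> 66 and then \u0057. For the 300th character of a long text, the same call with index 299 would read the six characters starting at offset 1794.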

Converting Strings That Already Contain Unicode Sequences

This utility does not recognize existing Unicode sequences in the String it is asked to encode. Here is a short example: converting the String "H" results in the String "\u0048".

However, if you now convert the String "\u0048" into a Unicode sequence again, the result is "\u005c\u0075\u0030\u0030\u0034\u0038". This is because the utility takes each of the 6 characters of the String "\u0048" one by one and converts each into its own Unicode sequence. (Don't confuse this with the decoding method, StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(), which converts a Unicode sequence String back to the original String. That method, of course, converts the String "\u0048" back into the String "H".)
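One way to avoid accidental double encoding is to check whether a String already looks like a Unicode sequence before encoding it. The sketch below is an assumption-laden illustration, not a feature of MgntUtils: DoubleEncodingSketch, looksEncoded, and encodeOnce are hypothetical names, and encode is a plain-Java stand-in for the library call. Note the trade-off: a text that legitimately consists of literal \uXXXX patterns would be left unencoded by this guard.

```java
// Hypothetical illustration class; encode is a stand-in, not the MgntUtils method.
final class DoubleEncodingSketch {

    // Writes every Java char as backslash, 'u', and four hex digits.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    // True when the whole text is one or more escapes: backslash, 'u', four hex digits.
    static boolean looksEncoded(String text) {
        return text.matches("(\\\\u[0-9a-fA-F]{4})+");
    }

    // Encodes only when the text does not already look like a Unicode sequence.
    static String encodeOnce(String text) {
        return looksEncoded(text) ? text : encode(text);
    }

    public static void main(String[] args) {
        String once = encode("H");
        System.out.println(once);
        // Encoding again escapes the six characters of the first result one by one.
        System.out.println(encode(once));
        // The guarded variant leaves an existing sequence untouched.
        System.out.println(encodeOnce(once));
    }
}
```

The three printed lines are \u0048, then \u005c\u0075\u0030\u0030\u0034\u0038, then \u0048 again.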

Conclusion

I have used this utility very effectively for debugging thorny encoding issues, and others have used it for the same purpose, notably in projects that work with non-Latin scripts. So, it is battle-tested. The MgntUtils library is lightweight and very easy to integrate. Give it a try. I hope it will save you some headaches.

Published at DZone with permission of Michael Gantman. See the original article here.
