Counting Duplicate Characters, Including Unicodes
Learn more about counting duplicate characters in Java.
Join the DZone community and get the full member experience.Join For Free
This article provides two solutions for counting duplicate characters in the given
String, including Unicode characters.
The solution to counting the characters in a string (including special characters such as #, $, and %) implies taking each character and comparing them with the rest. During the comparison, the counting state is maintained via a numeric counter that's increased by one each time the current character is found.
You may also like: Duplicate Objects in Java: Not Just Strings
There are two solutions to this problem.
The first solution iterates the string characters and uses
Map to store the characters as keys and the number of occurrences as values. If the current character was never added to
Map , then add it as (
character, 1). If the current character exists in
Map , then simply increase its occurrences by 1, for example, (
character, occurrences+1). This is shown in the following code:
Another solution relies on Java 8's stream feature. This solution has three steps. The first two steps are meant to transform the given string into
Stream<Character>, while the last step is responsible for grouping and counting the characters. Here are the steps:
- Call the
String.chars()method on the original string. This will return
IntStreamcontains an integer representation of the characters from the given string.
TransformIntStreaminto a stream of characters via the
mapToObj()method (convert the integer representation into the human-friendly character form).
- Finally, group the characters (
Collectors.groupingBy()) and count them (
The following snippet of code glues these three steps into a single method:
What About Unicode characters?
We are pretty familiar with ASCII characters. We have unprintable control codes between 0-31, printable characters between 32-127, and extended ASCII codes between 128-255. But what about Unicode characters? Consider this section for each problem that requires that we manipulate Unicode characters.
So, in a nutshell, early Unicode versions contained characters with values less than 65,535 (0xFFFF). Java represents these characters using the 16-bit
char data type. Calling
charAt(i) works as expected as long as
i don't exceed 65,535. But over time, Unicode has added more characters and the maximum value has reached 1,114,111 (0x10FFFF). These characters don't fit into 16 bits, and so 32-bit values (known as code points) were considered for the UTF-32 encoding scheme.
Unfortunately, Java doesn't support UTF-32! Nevertheless, Unicode has come up with a solution for still using 16 bits to represent these characters. This solution implies the following:
- 16-bit high surrogates: 1,024 values (U+D800 to U+DBFF)
- 16-bit low surrogates: 1,024 values (U+DC00 to U+DFFF)
Now, a high surrogate followed by a low surrogate defines what is known as a surrogate pair. Surrogate pairs are used to represent values between 65,536 (0x10000) and 1,114,111 (0x10FFFF). So, certain characters, known as Unicode supplementary characters, are represented as Unicode surrogate pairs (a one-character (symbol) fits in the space of a pair of characters) that are merged into a single code point. Java takes advantage of this representation and exposes it via a suite of methods such as
offsetByCodePoints() (take a look at the Java documentation for details).
codePointAt() instead of
codePoints() instead of
chars(), and so on helps us to write solutions that cover ASCII and Unicode characters as well.
For example, the well-known two hearts symbol is a Unicode surrogate pair that can be represented as a
char containing two values:
\uDC95. The code point of this symbol is
128149. To obtain a
String object from this code point, call
String str = String.valueOf(Character.toChars(128149)). Counting the code points in
str can be done by calling
str.codePointCount(0, str.length()), which returns 1 even if the
str length is 2. Calling
128149 and calling
Character.toChars(128149) returns 2 since two characters are needed to represent this code point being a Unicode surrogate pair. For ASCII and Unicode 16- bit characters, it will return 1.
So, if we try to rewrite the first solution (that iterates the string characters and uses
Map to store the characters as keys and the number of occurrences as values) to support ASCII and Unicode (including surrogate pairs), we obtain the following code:
The code demarcated as highlighted can be written as follows as well:
Finally, trying to rewrite the Java 8 functional style solution to cover Unicode surrogate pairs can be done as follows:
For third-party library support, please consider Guava:
The complete code is available on GitHub.
If you enjoyed this article, then you'll love my book, Java Coding Problems, that contains a dedicated chapter including 39 problems that involve strings, numbers, and math.
Opinions expressed by DZone contributors are their own.