{{announcement.body}}
{{announcement.title}}

Counting Duplicate Characters, Including Unicodes

DZone 's Guide to

Counting Duplicate Characters, Including Unicodes

Learn more about counting duplicate characters in Java.

· Java Zone ·
Free Resource

Learn more about counting duplicate characters in Java.

This article provides two solutions for counting duplicate characters in the given String, including Unicode characters.

The solution to counting the characters in a string (including special characters such as #, $, and %) implies taking each character and comparing them with the rest. During the comparison, the counting state is maintained via a numeric counter that's increased by one each time the current character is found.

You may also like: Duplicate Objects in Java: Not Just Strings

There are two solutions to this problem.

The first solution iterates the string characters and uses Map to store the characters as keys and the number of occurrences as values. If the current character was never added to Map , then add it as (character, 1). If the current character exists in Map , then simply increase its occurrences by 1, for example, (character, occurrences+1). This is shown in the following code:

Java




xxxxxxxxxx
1
12


1
public Map<Character, Integer> countDuplicateCharacters(String str) {
2
  
3
  Map<Character, Integer> result = new HashMap<>();
4
  
5
  // or use for(char ch: str.toCharArray()) { ... }
6
  for (int i = 0; i<str.length(); i++) {
7
    char ch = str.charAt(i);
8
    result.compute(ch, (k, v) -> (v == null) ? 1 : ++v);
9
  }
10
  
11
  return result;
12
}



Another solution relies on Java 8's stream feature. This solution has three steps. The first two steps are meant to transform the given string into Stream<Character>, while the last step is responsible for grouping and counting the characters. Here are the steps:

  1. Call the String.chars() method on the original string. This will return IntStream. This  IntStream contains an integer representation of the characters from the given string.
  2.  TransformIntStream into a stream of characters via the mapToObj() method (convert the integer representation into the human-friendly character form).
  3. Finally, group the characters (Collectors.groupingBy()) and count them (Collectors.counting()).

The following snippet of code glues these three steps into a single method:

Java




x


 
1
public Map<Character, Long> countDuplicateCharacters(String str) {
2
  
3
  Map<Character, Long> result = str.chars()
4
    .mapToObj(c -> (char) c)
5
    .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
6
  
7
  return result;
8
}



What About Unicode characters?

We are pretty familiar with ASCII characters. We have unprintable control codes between 0-31, printable characters between 32-127, and extended ASCII codes between 128-255. But what about Unicode characters? Consider this section for each problem that requires that we manipulate Unicode characters.

So, in a nutshell, early Unicode versions contained characters with values less than 65,535 (0xFFFF). Java represents these characters using the 16-bit char  data type. Calling charAt(i) works as expected as long as i  don't exceed 65,535. But over time, Unicode has added more characters and the maximum value has reached 1,114,111 (0x10FFFF). These characters don't fit into 16 bits, and so 32-bit values (known as code points) were considered for the UTF-32 encoding scheme.

Unfortunately, Java doesn't support UTF-32! Nevertheless, Unicode has come up with a solution for still using 16 bits to represent these characters. This solution implies the following:

  • 16-bit high surrogates: 1,024 values (U+D800 to U+DBFF)
  • 16-bit low surrogates: 1,024 values (U+DC00 to U+DFFF)

Now, a high surrogate followed by a low surrogate defines what is known as a surrogate pair. Surrogate pairs are used to represent values between 65,536 (0x10000) and 1,114,111 (0x10FFFF). So, certain characters, known as Unicode supplementary characters, are represented as Unicode surrogate pairs (a one-character (symbol) fits in the space of a pair of characters) that are merged into a single code point. Java takes advantage of this representation and exposes it via a suite of methods such as codePointAt(),  codePoints(),  codePointCount(), and offsetByCodePoints() (take a look at the Java documentation for details).

Calling codePointAt() instead of charAt(),  codePoints() instead of  chars(), and so on helps us to write solutions that cover ASCII and Unicode characters as well.

For example, the well-known two hearts symbol is a Unicode surrogate pair that can be represented as a char[] containing two values: \uD83D and \uDC95. The code point of this symbol is 128149. To obtain a String object from this code point, call  String str = String.valueOf(Character.toChars(128149)). Counting the code points in  str  can be done by calling str.codePointCount(0, str.length()), which returns 1 even if the str length is 2. Calling str.codePointAt(0) returns 128149 and calling  str.codePointAt(1) returns 56469. Calling Character.toChars(128149) returns 2 since two characters are needed to represent this code point being a Unicode surrogate pair. For ASCII and Unicode 16- bit characters, it will return 1.

So, if we try to rewrite the first solution (that iterates the string characters and uses Map to store the characters as keys and the number of occurrences as values) to support ASCII and Unicode (including surrogate pairs), we obtain the following code:

Java
xxxxxxxxxx
1
19
 
1
public static Map<String, Integer> countDuplicateCharacters(String str) {
2
  
3
  Map<String, Integer> result = new HashMap<>();
4
  for (int i = 0; i < str.length(); i++) {
5
    
6
    // -- highlighted code start --
7
    int cp = str.codePointAt(i);
8
      
9
    String ch = String.valueOf(Character.toChars(cp));      
10
    if(Character.charCount(cp) == 2) { // 2 means a surrogate pair
11
      i++;
12
    }
13
    // -- highlighted code end --
14
      
15
    result.compute(ch, (k, v) -> (v == null) ? 1 : ++v);
16
  }
17
 
          
18
  return result;
19
}


The code demarcated as highlighted can be written as follows as well:

Java




xxxxxxxxxx
1


 
1
String ch = String.valueOf(Character.toChars(str.codePointAt(i)));
2
if (i < str.length() - 1 && str.codePointCount(i, i + 2) == 1) {
3
  i++;
4
}



Finally, trying to rewrite the Java 8 functional style solution to cover Unicode surrogate pairs can be done as follows:

Java




xxxxxxxxxx
1


 
1
public static Map<String, Long> countDuplicateCharacters(String str) {
2
  
3
  Map<String, Long> result = str.codePoints()
4
    .mapToObj(c -> String.valueOf(Character.toChars(c)))
5
    .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
6
  
7
  return result;
8
}



For third-party library support, please consider Guava: Multiset<String> .

The complete code is available on GitHub.

If you enjoyed this article, then you'll love my book, Java Coding Problems, that contains a dedicated chapter including 39 problems that involve strings, numbers, and math.

Further Reading

Duplicate Objects in Java: Not Just Strings

The Right Way to Reverse a String in Java

Topics:
java ,duplicate ,string ,unicode ,tutorial ,count

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}