Canonical Equivalence in Unicode Pattern Matching

By Bengi Egima · Oct. 03, 16 · Tutorial

Introduction

In this article, we will define and explain the term Canonical Equivalence as applied to pattern matching according to the Unicode character specification.

Pattern matching is one of the most common concepts in computer science. Take a look at the Wikipedia definition:

In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact. The patterns generally have the form of either sequences or tree structures.

Notice how it differs from pattern recognition: the word "exact" is key here. Canonical equivalence is the term the Unicode Standard uses for different sequences of code points that represent the same character, and it determines what an exact match should mean for Unicode text.

Canonical Explained

Wikipedia defines Canonicalization as:

A process for converting data that has more than one possible representation into a “standard” canonical representation.

In this context, a value may have many forms or representations, all of which must boil down to a single standard form, known as the canonical form.

A common example in programming that helps explain canonical representation is the XML Schema datatype definition of “boolean”:

The “lexical representation” of boolean can be one of: true, false, 1, 0.

On the other hand, the “canonical representation” can only be one of: true, false.

This means that at the end of it all, every accepted boolean value must be able to map to its canonical equivalent:

1 -> true

0 -> false
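
The mapping above can be sketched in Java as a small lookup table. This is only an illustration: the class name XmlBooleanCanonicalizer and the method canonicalize are hypothetical and not part of any XML library, and Map.of assumes Java 9 or later.

```java
import java.util.Map;

public class XmlBooleanCanonicalizer {
    // Every lexical form allowed for the XML Schema boolean type,
    // mapped to its canonical form.
    private static final Map<String, String> CANONICAL = Map.of(
            "true", "true",
            "false", "false",
            "1", "true",
            "0", "false");

    /** Returns the canonical form, or throws if the input is not a valid lexical form. */
    public static String canonicalize(String lexical) {
        String canonical = (lexical == null) ? null : CANONICAL.get(lexical);
        if (canonical == null) {
            throw new IllegalArgumentException("Not a boolean lexical form: " + lexical);
        }
        return canonical;
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("1"));     // true
        System.out.println(canonicalize("false")); // false
        // "1" and "true" are equivalent because they share a canonical form.
        System.out.println(canonicalize("1").equals(canonicalize("true"))); // true
    }
}
```

Two lexical forms are then considered equivalent exactly when canonicalize returns the same string for both.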

Another useful way to look at the word canonical is in theological terms where it refers to the real truth that has been and will always be the real truth regardless of any layers of information added to it.

In Java, when we compare two objects:

a.equals(b)

This calls the equals method defined by the object's class, which typically compares the state of the two objects field by field. (The default implementation inherited from Object simply compares references.) If any compared field differs, the test fails. If what you want to know is whether two variables refer to the very same object, use the == operator instead:

a == b

This comparison is cheaper because it relies on a single canonical truth about the two objects: their identity, that is, the reference itself. Note that it answers a different question from equals: two distinct objects with identical contents are equal but not identical.
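
A minimal sketch of the difference between the two comparisons, using String because its equals method compares contents. The class name EqualsVersusIdentity is made up for this illustration.

```java
public class EqualsVersusIdentity {
    public static void main(String[] args) {
        // new String(...) forces two distinct objects with the same contents.
        String a = new String("canonical");
        String b = new String("canonical");
        String c = a; // a second reference to the same object as a

        System.out.println(a.equals(b)); // true: same contents
        System.out.println(a == b);      // false: two different objects
        System.out.println(a == c);      // true: one object, two references
    }
}
```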

Canonical Equivalence In Unicode

Having looked at the term canonical in depth, it is now easy to conclude that two different representations are canonically equivalent if, and only if, their canonical forms are the same.

In file systems, the canonical form can be the absolute, fully resolved path to a file, which identifies the file regardless of the current working directory.
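
The file-system case can be sketched with java.nio.file.Path. The class name CanonicalPathDemo and the file names are placeholders. Note that normalize() works purely on the text of the path; Path.toRealPath() and File.getCanonicalPath() go further and consult the file system, for example to resolve symbolic links.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class CanonicalPathDemo {
    /** Makes the path absolute and removes "." and ".." segments lexically. */
    public static Path canonicalize(String path) {
        return Paths.get(path).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        // Three different spellings of the same location.
        Path p1 = canonicalize("docs/../notes.txt");
        Path p2 = canonicalize("./notes.txt");
        Path p3 = canonicalize("notes.txt");

        System.out.println(p1.equals(p2) && p2.equals(p3)); // true
    }
}
```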

Finally, in Unicode terms, canonical equivalence is a fundamental equivalence between individual Unicode characters and sequences of Unicode characters.

Let’s take a look at an example:

  • In Unicode, there is a character e, whose code point is U+0065. Unicode also has a precomposed character é, which is the same character e with an acute accent on top of it. The code point for the accented e is U+00E9. However, the accent itself is also a Unicode character, the combining acute accent, with code point U+0301. This means that the precomposed character U+00E9 is canonically equivalent to the character sequence U+0065 U+0301.
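
As a rough illustration, the standard java.text.Normalizer class can be used to check this equivalence: two strings are canonically equivalent when they are equal after being normalized to the same form (NFD is used here). The class and method names CanonicalEquivalenceDemo and canonicallyEquivalent are made up for this sketch.

```java
import java.text.Normalizer;

public class CanonicalEquivalenceDemo {
    /** True if both strings have the same canonical decomposition (NFD). */
    public static boolean canonicallyEquivalent(String s1, String s2) {
        return Normalizer.normalize(s1, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(s2, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";      // the single code point U+00E9
        String decomposed = "\u0065\u0301"; // U+0065 followed by U+0301

        // A plain comparison sees two different char sequences.
        System.out.println(precomposed.equals(decomposed)); // false
        // After normalization they are the same.
        System.out.println(canonicallyEquivalent(precomposed, decomposed)); // true
    }
}
```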

Canonical Equivalence In Java

The notion of a canonical form is encountered in a number of areas of the Java programming language. Knowing about it will save you some hair loss. We looked at one example earlier with object comparison; here are some more.

Java Reflection

When we inspect a Java class using reflection, we can retrieve the class name of an object in different ways. Consider a simple class called Person, defined in Person.java:

package com.egima;

public class Person {
    private String name;
    private int age;

    public Person() {
        super();
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }
}

Let’s run some inspection on this class:

public static void main(String[] args) {
    Person ob = new Person();
    Class<?> clazz = ob.getClass();
    String name = clazz.getSimpleName();
    String canonicalName = clazz.getCanonicalName();
    System.out.println("Simple Class Name: " + name);
    System.out.println("Canonical Class name= " + canonicalName);
}

Console output:

Simple Class Name: Person
Canonical Class name= com.egima.Person

The simple name is sufficient only when we do not care which Person class we mean: another Person class in a different package would have exactly the same simple name. The canonical name includes the package and is therefore unambiguous.

When you load a class reflectively with Class.forName, the lookup succeeds only if you pass the fully qualified name. For a top-level class such as Person, this is the same as the canonical name. For nested classes, the binary name returned by getName (for example Outer$Inner) is required instead.

The following code will fail and throw a ClassNotFoundException:

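public static void main(String[] args) throws ReflectiveOperationException {
    // "Person" is not a fully qualified name, so this line throws ClassNotFoundException
    Class<?> clazz = Class.forName("Person");
    Person person = (Person) clazz.getDeclaredConstructor().newInstance();
}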

While this one will pass:

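public static void main(String[] args) throws ReflectiveOperationException {
    Class<?> clazz = Class.forName("com.egima.Person");
    // Class.newInstance() is deprecated since Java 9; go through the constructor instead
    Person person = (Person) clazz.getDeclaredConstructor().newInstance();
}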
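
One caveat worth illustrating: for a top-level class the canonical name and the name that Class.forName expects are identical, but for a nested class they differ. The sketch below uses a made-up class ClassNamesDemo placed in the default package, so the printed names carry no package prefix.

```java
public class ClassNamesDemo {
    /** A nested class, used only to show how the different names diverge. */
    public static class Inner {
    }

    public static void main(String[] args) throws ClassNotFoundException {
        Class<?> clazz = Inner.class;

        System.out.println(clazz.getSimpleName());    // Inner
        System.out.println(clazz.getCanonicalName()); // ClassNamesDemo.Inner
        System.out.println(clazz.getName());          // ClassNamesDemo$Inner

        // Class.forName expects the binary name returned by getName().
        System.out.println(Class.forName(clazz.getName()) == clazz); // true
    }
}
```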

Java Regex

When searching Unicode text with the Java regular expression API, we must understand canonical equivalence, or else we will waste a lot of time and effort with little success.

Let’s consider our earlier example of the accented character é. We concluded that it can be represented either as the single precomposed code point U+00E9 or as the sequence U+0065 U+0301.

Now let us see this in action:

public static void main(String[] args) {
    Pattern pattern = Pattern.compile("\u00E9");
    Matcher matcher = pattern.matcher("Text with unicode accented character \u0065\u0301");
    boolean found = matcher.find();
    System.out.println("Was the character found: " + found);
}

And the output is:

Was the character found: false

The character was not found and yet we know the composite character used as the regex pattern and the sequence of characters in the input text are canonically equivalent.

What this means is that the Java regex engine does not take canonical equivalence into consideration by default, so it compares the character \u00E9 against the sequence \u0065 \u0301 code point by code point, without first reducing them to a common canonical form.

If we want the engine to take canonical equivalence into account while searching, we must pass the Pattern.CANON_EQ flag of the java.util.regex.Pattern class when compiling the regex. The Pattern documentation notes that this flag may impose a performance penalty:

public static void main(String[] args) {
    Pattern pattern = Pattern.compile("\u00E9", Pattern.CANON_EQ);
    Matcher matcher = pattern.matcher("Text with unicode accented character \u0065\u0301");
    boolean found = matcher.find();
    System.out.println("Was the character found: " + found);
}

Now the output is:

Was the character found: true
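
An alternative, sketched here under the assumption that the pattern itself is already written in composed form (NFC), is to normalize the input text with java.text.Normalizer before matching and leave the flag off. The class and method names NormalizedRegexDemo and findNormalized are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class NormalizedRegexDemo {
    /**
     * Normalizes the input to NFC and searches it with a pattern compiled
     * without CANON_EQ. Assumes the regex is already in NFC.
     */
    public static boolean findNormalized(String regex, String input) {
        String nfcInput = Normalizer.normalize(input, Normalizer.Form.NFC);
        return Pattern.compile(regex).matcher(nfcInput).find();
    }

    public static void main(String[] args) {
        String input = "Text with unicode accented character \u0065\u0301";

        // Without normalization the precomposed pattern does not match.
        System.out.println(Pattern.compile("\u00E9").matcher(input).find()); // false
        // After normalizing the input to NFC it does.
        System.out.println(findNormalized("\u00E9", input)); // true
    }
}
```

Normalizing once up front can be convenient when the same text is searched many times, whereas CANON_EQ keeps the original text untouched.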

Conclusion

In this article, we discussed the notion of canonical equivalence as defined by Unicode. Because strings in most programming languages are built on Unicode, the same issue arises in all of them; in Java, the Pattern.CANON_EQ flag makes the regex engine take it into account.


Published at DZone with permission of Bengi Egima, DZone MVB.
