In this article, we discuss how to measure how far apart two sequences are when one of the sequences if a permutation of the other.
Join the DZone community and get the full member experience.Join For Free
If you have two sequences, and one is a permutation of the other, how do you measure how far apart the two sequences are?
This post was motivated by the previous post on anagram statistics. For a certain dictionary, the code used in that post found that the longest anagram pairs were the following:
Anagrams are more amusing when the two words are more dissimilar, at least in my opinion.
There are several ways to measure (dis)similarity. The words in the first two pairs above are dissimilar in meaning, but the words in the last pair are synonyms . The pairs of words above are not that dissimilar as character strings.
It would be hard to quantify how dissimilar two words are in meaning, but easier to quantify how dissimilar they are as sequences of characters. We’ll look at two approaches: Hamming distance and transposition distance.
You may also like: Mathematical Functions and Converting Data Types in Groovy.
The Hamming distance between two words is the number of positions in which they differ. We can see the Hamming distances between the words above by viewing them in a fixed-width font.
certifications rectifications impressiveness permissiveness teaspoonsful teaspoonfuls
The Hamming distances above are 2, 5, and 4.
The anagrams with a maximum Hamming distance in my dictionary are “imperfections” and “perfectionism.” (There’s something poetic about these words being anagrams.) The Hamming distance is 13 because these 13-letter words differ in every position.
Another way to measure the distance between two permutations is how many elements would have to be swapped to turn one list into the other. This is called the transposition distance. For example, the transposition distance between “certifications” and “rectifications” is 1 because you only need to swap the first and third characters to turn one into the other.
It’s not so obvious what the transposition distance is between “impressiveness” and “permissiveness.” We can find an upper bound on the distance by experimentation. The two words only differ in the first five letters, so the distance between “impressiveness” and “permissiveness” is no larger than the distance between “impre” and “permi.” The latter pair of words are no more than 4 transpositions apart, as the following sequence shows:
impre pmire peirm perim permi
But is there a shorter sequence of transpositions? Would it help if we used the rest of the letters “ssiveness” as the working space?
Computing transposition distance is hard, as in NP-hard. This is unfortunate since transposition has more practical applications than just word games. It is used, for example, in genetic research. It may be practical to compute short sequences, but, in general, one must rely on bounds and approximations.
Note that transposition distance allows swapping any two elements. If we only allowed swapping consecutive elements, the problem is much easier, but the results are not the same. When restricted to consecutive swaps, the distance between “certifications” and “rectifications” is 3 rather than 1. We can swap the “c” and the “r” to turn “cer” into “rec,” so we have to do something like
cer ecr erc rec
I don’t know a name for the distance where you allow rotations. The words “teaspoonsful” and “teaspoonfuls” differ by one rotation, turning “sful” into “fuls.” While this can be done in one rotation, it would take several swaps.
- Levenshtein distance.
- Translating Robert Burns.
- MachineX: Cosine Similarity for Item-Based Collaborative Filtering.
- Introduction to Classification Algorithms.
 Although “teaspoonfuls” is far more common, I remember a schoolmarm teaching us that this is incorrect.
And I still recoil at “brother in laws” and say “brothers in law” instead; the majority is on my side on this one.
Published at DZone with permission of John Cook , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.