Algorithm of the Week: BK-Trees (from "Damn Cool Algorithms")
This is the first post in (hopefully) a series of posts on Damn Cool Algorithms: essentially, any algorithm I think is really Damn Cool, particularly if it's simple but non-obvious.
BK-Trees, or Burkhard-Keller Trees, are a tree-based data structure engineered for quickly finding near-matches to a string, for example, as used by a spelling checker, or when doing a 'fuzzy' search for a term. The aim is to return, for example, "seek" and "peek" if I search for "aeek". What makes BK-Trees so cool is that they take a problem which has no obvious solution besides brute-force search, and present a simple and elegant method for speeding up searches substantially.
BK-Trees were first proposed by Burkhard and Keller in 1973, in their paper "Some approaches to best match file searching". The only copy of this online seems to be in the ACM archive, which is subscription only. Further details, however, are provided in the excellent paper "Fast Approximate String Matching in a Dictionary".
Before we can define BK-Trees, we need to define a couple of preliminaries. In order to index and search our dictionary, we need a way to compare strings. The canonical method for this is the Levenshtein Distance, which takes two strings and returns a number representing the minimum number of insertions, deletions and replacements required to transform one string into the other. Other string distance functions are also acceptable (for example, one incorporating the concept of transpositions as an atomic operation could be used), as long as they meet the criteria defined below.
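To make the distance function concrete, here is a minimal Python sketch of the Levenshtein Distance using the standard row-by-row dynamic programming approach. The function name `levenshtein` is my own choice for illustration, not something taken from the papers above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # previous[j] is the distance between the characters of a processed so far
    # (before the current one) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # replace ca with cb (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("aeek", "seek"))   # 1
print(levenshtein("book", "nooks"))  # 2
```

This version keeps only two rows of the table at a time, which is enough when all we need is the distance itself.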
Now we can make a particularly useful observation about the Levenshtein Distance: It forms a Metric Space. Put simply, a metric space is any relationship that adheres to three basic criteria:
 d(x,y) = 0 <=> x = y (The distance between x and y is 0 if and only if x = y)
 d(x,y) = d(y,x) (The distance from x to y is the same as the distance from y to x)
 d(x,y) + d(y,z) >= d(x,z)
The last of these criteria is called the Triangle Inequality. The Triangle Inequality states that the path from x to z must be no longer than any path that goes through another intermediate point (the path from x to y to z). Look at a triangle, for example: it's not possible to draw a triangle such that it's quicker to get from one point to another by going along two sides than it is by going along the other side.
These three criteria, basic as they are, are all that's required for something such as the Levenshtein Distance to qualify as a Metric Space. Note that this is far more general than, for example, a Euclidean Space: a Euclidean Space is metric, but many Metric Spaces (such as the Levenshtein Distance) are not Euclidean. Now that we know that the Levenshtein Distance (and other similar string distance functions) embodies a Metric Space, we come to the key observation of Burkhard and Keller.
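As a sanity check, the three criteria can be tested by brute force over a handful of words. This is only a spot check on a small sample, not a proof, and the helper name `is_metric` and the word list are assumptions made for this sketch.

```python
from itertools import product


def levenshtein(a, b):
    """Row-by-row dynamic programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def is_metric(dist, items):
    """Check the three metric criteria on every combination of the given items."""
    for x, y in product(items, repeat=2):
        if (dist(x, y) == 0) != (x == y):   # d(x,y) = 0 <=> x = y
            return False
        if dist(x, y) != dist(y, x):        # symmetry
            return False
    for x, y, z in product(items, repeat=3):
        if dist(x, y) + dist(y, z) < dist(x, z):  # triangle inequality
            return False
    return True


words = ["book", "rook", "nooks", "boon", "seek", "peek", "aeek"]
print(is_metric(levenshtein, words))  # True
```

The same helper can be pointed at any other candidate distance function before using it to build a tree.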
Assume for a moment we have two parameters: query, the string we are using in our search, and n, the maximum distance a string can be from query and still be returned. Say we take an arbitrary string, test, and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test.
From here, the construction of a BK-Tree is simple: Each node has an arbitrary number of children, and each edge has a number corresponding to a Levenshtein distance. All the subnodes on the edge numbered n have a Levenshtein distance of exactly n to the parent node. So, for example, if we have a tree with parent node "book" and two child nodes "rook" and "nooks", the edge from "book" to "rook" is numbered 1, and the edge from "book" to "nooks" is numbered 2.
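Both bounds come from applying the triangle inequality twice. Writing r for any string that should be returned, so that d(query, r) <= n (the name r is introduced here purely for illustration), and using symmetry where needed:

```latex
\begin{align*}
d(\mathit{test}, r) &\le d(\mathit{test}, \mathit{query}) + d(\mathit{query}, r) \le d + n, \\
d = d(\mathit{test}, \mathit{query}) &\le d(\mathit{test}, r) + d(r, \mathit{query}) \le d(\mathit{test}, r) + n
  \;\Longrightarrow\; d(\mathit{test}, r) \ge d - n.
\end{align*}
```

So any string whose distance from test falls outside the range d-n to d+n cannot be a result, and can be skipped without ever comparing it to query.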
To build the tree from a dictionary, take an arbitrary word and make it the root of your tree. Whenever you want to insert a word, take the Levenshtein distance between your word and the root of the tree, and find the edge with number d(newword,root). Recurse, comparing your new word with the child node on that edge, and so on, until there is no child node, at which point you create a new child node and store your new word there. For example, to insert "boon" into the example tree above, we would examine the root, find that d("book", "boon") = 1, and so examine the child on the edge numbered 1, which is the word "rook". We would then calculate the distance d("rook", "boon"), which is 2, and so insert the new word under "rook", with an edge numbered 2.
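The insertion procedure can be sketched in a few lines of Python. The names `BKNode` and `insert` are hypothetical, and silently ignoring a word that is already in the tree (distance 0) is my own assumption, since the description above doesn't cover duplicates.

```python
def levenshtein(a, b):
    """Row-by-row dynamic programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class BKNode:
    def __init__(self, word):
        self.word = word
        self.children = {}  # edge number -> child BKNode


def insert(root, word):
    """Walk down the edges numbered d(word, node) until a free slot is found."""
    node = root
    while True:
        d = levenshtein(word, node.word)
        if d == 0:
            return  # assumption: a word already in the tree is ignored
        child = node.children.get(d)
        if child is None:
            node.children[d] = BKNode(word)
            return
        node = child


root = BKNode("book")
for w in ["rook", "nooks", "boon"]:
    insert(root, w)

print(sorted(root.children))                   # [1, 2]
print(root.children[1].word)                   # rook
print(root.children[2].word)                   # nooks
print(root.children[1].children[2].word)       # boon
```

Storing the children in a dictionary keyed by edge number keeps the "find the edge numbered d" step a single lookup.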
To query the tree, take the Levenshtein distance d from your term to the root, and recursively query every child node on an edge numbered between d-n and d+n (inclusive). If the node you are examining is within n of your search term, return it and continue your query.
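Here is a minimal Python sketch of the query, using an explicit stack in place of recursion. The names `BKNode`, `insert` and `query` and the small dictionary are assumptions for illustration only.

```python
def levenshtein(a, b):
    """Row-by-row dynamic programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class BKNode:
    def __init__(self, word):
        self.word = word
        self.children = {}  # edge number -> child BKNode


def insert(root, word):
    node = root
    while True:
        d = levenshtein(word, node.word)
        if d == 0:
            return  # assumption: duplicates are ignored
        if d not in node.children:
            node.children[d] = BKNode(word)
            return
        node = node.children[d]


def query(root, term, n):
    """Return sorted (distance, word) pairs for every word within n of term."""
    results = []
    stack = [root]
    while stack:
        node = stack.pop()
        d = levenshtein(term, node.word)
        if d <= n:
            results.append((d, node.word))
        # Only edges numbered d-n .. d+n can lead to further matches.
        for edge, child in node.children.items():
            if d - n <= edge <= d + n:
                stack.append(child)
    return sorted(results)


words = ["book", "rook", "nooks", "boon", "seek", "peek"]
root = BKNode(words[0])
for w in words[1:]:
    insert(root, w)

print(query(root, "aeek", 1))  # [(1, 'peek'), (1, 'seek')]
```

In this tiny tree the search for "aeek" never looks at "rook" or "boon": the root is at distance 3, so only edges numbered 2 to 4 are followed.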
The tree is N-ary and irregular (but generally well-balanced). Tests show that searching with a distance of 1 queries no more than 5-8% of the tree, and searching with two errors queries no more than 17-25% of the tree, a substantial improvement over checking every node! Note that exact searching can also be performed fairly efficiently by simply setting n to 0.
Looking back on this, the post is rather longer and seems more involved than I had anticipated. Hopefully, you will agree after reading it that the insight behind BK-Trees is indeed elegant and remarkably simple.
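To see the pruning at work, a query can be instrumented to count how many nodes it actually compares against, and checked against a brute-force scan. This is only a sketch on a six-word dictionary (the names and word list are assumptions), so it will not reproduce the percentages quoted above, which come from tests on real dictionaries.

```python
def levenshtein(a, b):
    """Row-by-row dynamic programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class BKNode:
    def __init__(self, word):
        self.word = word
        self.children = {}  # edge number -> child BKNode


def insert(root, word):
    node = root
    while True:
        d = levenshtein(word, node.word)
        if d == 0:
            return  # assumption: duplicates are ignored
        if d not in node.children:
            node.children[d] = BKNode(word)
            return
        node = node.children[d]


def query_counting(root, term, n):
    """Return (sorted matches, number of nodes whose distance was computed)."""
    matches, visited, stack = [], 0, [root]
    while stack:
        node = stack.pop()
        visited += 1
        d = levenshtein(term, node.word)
        if d <= n:
            matches.append(node.word)
        stack.extend(child for edge, child in node.children.items()
                     if d - n <= edge <= d + n)
    return sorted(matches), visited


words = ["book", "rook", "nooks", "boon", "seek", "peek"]
root = BKNode(words[0])
for w in words[1:]:
    insert(root, w)

matches, visited = query_counting(root, "aeek", 1)
brute_force = sorted(w for w in words if levenshtein("aeek", w) <= 1)
print(matches == brute_force)                      # True
print(f"visited {visited} of {len(words)} nodes")  # visited 4 of 6 nodes
```

Running the same comparison over a full word list is a reasonable way to measure how much of the tree a given n touches for your own data.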
Published at DZone with permission of Nick Johnson. See the original article here.