Ruzzle is a popular word game whose objective is to find as many words as possible in a 4x4 grid of letters.
On the train to Rome last week, my colleagues from the Onebip team and I exercised with a new kata: a solver for Ruzzle that finds as many words as possible in a given time frame.
Words must be valid in a chosen language (Italian, in our case), but the language doesn't really matter for the purpose of the kata. Each solver can assume it has a list of all the valid words for the language, stored in a file: this information is a prerequisite for building a solver.
All 26 letters can be used, and letters can repeat inside a grid. Words are constructed by moving from one cell to one of its (at most) 8 adjacent cells, without using any cell more than once in a word. The score is calculated as in Scrabble, by summing the weights of all the letters in your words (but no score calculation is involved in this kata).
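The adjacency rule can be sketched like this (a minimal Python example of mine, not the kata's code; the `neighbors` name is an assumption):

```python
def neighbors(row, col, size=4):
    """Return the coordinates of the (at most) 8 cells adjacent to (row, col)."""
    result = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            # Skip the cell itself and anything outside the grid
            if (dr, dc) != (0, 0) and 0 <= r < size and 0 <= c < size:
                result.append((r, c))
    return result

print(len(neighbors(0, 0)))  # a corner cell: 3 neighbors
print(len(neighbors(1, 1)))  # an inner cell: 8 neighbors
```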
This kata has a medium algorithmic difficulty: the main problem to solve is the generation of valid paths inside the graph formed by the 16 cells. We didn't go down the grid-as-a-graph route, but I suspect good results can come from that model, given the wide availability of algorithms that work on graphs.
However, the strongest requirement is maintaining acceptable performance, since Ruzzle is a time-limited game (2-minute rounds).
The simplest solution that comes to mind, and the one each of us dived into, was to generate all the possible paths starting from each cell, and check whether any path of length L is a valid word in the given dictionary.
First of all, caution is needed to implement this strategy: a breadth-first search will probably perform better than a depth-first one, and lets you define an arbitrary maximum word length.
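Such a breadth-first generation of paths could look like this (a Python sketch under my own naming, not our actual kata code):

```python
from collections import deque

def all_paths(grid, max_length):
    """Yield the string spelled by every path of up to max_length
    distinct, adjacent cells, in breadth-first order."""
    size = len(grid)
    # Start with one single-cell path per cell of the grid
    queue = deque([[(r, c)] for r in range(size) for c in range(size)])
    while queue:
        path = queue.popleft()
        yield "".join(grid[r][c] for r, c in path)
        if len(path) < max_length:
            r, c = path[-1]
            # Extend the path into every adjacent, not-yet-used cell
            for nr in range(r - 1, r + 2):
                for nc in range(c - 1, c + 2):
                    if (0 <= nr < size and 0 <= nc < size
                            and (nr, nc) not in path):
                        queue.append(path + [(nr, nc)])
```

Each yielded string would then be checked against the dictionary; the `max_length` cap is what keeps this naive strategy from running forever.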
However, each cell has an average of 5.25 neighbors (from a maximum of 8 in the middle down to just 3 in the corners); so by the time this search reaches 8-letter words, it has already generated 16*5.25^7 = 1.7M paths, which is more than the number of words in the English language (fewer than 1M).
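These figures can be checked with some quick arithmetic:

```python
# Average branching factor on a 4x4 grid, ignoring the no-repeat rule:
# 4 corners with 3 neighbors, 8 edge cells with 5, 4 inner cells with 8.
avg = (4 * 3 + 8 * 5 + 4 * 8) / 16
print(avg)  # 5.25

# Rough number of paths generated by the time the search reaches
# 8-cell paths: 16 starting cells, ~5.25 extensions at each step.
print(int(16 * avg ** 7))  # roughly 1.7 million
```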
Thus, to avoid this explosion in complexity, we inverted the problem and started searching for every single word of the dictionary inside the grid. This approach lends itself to optimization, since you can exclude many words without even starting to search for them (if they contain a letter that is not in the grid), or after just the first few branches of the tree, when no 2-cell path matches them.
For example, a grid containing only the letters from A to D lets you exclude all words containing letters from E to Z without starting a new search, but also words like Aaron after just two steps.
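The inverted approach can be sketched as follows (a minimal Python version of mine; the `find_word` name and structure are assumptions, not the kata's code):

```python
def find_word(grid, word):
    """Return True if word can be traced through adjacent, distinct cells."""
    size = len(grid)
    letters = {letter for row in grid for letter in row}
    # Cheapest pruning: a word containing a letter that is not in the
    # grid can be discarded without starting any search at all.
    if not set(word) <= letters:
        return False

    def extend(r, c, index, used):
        if grid[r][c] != word[index]:
            return False  # prune this branch immediately
        if index == len(word) - 1:
            return True
        used = used | {(r, c)}
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if ((dr, dc) != (0, 0)
                        and 0 <= nr < size and 0 <= nc < size
                        and (nr, nc) not in used
                        and extend(nr, nc, index + 1, used)):
                    return True
        return False

    return any(extend(r, c, 0, frozenset())
               for r in range(size) for c in range(size))
```

On the grid `["abcd", "efgh", "ijkl", "mnop"]`, for instance, `find_word` accepts "knife" (k, n, i, f, e form an adjacent chain) but rejects "az" without even starting a search.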
Here are some scattered findings from our experience in developing the solver in TDD, and from the shock of having to drop our initial approach.
- This kata is ideal for training on performance optimization. There are many points where computational complexity comes into play: the development or reuse of data structures such as sets or hash tables; caching opportunities; the use of a profiler to find bottlenecks and validate the theoretical complexity of the algorithms.
- A bit of software engineering is capable of absorbing part of the shock. Even though we threw away the outer part of the search algorithm when we started looking for words instead of generating paths, most of the code could be reused (e.g. the Cell and CellSet classes). The same domain logic was still necessary, but it had to be composed differently.
- TDD always needs tests at the unit level, especially in an unfamiliar domain. When an algorithm doesn't work, the last thing you want to do is fire up a debugger or insert echo statements: even if it's possible to debug with print() in a test suite, unit tests will tell you exactly where the problem to fix is, along with the red bar, while end-to-end tests will only tell you "something's wrong".
Some test cases
If you want to try out this kata, here is a set of free test cases to get you started. I realize I'm doing some of the analysis work for you, but when there is just a timebox I prefer to hand out a series of scenarios to be tackled one at a time, letting coders concentrate on the interesting algorithmic aspects of the problem instead of on making up words.
Given a 1x1 grid
And an empty word set
Then no word is found

Given a 1x1 grid 'a'
And a word set containing 'a'
Then the 'a' word is found

Given a 2x1 grid 'ab'
And a word set containing 'a'
Then only the 'a' word is found

Given a 2x1 grid 'ab'
And a word set containing 'ab'
Then only the 'ab' word is found

Given a 2x1 grid 'ab'
And a word set containing 'a', 'ab'
Then the 'a' and the 'ab' words are found
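These scenarios translate almost mechanically into unit tests. Here is a sketch in Python against a hypothetical `solve(grid, word_set)` function; the name and the minimal inline implementation are mine, just to make the tests run:

```python
import unittest

def solve(grid, word_set):
    """Hypothetical solver: return the subset of word_set found in grid."""
    size_r, size_c = len(grid), len(grid[0])

    def found(word):
        def extend(r, c, i, used):
            if grid[r][c] != word[i]:
                return False
            if i == len(word) - 1:
                return True
            # Try every adjacent, not-yet-used cell for the next letter
            return any(
                extend(nr, nc, i + 1, used | {(r, c)})
                for nr in range(r - 1, r + 2)
                for nc in range(c - 1, c + 2)
                if (nr, nc) != (r, c)
                and 0 <= nr < size_r and 0 <= nc < size_c
                and (nr, nc) not in used
            )
        return any(extend(r, c, 0, set())
                   for r in range(size_r) for c in range(size_c))

    return {word for word in word_set if found(word)}

class RuzzleScenarios(unittest.TestCase):
    def test_empty_word_set_finds_nothing(self):
        self.assertEqual(solve(["a"], set()), set())

    def test_1x1_grid_finds_its_single_letter(self):
        self.assertEqual(solve(["a"], {"a"}), {"a"})

    def test_2x1_grid_finds_only_words_it_contains(self):
        self.assertEqual(solve(["ab"], {"a"}), {"a"})
        self.assertEqual(solve(["ab"], {"ab"}), {"ab"})
        self.assertEqual(solve(["ab"], {"a", "ab"}), {"a", "ab"})

if __name__ == "__main__":
    unittest.main(argv=["ruzzle-scenarios"], exit=False)
```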
You can then extend these scenarios to the 2x2, 3x3 and 4x4 cases, skipping intermediate steps depending on your implementation. Not only do the tests give feedback on the code; the code also tells you which tests to write next.
If you have suggestions on how best to break down the problem, feel free to propose additional scenarios to insert among these, to allow for shorter steps while refactoring and modifying the solution.