In our recent and continuing effort to make the world a better place, we have been working with the illustrious Waldo Jaquith on a project called StateDecoded. Basically, we’re making laws easily searchable and accessible by the layperson. Here, check out the predecessor project, Virginia Decoded. StateDecoded will be similar to the predecessor but with extended and matured search capabilities. And instead of just Virginia state law, with StateDecoded, any municipality will be able to download the open source project index their own laws and give their citizens better visibility in to the rules that govern them.
For this post, though, I want to focus upon one of the good Solr riddlers that we encountered related to the hierarchical nature of the documents being indexed. Laws are divided into sections, chapters, and paragraphs and we have documents at every level. In our Solr, this hierarchy is captured in a field labeled “section”. So for instance, here are 3 examples of this section field:
<field name="section">30</field>– A document that contains information specific to section 30.
<field name="section">30.4</field>– A document that contains information specific to section 30 chapter 4.
<field name="section">30.4.15</field>– A document that contains information specific to section 30 chapter 4 paragraph 15.
And our goal for this field is that if anyone searches for a particular section of law, that they will be given the law most specific to their request followed by the laws that are less specific. For instance, if a user searches for “30.4″, then the results should contain the documents for section 30, section 30.4, section 30.4.15, section 30.4.16, etc., and the first result should be for 30.4. Other documents such as 40.4 should not be returned.
Initially we tackled the problem with the PathHierarchyTokenizerFactory.
Using this analyzer, here’s how several example documents get tokenized:
And if anyone searches for a law, the same tokenization occurs. So that if someone wants to look at law “30.4″ then this gets tokenized to <30> and <30.4>. In this case the first 5 documents match. In particular, documents 30.4, 30.4.15, and 30.4.16 all have two matching tokens while 30 and 30.5 have one matching token. Additionally, because of the field length normalization, a match in a shorter field – say 30.4 with 2 tokens gets boosted higher than a match on a field with 3 tokens such as 30.4.15 or 30.4.16.
All things considered, the resulting documents should come in the following order:
BUT, as fate would have it, our perfect plan has a hole in it somewhere and the documents come back in this order:
So… what’s up?! (Solr experts reading this, do you know where our theory breaks down?)
Our theory is actually sound, but the problem lies in the practical matter of storing the field norm. Each field of every document is stored with an additional piece of information called the field norm. Ideally, the field norm would be able to hold very specific information about how long a document is, but the more accurate the field norm is, the more space you need to store that information. In the end, the Lucene developers decided to store the field norm as a single byte. This means that the field norm can store exactly 256 different values. And if you look at the Lucene TFIDFSimilarity class (the class responsible for dealing with field norms and scoring documents) you can see exactly where these byte values are getting translated into floating point values. There’s even a comment that gives a hint into what our problem might be:
The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7×10^9 to 2×10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.
Because of this lack of precision, the field norm for fields of length 2 and length 3 is the same! … And that’s why we get the weird ordering. The first three documents have the same score and are therefore returned in index ordering.
So, how do we fix this? Well, one way would be to implement our own Similarity class that uses a different byte-to-float conversion than the default. But one of the main goals for this particular project is to keep this installation of Solr as simple as possible so that anyone can set it up. Therefore, we found that an easier solution was to use a slightly different analysis technique. Consider the following schema fragments:
<field name="section_descendent" type="descendent_path" indexed="true" stored="true"/> <field name="section_ancestor" type="ancestor_path" indexed="true" stored="true"/>and
<fieldType name="descendent_path" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" /> </analyzer> <analyzer type="query"> <tokenizer class="solr.KeywordTokenizerFactory" /> </analyzer> </fieldType> <fieldType name="ancestor_path" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.KeywordTokenizerFactory" /> </analyzer> <analyzer type="query"> <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" /> </analyzer> </fieldType>
In this case we’re splitting up section into two fields, section_descendent and section_ancestor which are tokenized slightly differently than before. The difference is that in the previous example, the PathHierarchyTokenizer was used on both index and query. But in this case descendent_path uses the PathHierarchyTokenizer only at index time and ancestor_path uses the PathHierarchyTokenizer only at query time. (The KeywordTokenizer tokenizes the entire field as a single token.) As a result section_descendent will only match queries for sections that are a descendent of the query – so 30.4 will match 30.4 and 30.4.15 and 30.4.16, but not 30. And similarly section_ancestor will only match queries for sections that are an ancestor of the query – so 30.4 will match 30.4 and 30, but not 30.4.15.
The final piece of the puzzle is the request handler. Here we simply use an edismax request parser with a qf (query fields) set to include both descendent_path ancestor_path. Now, whenever someone queries for 30.4, then in the descendent_path they get matches on 30.4, 30.4.15, and 30.4.16; and in the ancestor_path they get matches on 30.4 and 30. The only document that matches in both fields is 30.4 thus it gets an extra boost and appears at the top of the search results followed by the other documents. Problem solved!
Now it is interesting to note that we’ve lost sibling documents (for instance 30.5 in this example). Personally, that seems fine to me, but if we really wanted it, all we have to do is again include the original section field path-tokenized on both query and at index.