I talked about the Lucene formula and how it calculate things using tf-idf for best matches. Now I want to talk about the actual sorting implementation. As it turned out, the default sorting (by relevancy) is really simple. All you need is to get the relevant score for a query, then you shove the results through a heap with a specified size. The heap will take care of maintain the top results.
So far, that is pretty simple to understand. But the question is, how do you do sorting on a field value? The answer is, not easily.
GetStringIndex() does something very interesting. I returns a string index, which gives us:
- A string array containing all the distinct (sorted) value for this index.
- A int array with all the documents in the index, with the position of the value of that field in the string value array
Now we can compare fields by their field position on the field, which give us pretty good sorting. Unfortunately, this also require us to load all the values to memory. Let us see another example, which would probably be easier to follow:
Sorting by an integer is done like this:
Get an array (whose size match the number of documents), We can then sort things easily because accessing the relevant field value only require us to have the document id to index into the array.
The reason Lucene does this is that it uses an inverted index, and it has no easy way of going from the field values to the list documents it has. So it is easier to read all the values into memory and work with them there. I don’t like it, but off hand, I can’t think of a better way to handle this.