This week has been a productive week. Suddenly, there are three exciting new
features coming to Lucene.
The first feature, committed this week, is the new expressions module. This allows you to define a dynamic field for sorting using an arbitrary String expression. There is built-in support for parsing JavaScript, but the parser is pluggable if you want to create your own syntax.
For example, if you want to offer a blended sort primarily by relevance and boosting by a popularity field, you could define a sort field using the following expression:
sqrt(_score) + ln(popularity)
The code is very easy to use; there are some nice examples in the TestDemoExpressions.java unit test case. The module will be available in Lucene's next stable release (4.6).
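To make the arithmetic behind that expression concrete, here is a minimal plain-Java sketch that evaluates sqrt(_score) + ln(popularity) per hit and sorts by the result. It does not use the expressions module at all; the class BlendedSortSketch, the Hit type and the sample values are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this does not use Lucene's expressions module.
class BlendedSortSketch {

    // Hypothetical search hit: a relevance score plus an indexed popularity value.
    static final class Hit {
        final String id;
        final double score;
        final long popularity;

        Hit(String id, double score, long popularity) {
            this.id = id;
            this.score = score;
            this.popularity = popularity;
        }
    }

    // Evaluates sqrt(_score) + ln(popularity) for one hit.
    // Note that a popularity of 0 yields negative infinity.
    static double blended(Hit hit) {
        return Math.sqrt(hit.score) + Math.log(hit.popularity);
    }

    // Returns the hit ids ordered by descending blended value.
    static List<String> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((x, y) -> Double.compare(blended(y), blended(x)));
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a", 9.0, 1));    // sqrt(9) + ln(1)   = 3.0
        hits.add(new Hit("b", 4.0, 100));  // sqrt(4) + ln(100) = about 6.6
        hits.add(new Hit("c", 1.0, 10));   // sqrt(1) + ln(10)  = about 3.3
        System.out.println(rank(hits));    // prints [b, c, a]
    }
}
```

The most relevant hit ("a") ends up last because its popularity is so low, which is exactly the kind of blending the expression is meant to give you.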
Updateable Numeric Doc-values Fields
The second feature, also committed this week, is updateable numeric doc-values fields, which let you change previously indexed numeric values using the new updateNumericDocValue method in IndexWriter. It works fine with near-real-time readers, so you can update the numeric values for a few documents and then re-open a new near-real-time reader to see the changes.
The feature is currently trunk only as we work out a few remaining issues involving a particularly controversial boolean. It also currently does not work on sparse fields, so you can only update a document's value if that document had already indexed that field in the first place.
Combined, these two features enable powerful use-cases where you want to sort by a blended field that is changing over time. For example, perhaps you measure how often your users click through each document in the search results, and then use that to update the popularity field, which is then used for a blended sort. This way the rankings of the search results change over time as you learn from users which documents are popular and which are not.
Of course, such a feature was always possible before using custom external code, but with both expressions and updateable doc-values now available, it becomes trivial to implement!
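Here is a rough sketch of that feedback loop, again in plain Java rather than against Lucene itself. A HashMap stands in for the numeric doc-values field, and the class ClickFeedbackSketch with its methods is hypothetical; in a real index the update would go through IndexWriter and you would re-open a near-real-time reader before searching again.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the idea only; Lucene's IndexWriter and readers are not used here.
class ClickFeedbackSketch {
    // Stand-in for a per-document numeric doc-values field, keyed by a hypothetical doc id.
    private final Map<String, Long> popularity = new HashMap<>();
    // Stand-in for each document's relevance score for one fixed query.
    private final Map<String, Double> score = new HashMap<>();

    void addDocument(String id, double relevance, long initialPopularity) {
        score.put(id, relevance);
        popularity.put(id, initialPopularity);
    }

    // Adds click-throughs to the stored popularity value.
    // Returns false for unknown documents, mirroring the limitation that only
    // documents which already have a value for the field can be updated.
    boolean recordClicks(String id, long clicks) {
        Long current = popularity.get(id);
        if (current == null) {
            return false;
        }
        popularity.put(id, current + clicks);
        return true;
    }

    // Ranks documents by sqrt(score) + ln(popularity), highest first.
    List<String> rank() {
        List<String> ids = new ArrayList<>(score.keySet());
        ids.sort((x, y) -> Double.compare(blended(y), blended(x)));
        return ids;
    }

    private double blended(String id) {
        return Math.sqrt(score.get(id)) + Math.log(popularity.get(id));
    }

    public static void main(String[] args) {
        ClickFeedbackSketch index = new ClickFeedbackSketch();
        index.addDocument("a", 9.0, 10);   // 3 + ln(10) = about 5.3
        index.addDocument("b", 4.0, 10);   // 2 + ln(10) = about 4.3
        System.out.println(index.rank());  // prints [a, b]

        index.recordClicks("b", 90);       // popularity of b becomes 100
        System.out.println(index.rank());  // prints [b, a], since 2 + ln(100) = about 6.6
    }
}
```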
Free Text Suggestions
Finally, the third feature is a new suggester implementation, FreeTextSuggester. It is a very different suggester than the existing ones; rather than suggest from a finite universe of pre-built suggestions, it uses a simple ngram language model to predict the "long tail" of possible suggestions based on the one or two previous tokens.
Under the hood, it uses ShingleFilter to create the ngrams, and an FST to store and look up the resulting ngram models. While multiple ngram models are stored compactly in a single FST, the FST can still get quite large; the 3-gram, 2-gram and 1-gram model built on the AOL query logs is 19.4 MB (the queries themselves are 25.4 MB). This was inspired by Google's approach.
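To give a feel for how an ngram model predicts from the previous one or two tokens, here is a small plain-Java sketch. It is not how FreeTextSuggester is implemented: a HashMap stands in for the FST, whitespace splitting stands in for the analyzer and ShingleFilter, and the prediction simply falls back from the 3-gram to the 2-gram to the 1-gram counts and returns a whole next token rather than completing a partially typed one. The class NgramSuggestSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration; a HashMap stands in for the FST and whitespace
// splitting stands in for the analysis chain.
class NgramSuggestSketch {
    // Maps a context ("" for 1-grams, "w1" for 2-grams, "w1 w2" for 3-grams)
    // to the counts of the tokens that followed it.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Adds every 1-, 2- and 3-gram of one query to the model.
    void add(String query) {
        List<String> t = tokenize(query);
        for (int i = 0; i < t.size(); i++) {
            bump("", t.get(i));
            if (i >= 1) bump(t.get(i - 1), t.get(i));
            if (i >= 2) bump(t.get(i - 2) + " " + t.get(i - 1), t.get(i));
        }
    }

    private void bump(String context, String next) {
        counts.computeIfAbsent(context, k -> new HashMap<>()).merge(next, 1, Integer::sum);
    }

    // Predicts the next token from the two previous tokens if that context was
    // seen, else from one previous token, else from plain token frequency.
    // Returns null only when the model is empty.
    String predictNext(String typed) {
        List<String> t = tokenize(typed);
        int n = t.size();
        List<String> contexts = new ArrayList<>();
        if (n >= 2) contexts.add(t.get(n - 2) + " " + t.get(n - 1));
        if (n >= 1) contexts.add(t.get(n - 1));
        contexts.add("");
        for (String context : contexts) {
            Map<String, Integer> next = counts.get(context);
            if (next != null) {
                return mostFrequent(next);
            }
        }
        return null;
    }

    // Picks the highest count; ties are broken alphabetically so results are stable.
    private static String mostFrequent(Map<String, Integer> next) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : next.entrySet()) {
            int c = e.getValue();
            if (c > bestCount || (c == bestCount && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
                bestCount = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NgramSuggestSketch model = new NgramSuggestSketch();
        model.add("the fast and the furious");
        model.add("the fast and the furious");
        model.add("fast food near me");
        model.add("the fast lane");
        System.out.println(model.predictNext("and the"));      // 3-gram context: furious
        System.out.println(model.predictNext("burning the"));  // falls back to 2-gram: fast
        System.out.println(model.predictNext("zzz"));          // falls back to 1-gram: the
    }
}
```

Even this toy version shows why the model grows quickly: every query contributes entries at all three ngram orders.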
Likely, this suggester would not be used by itself, but rather as a fallback when your primary suggester failed to find any suggestions; you can see this behavior with Google. Try searching for "the fast and the ," and you will see the suggestions are still full queries. However, if the next word you type is "burning," then suddenly Google (so far!) does not have a full suggestion and falls back to its free text approach.