More NOW evil
Prompted by a subtle issue a client raised, I was thinking about date boosting. According to the Wiki, a good way to boost by date is by something like the following:
(see: date boosting link). And this works well, no question.
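For readers who want to see the arithmetic, here is a minimal sketch of what a reciprocal date boost computes. It assumes the form commonly quoted in the Solr documentation, recip(ms(NOW,field),3.16e-11,1,1), where recip(x,m,a,b) is a/(m*x+b). The constants and the Python helper names are illustrative assumptions, not something taken from this post.

```python
from datetime import datetime, timedelta, timezone


def recip(x, m, a, b):
    """Reciprocal function as described in the Solr docs: a / (m*x + b)."""
    return a / (m * x + b)


def date_boost(now, doc_date, m=3.16e-11, a=1.0, b=1.0):
    """Boost for a document dated doc_date, evaluated at time now.

    The default constants are assumptions; 3.16e-11 is roughly
    1 / (milliseconds in a year), so a year-old document gets about 0.5.
    """
    age_ms = (now - doc_date).total_seconds() * 1000.0
    return recip(age_ms, m, a, b)


now = datetime(2012, 3, 28, 10, 30, 29, tzinfo=timezone.utc)
boost_new = date_boost(now, now)                         # brand-new document
boost_year = date_boost(now, now - timedelta(days=365))  # one year old
```

The boost depends on the exact value of now, which is what the rest of this post is about.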
However, there’s a subtle issue when paging. NOW evaluates to the current time, and every subsequent request will have a different value for NOW. This blog post about the effects of this on filter queries provides some useful background.
What does that have to do with date boosting?
Imagine that you have multiple pages of results. Typically, one constructs a series of page links to get to subsequent pages, something like
http://your solr addr/select?q=searchterms&start=10&rows=10
But you need to add the date boosting too, right? So each of these URLs will have the date boost from above appended (or you may have it in your default params in solrconfig.xml). And here’s where this fragment causes some “interesting” behavior: ms(NOW,manufacturedate_dt)
There are two issues here.
- You can actually repeat or skip results as you page. This is due to the “bucketing” of results. A few seconds can change the boost calculations just enough to cause some documents to be skipped or repeated as you page.
- Your queryResultCache is useless.
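To make the first issue concrete, here is a deliberately contrived sketch of one way a moving NOW can repeat and skip results. Two documents are set up to tie at a chosen instant; page 1 is requested just before that instant and page 2 just after. The scoring model (base score times a reciprocal boost), the constants, and all names are assumptions for illustration, not Solr's actual scoring code.

```python
M = 3.16e-11          # assumed slope of the reciprocal boost
DAY = 86_400_000      # milliseconds in a day


def boost(now_ms, doc_ms):
    """Reciprocal date boost: 1 / (M * age_in_ms + 1)."""
    return 1.0 / (M * (now_ms - doc_ms) + 1.0)


def page(docs, now_ms, start, rows):
    """Rank docs by base * boost at now_ms, then return one page of ids."""
    ranked = sorted(
        docs,
        key=lambda d: d["base"] * boost(now_ms, d["date_ms"]),
        reverse=True,
    )
    return [d["id"] for d in ranked[start:start + rows]]


t_tie = 400 * DAY  # instant at which A and B score the same (contrived)
doc_a = {"id": "A", "date_ms": 0, "base": 1.0}
doc_b = {"id": "B", "date_ms": 399 * DAY}
# Choose B's base score so that the two documents tie exactly at t_tie.
doc_b["base"] = boost(t_tie, doc_a["date_ms"]) / boost(t_tie, doc_b["date_ms"])
docs = [doc_a, doc_b]

# The user loads page 1, reads for five seconds, then clicks "next".
page1 = page(docs, t_tie - 2500, start=0, rows=1)
page2 = page(docs, t_tie + 2500, start=1, rows=1)
```

In this setup both pages contain B and A is never shown, because the ordering flipped between the two requests. Real collisions need near-ties like this one, which is why the effect is rare.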
A quick review of queryResultCache
The queryResultCache is just a map from a query to an ordered list of some number of documents, the results of that search. How many documents are kept in the cache is configurable in solrconfig.xml, and typically people store 2 or 3 pages of results per query. This is adequate to handle the usual user experience; rarely do users page to the second page, much less the third. When a page request comes in such that the results aren’t in the queryResultCache, the query is re-executed.
But, critically for this discussion, the use of NOW in date boosting means that no query that uses date boosting is ever fetched from the queryResultCache!
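A toy model shows why. Suppose the cache is keyed on the query after NOW has been resolved to a concrete time. This is a simplification for illustration and not Solr's actual cache implementation; the class and function names are hypothetical.

```python
def resolve_now(query, now_ms):
    """Substitute a concrete millisecond timestamp for NOW (toy model)."""
    return query.replace("NOW", str(now_ms))


class ToyResultCache:
    """A dict from resolved query string to an ordered list of doc ids."""

    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, resolved_query, compute):
        if resolved_query in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[resolved_query] = compute()
        return self.entries[resolved_query]


cache = ToyResultCache()
fragment = "recip(ms(NOW,manufacturedate_dt),3.16e-11,1,1)"

# Three page requests for the "same" query, a few seconds apart.
for now_ms in (1332930629000, 1332930634000, 1332930641000):
    cache.get(resolve_now(fragment, now_ms), lambda: ["doc1", "doc2"])
```

Every request resolves to a different key, so in this model the cache records three misses and no hits.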
I’m exaggerating a bit. It’s possible to do limited “date math” inside the date boost function, with things like …ms(NOW/MINUTE,manufacturedate_dt)…. Using this technique reduces the problem, but doesn’t eliminate it: requests that fall within the same minute see the same value, while requests that straddle a minute boundary do not.
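The effect of rounding can be sketched with plain integer arithmetic. This assumes NOW/MINUTE simply floors the current time to the start of the minute; the helper name and the sample timestamps are made up for illustration.

```python
def floor_to_minute(now_ms):
    """Round a millisecond timestamp down to the start of its minute."""
    return now_ms - (now_ms % 60_000)


t0 = 1332930629000             # 2012-03-28T10:30:29Z
same_minute = t0 + 20_000      # 10:30:49, same minute as t0
next_minute = t0 + 40_000      # 10:31:09, one minute later

rounded_t0 = floor_to_minute(t0)
rounded_same = floor_to_minute(same_minute)
rounded_next = floor_to_minute(next_minute)
```

Here rounded_t0 and rounded_same are identical, so two page requests in that window would see the same boost values, while rounded_next differs by 60,000 ms.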
What can be done?
I haven’t thought of a clean way to change the Solr query process to handle this. I can imagine a new parameter like “nowIs=2012-03-28T10:30:29Z”, with the understanding that all references to NOW in the query get this substituted, but that feels kludgy. Not to mention that doing this right would touch lots of places. And I guarantee that it would be much harder to get right than I think…
Another possibility is to limit the problem. Using an expression like NOW/DAY+1DAY would confine the problem to page requests that span midnight. Note that this will affect the scoring of documents put in the index today. Also note that if you try this on a raw URL, you need to url-escape the ‘+’ as %2B.
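Here is a small sketch of both points: what NOW/DAY+1DAY evaluates to, and how the ‘+’ gets escaped. It assumes the rounding happens in UTC, and it reuses the reciprocal boost form from the Solr documentation as an assumed example; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote


def end_of_today(now):
    """Equivalent of NOW/DAY+1DAY: midnight at the start of tomorrow (UTC)."""
    start_of_day = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return start_of_day + timedelta(days=1)


morning = datetime(2012, 3, 28, 10, 30, 29, tzinfo=timezone.utc)
late = datetime(2012, 3, 28, 23, 59, 59, tzinfo=timezone.utc)
after_midnight = datetime(2012, 3, 29, 0, 0, 1, tzinfo=timezone.utc)

# Assumed boost fragment; only the '+' needs attention in a raw URL.
boost = "recip(ms(NOW/DAY+1DAY,manufacturedate_dt),3.16e-11,1,1)"
escaped = quote(boost, safe="(),/")
```

Every request made on 2012-03-28 gets the same value from end_of_today, and only a request after midnight sees a new one. The escaped string carries %2B in place of the ‘+’, which would otherwise be decoded as a space.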
A third possibility is to use the fact that Solr happily ignores URL parameters it doesn’t understand. You could create a custom QueryComponent and do the substitutions there. This gives you the possibility of recognizing that your index has changed and re-executing the query in that case. There are some interesting new capabilities coming up in the SearcherLifetimeManager (see Mike McCandless’ blog post here) that could help as well, although I haven’t looked at it closely. One could perhaps just write a custom query component that recognized “ms(NOW” and substituted the formatted time into the query, but anything that simple would probably have unexpected side effects.
Another solution is to simply construct your paging URLs with a raw time rather than NOW. This would look like:
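A hedged sketch of the idea: capture the time once when the first page is requested, then bake that literal timestamp into every paging link. It assumes ms() accepts a literal ISO-8601 timestamp in place of NOW, as the Solr function query documentation describes, and it attaches the boost with the {!boost b=...} form from the Solr FAQ. The base URL, search terms, and function name are placeholders, so verify the query syntax against your own Solr version.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def paging_urls(base_url, terms, fixed_now, pages, rows=10):
    """Build paging links that all share one literal timestamp."""
    stamp = fixed_now.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Assumed boost form; the literal timestamp replaces NOW.
    boost = f"recip(ms({stamp},manufacturedate_dt),3.16e-11,1,1)"
    urls = []
    for page_number in range(pages):
        params = {
            "q": "{!boost b=" + boost + "}" + terms,
            "start": page_number * rows,
            "rows": rows,
        }
        urls.append(base_url + "?" + urlencode(params))
    return urls


first_request = datetime(2012, 3, 28, 10, 30, 29, tzinfo=timezone.utc)
urls = paging_urls(
    "http://localhost:8983/solr/select",  # placeholder address
    "searchterms",
    first_request,
    pages=3,
)
```

Because every link carries the same timestamp, the boost values are identical from page to page, and the query string is identical apart from start, so later pages at least have a chance of being served from the queryResultCache.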
The easiest solution is to ignore the problem entirely. Mainly I’m posting this as an interesting dive into the subtleties with NOW, and how it can produce effects you don’t anticipate. If you’re interested in squeezing every last bit of performance out of your Solr instances, and you do heavy boosting by date, you might want to address this “problem”.
But except for date rounding (e.g. using NOW/DAY+1DAY rather than a bare NOW), I’d never do this kind of thing unless I had absolute proof that I needed to because:
- Any solution that implements this kind of process will take time and effort you could put into other parts of your application.
- In most applications, your users will never notice anyway. The only time this shows up is when you page and you happen to hit an edge case. Users rarely go to even the second page of search results so it’s a vanishingly small ROI for the coding/QA effort unless and until there is a demonstrated need.