Solr Date Math, NOW and filter queries
Or “How to never re-use cached filter query results even though you meant to”:
Filter queries (“fq” clauses) are a means to restrict the number of documents that are considered for scoring. A common use of “fq” clauses is to restrict the dates of documents returned, things like “in the last day”, “in the last week” etc. You find this pattern often used in conjunction with faceting. Filter queries make use of a filterCache (see solrconfig.xml) to calculate the set of documents satisfying the query once and then re-use that result set. Often, using NOW in filter queries causes this caching to be useless. Here’s why.
Solr maintains a filterCache, where it stores the results of “fq” clauses. You can think of it as a map, where the key is the “fq” clause and the value is the set of documents satisfying that clause. I’m going to skip the details of how the document set (the “value” in this map) is stored, since this post is really concentrating on the key.
So, let’s say you have two filter queries (whether they’re in the same query or not is irrelevant), something like: “fq=category:books&fq=source:library”. There will be two entries in the filterCache, something like:
category:books => 1, 2, 5, 7, 89…
source:library => 2, 7, 45, 101…
All well and good so far. I’ll add one short diversion here. This bears on why it is often better to have several “fq” clauses than a single one. The same results could be obtained by “fq=category:books AND source:library”, but then the filter cache would hold a single entry containing only the documents that satisfy both clauses:
category:books AND source:library => 2, 7…
and an fq like “fq=category:books” would NOT re-use that entry, since its key is different. But enough of a diversion…
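To make the “map” picture concrete, here is a toy model in Python. It is only a sketch: the dict, the helper names and the document IDs are all made up for illustration, and Solr’s real filterCache is considerably more involved. The point is just that the lookup is by the fq string, so a combined clause is a different key from either of its parts.

```python
# Toy model of a filter cache: a dict keyed by the fq string.
# INDEX, the helper names and the document IDs are hypothetical.
INDEX = {
    "category:books": {1, 2, 5, 7, 89},
    "source:library": {2, 7, 45, 101},
}

filter_cache = {}
stats = {"hits": 0, "misses": 0}


def evaluate(fq):
    """Pretend to run the filter against the index (the expensive part)."""
    clauses = fq.split(" AND ")
    result = set(INDEX[clauses[0]])
    for clause in clauses[1:]:
        result &= INDEX[clause]  # AND means intersection
    return result


def cached_filter(fq):
    """Look the fq string up in the cache; compute and store it on a miss."""
    if fq in filter_cache:
        stats["hits"] += 1
    else:
        stats["misses"] += 1
        filter_cache[fq] = evaluate(fq)
    return filter_cache[fq]


# Two separate fq clauses: each one gets its own cache entry.
cached_filter("category:books")
cached_filter("source:library")
# A later request that only filters on category re-uses its entry.
cached_filter("category:books")
# The combined clause is a different key, so it has to be computed again.
cached_filter("category:books AND source:library")

print(stats)  # {'hits': 1, 'misses': 3}
```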
OK, you mentioned dates. Get to the point.
It’s common to have date ranges as filter queries, things like “in the last day”, “in the last week”, etc. And there’s the convenient date math to make this easy. So it’s very tempting to have filter clauses on date ranges like “fq=date:[NOW-1DAY TO NOW]”. Be careful when using NOW!
Here’s the problem. In the above example, date:[NOW-1DAY TO NOW] is not what’s used as the key for the fq in the filterCache; the expansion is used as the key. This translates into a form like:
“date:[2012-01-26T08:56:23Z TO 2012-01-27T08:56:23Z]”
for the key into the filter cache.
Now the user adds a term to the “q” and re-submits the query 30 seconds after the first one. The fq clause now looks something like:
“fq=date:[2012-01-26T08:56:53Z TO 2012-01-27T08:56:53Z]” (note that the seconds went from 23 to 53!).
The key for this fq does not match the key for the first, even though it’s often the case that the intent is that submitting this kind of fq 30 seconds later would result in the same set of documents matching the filter. Bare NOW entries in filter clauses will pretty much guarantee that the cached result sets will never be reused.
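Here is a small Python sketch of that effect. It mimics the expansion of date:[NOW-1DAY TO NOW] at two request times 30 seconds apart. The function name and the exact key format are my own assumptions for illustration (the timestamps are shown to the second to keep the strings readable); the only thing that matters is that the two expanded strings differ.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def expand_bare_now(now):
    """Mimic the expansion of date:[NOW-1DAY TO NOW] at request time `now`."""
    lower = now - timedelta(days=1)
    return f"date:[{lower.strftime(FMT)} TO {now.strftime(FMT)}]"


first_request = datetime(2012, 1, 27, 8, 56, 23, tzinfo=timezone.utc)
second_request = first_request + timedelta(seconds=30)

key1 = expand_bare_now(first_request)
key2 = expand_bare_now(second_request)

print(key1)  # date:[2012-01-26T08:56:23Z TO 2012-01-27T08:56:23Z]
print(key2)  # date:[2012-01-26T08:56:53Z TO 2012-01-27T08:56:53Z]
print(key1 == key2)  # False: the second request cannot re-use the first entry
```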
Fine. What do you do to make it better?
Here’s where rounding makes sense. Using midnight can make sense from two perspectives.
- The sense you often want is “anything with a timestamp in a particular day” (or month or year or hour or….). So just using NOW for the lower bound would miss anything published between midnight and whenever the user happens to submit the query on the day (in this example) of the lower bound.
- Re-using the filter cache can substantially speed up your queries, especially if you’re providing links like “in the last day”, “1-7 days ago” etc.
So your fq clauses start to look like “fq=date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]”. The thing to note about the date math “/” operator is that it is a “round down” operator. So let’s break this up a bit:
NOW/DAY rounds down to midnight last night.
-7DAYS subtracts 7 whole days. So the lower limit is really “midnight 7 days ago”.
Similarly, NOW/DAY rounds to midnight last night and +1DAY moves that to midnight tonight for the upper limit.
These clauses are invariant until after midnight tonight, so they will return the same results all day today. Only the first submission of this fq will incur the cost of figuring out which documents satisfy it; all the queries after the first will just read the cached result set from the filterCache. Of course the caches are invalidated if you update your index and/or a replication happens, but that’s always the case.
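The rounding is easy to check with a few lines of Python. This sketch mimics date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY] by rounding the request time down to midnight before doing the arithmetic. The helper names are hypothetical, and I’m assuming the rounding happens in UTC.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def round_down_to_day(ts):
    """The equivalent of NOW/DAY: drop everything below the day."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def expand_rounded(now):
    """Mimic the expansion of date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]."""
    midnight = round_down_to_day(now)
    lower = midnight - timedelta(days=7)
    upper = midnight + timedelta(days=1)
    return f"date:[{lower.strftime(FMT)} TO {upper.strftime(FMT)}]"


morning = datetime(2012, 1, 27, 8, 56, 23, tzinfo=timezone.utc)
evening = datetime(2012, 1, 27, 23, 59, 59, tzinfo=timezone.utc)
next_day = datetime(2012, 1, 28, 0, 0, 1, tzinfo=timezone.utc)

print(expand_rounded(morning))  # date:[2012-01-20T00:00:00Z TO 2012-01-28T00:00:00Z]
print(expand_rounded(morning) == expand_rounded(evening))   # True: same key all day
print(expand_rounded(morning) == expand_rounded(next_day))  # False: new key after midnight
```

Every request made on the same day expands to the same string, so only the first one pays for computing the document set.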
You will note that there is a bit of “slop” here. If your index has dates in the future, you may get them too. Suppose you have a situation where your index contains documents you don’t want the users to see until it’s later than their timestamp. I actually have a hard time contriving an example here, but let’s just assume it’s the case. Also say it’s noon and your index contains timestamps on documents through midnight tonight. The above technique will show documents that will be officially published at, say, 15:00 even though it’s only 12:00 and you may not want that. In that case, you’ll have to use a bare NOW clause and live with the fact that your cache isn’t being used for these clauses. Like I said, this is contrived, but I mention it for completeness’ sake.
A couple of notes about dates:
Before I finish, a couple of notes about dates.
- Use the coarsest dates you can and still satisfy your use case. This is especially true if you’re sorting by dates. The sorting resource requirements go up with the number of unique terms, so storing millisecond resolution when all you care about is the day can be wasteful. This is also true when faceting.
- It’s often useful to index multiple fields with some date data, especially if you intend to facet.
- The above examples in the 3.x code line have a slight problem when more than one adjoining range is required. The range operator “[]” is inclusive, so if you have a document indexed at exactly midnight in these examples, it might be included in two ranges. Trunk Solr (4.0) allows mixing inclusive “[]” and exclusive “{}” endpoints, so expressions like “date:[NOW/DAY-1DAY TO NOW/DAY+1DAY}” are possible.
- An exercise for the reader: What are the consequences of using different kinds of rounding? E.g. NOW/5MIN, NOW/72MIN (does this even work?).
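To see why the inclusive endpoints matter for adjoining ranges, here is a short Python sketch. It checks a document stamped at exactly midnight against two neighbouring day ranges, first with both ends inclusive (the “[]” form) and then with an exclusive upper end (the “[ … }” form). The `in_range` helper is made up for illustration and is not Solr code.

```python
from datetime import datetime, timezone


def in_range(ts, lower, upper, include_upper):
    """Lower bound is always inclusive; the upper bound only if requested."""
    if include_upper:
        return lower <= ts <= upper
    return lower <= ts < upper


day1 = datetime(2012, 1, 26, tzinfo=timezone.utc)
day2 = datetime(2012, 1, 27, tzinfo=timezone.utc)
day3 = datetime(2012, 1, 28, tzinfo=timezone.utc)

doc = day2  # a document indexed at exactly midnight

# Fully inclusive ranges: [day1 TO day2] and [day2 TO day3]
inclusive_hits = [
    in_range(doc, day1, day2, include_upper=True),
    in_range(doc, day2, day3, include_upper=True),
]
# Half-open ranges: [day1 TO day2} and [day2 TO day3}
half_open_hits = [
    in_range(doc, day1, day2, include_upper=False),
    in_range(doc, day2, day3, include_upper=False),
]

print(sum(inclusive_hits))  # 2: the document is counted in both ranges
print(sum(half_open_hits))  # 1: the document lands in exactly one range
```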
Published at DZone with permission of Erick Erickson. See the original article here.