DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workkloads.

Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Using Envoy Proxy’s PostgreSQL and TCP Filters to Collect Yugabyte SQL Statistics
  • How Milvus Realizes the Delete Function
  • Text Analysis Within a Full-Text Search Engine
  • Fixing Common Oracle Database Problems

Trending

  • Analyzing “java.lang.OutOfMemoryError: Failed to create a thread” Error
  • Scalability 101: How to Build, Measure, and Improve It
  • The Role of Artificial Intelligence in Climate Change Mitigation
  • Mastering React App Configuration With Webpack
  1. DZone
  2. Data Engineering
  3. Databases
  4. Solr Date Math, NOW and filter queries

Solr Date Math, NOW and filter queries

By 
Erick Erickson user avatar
Erick Erickson
·
Feb. 27, 12 · Interview
Likes (0)
Comment
Save
Tweet
Share
51.9K Views

Join the DZone community and get the full member experience.

Join For Free

Or “How to never re-use cached filter query results even though you meant to”:

Filter queries (“fq” clauses) are a means to restrict the number of documents that are considered for scoring. A common use of “fq” clauses is to restrict the dates of documents returned, things like “in the last day”, “in the last week” etc. You find this pattern often used in conjunction with faceting. Filter queries make use of a filterCache (see solrconfig.xml) to calculate the set of documents satisfying the query once and then re-use that result set. Often, using NOW in filter queries causes this caching to be useless. Here’s why.

Solr maintains a filterCache, where it stores the results of “fq” clauses. You can think of it as a map, where the key is the “fq” clause and the value is the set of documents satisfying that clause. I’m going to skip the details of how the document set (the “value” in this map) is stored, since this post is really concentrating on the key.

So, let’s say you have two filter queries (whether they’re in the same query or not is irrelevant), something like: “fq=category:books&fq=source:library”. There will be two entries in the filterCache, something like:

category:books => 1, 2, 5, 89…
source:library => 7, 45, 101…

All well and good so far. I’ll add one short diversion here. This bears on why it is often better to have several “fq” clauses than a single one. The same results could be obtained by “fq=category:books AND source:library”, but then the filter cache would look like:

category:books AND source:library => 1, 2, 5, 7, 45, 89, 101…..

and an fq like “fq=category:books” would NOT re-use the cache since the key is much different. But enough of a diversion…

OK, you mentioned dates. Get to the point.

It’s common to have date ranges as filter queries, things like “in the last day”, “in the last week”, etc. And there’s the convenient date math to make this easy. So it’s tempting, very tempting to have filter clauses on date ranges like “fq=date:[NOW-1DAY TO NOW]“. Be careful when using NOW!

Here’s the problem. In the above example, date:[NOW-1DAY TO NOW] is not what’s used as the key for the fq in the filterCache, the expansion is used as the key. This translates into a form like:
“date:[2012-01-20T08:56:23Z TO 2012-01-27T8:56:23Z]” for the key into the filter cache.

Now the user adds a term to the “q” and re-submits the query 30 seconds after the first one. The fq clause now looks something like:

“fq=date:[2012-01-20T08:56:53Z TO 2012-01-27T8:56:53Z]” note that the seconds went from 23 to 53!

The key for this fq does not match the key for the first, even though it’s often the case that the intent is that submitting this kind of fq 30 seconds later would result in the same set of documents matching the filter. Bare NOW entries in filter clauses will pretty much guarantee that the cached result sets will never be reused.

Fine. What do you do to make it better?

Here’s where rounding makes sense. Using midnight can make sense from two perspectives.

  • The sense you often want is “anything with a timestamp in a particular day” (or month or year or hour or….). So just using NOW for the lower bound would miss anything published between midnight and whenever the user happens to submit the query on the day (in this example) of the lower bound.
  • Re-using the filter cache can substantially speed up your queries, especially if you’re providing links like “in the last day”, “1-7 days ago” etc.

So your fq clauses start to look like “fq=date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]“. The thing to note about the date math “/” operator is that is is a “round down” operator. So let’s break this up a bit:

NOW/DAY
rounds down to midnight last night.

-7DAYS subtracts 7 whole days. So the lower limit is really “midnight 7 days ago”.

Similarly, NOW/DAY rounds to midnight last night and +1DAY moves that to midnight tonight for the upper limit.

These clauses are invariant until after midnight tonight so these clauses will return the same results all day today, and only the first submission of this fq will incur the cost of figuring out which documents satisfy it, all the queries after the first will just read the cached result set from the filterCache. Of course the caches are invalidated if you update your index and/or a replication happens, but that’s always the case.

You will note that there is a bit of “slop” here. If your index has dates in the future, you may get them too. Suppose you have a situation where your index contains documents you don’t want the users to see until it’s later than their timestamp. I actually have a hard time contriving an example here, but let’s just assume it’s the case. Also say it’s noon and your index contains timestamps on documents through midnight tonight. The above technique will show documents that will be officially published at, say, 15:00 even though it’s only 12:00 and you may not want that. In that case, you’ll have to use a bare NOW clause and live with the fact that your cache isn’t being used for these clauses. Like I said, this is contrived, but I mention it for completeness’ sake.

A couple of notes about dates:

Before I finish, a couple of notes about dates.

  • Use the coarsest dates you can and still satisfy your use case. This is especially true if you’re sorting by dates. The sorting resource requirements go up by the number of unique terms. So storing millisecond resolution when all you care about is day can be wasteful. This is also true when faceting.
  • It’s often useful to index multiple fields with some date data, especially if you intend to facet.
  • The above examples in the 3.x code line have a slight problem when more than one adjoining range is required. The range operator “[]” is inclusive, so if you have a document indexed at exactly midnight in these examples, it might be included in two ranges. Trunk Solr (4.0) allows mixing inclusive “[]” and exclusive “{}” endpoints, so expressions like “date:[NOW/DAY-1DAY TO NOW/DAY+1DAY}” are possible.
  • An exercise for the reader: What are the consequences of using different kinds of rounding? E.g. NOW/5MIN, NOW/72MIN (does this even work?).

Database Filter (software)

Published at DZone with permission of Erick Erickson. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Using Envoy Proxy’s PostgreSQL and TCP Filters to Collect Yugabyte SQL Statistics
  • How Milvus Realizes the Delete Function
  • Text Analysis Within a Full-Text Search Engine
  • Fixing Common Oracle Database Problems

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!