Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

MongoDB Indexing Tip #2: How to Implement Faceted Search

DZone's Guide to

MongoDB Indexing Tip #2: How to Implement Faceted Search

· Java Zone
Free Resource

Just released, a free O’Reilly book on Reactive Microsystems: The Evolution of Microservices at Scale. Brought to you in partnership with Lightbend.

Make sure you read up on the previous tip to understand the following post!

The Feature

Faceted search is any type of search that can involve a wide variety of predicates and sort criteria. It typically involves a web UI where the user can pick from many fields (user, account id, date range, etc) and also sort on most of those fields. It is a hard problem to solve for a database, since in theory you would have to create all the possible combinations of compound indices.

The goals are:

  • Keep number of indices low, cannot end up with all combinations
  • Performance must be fast, in millisecond range.
  • Performance must stay constant over time (same with 10k or 1 billion documents). Often, faceted search seems fast during initial development but degrades quickly in production

First Step: Understand the Requirements

An important first step is to identify the fields that are required for any search. Those will be the anchor to your indices. A typical example would be customer id, account id, project id, etc.

The 2nd step is to define an acceptable default time range that the search should use (e.g. 6 months). Even though the search may be applied since the beginning of times, you should define how far you want to look back by default, and that will make the default screen not get slower over time.

Last, you need to decide which fields are actually useful to sort on. You will need at least one index per sortable field, so while 10 fields is doable, 100 is not. Any extra index will take more RAM, storage and slow down the write operations.

All above points may force you to change the available features of the system and can make project managers and users slightly disappointed. But they will thank you when the page loads in milliseconds or seconds rather than minutes!

Indexing Strategy

The index should be structured as follows:

  • Start with the required fields (e.g. customer id). Those fields will use either an equality or a “$in“ predicate in the query.
  • Next is a coarse date, for example the day or the month. You will probably need to create an extra field for this, it is important not to use the millisecond time. Those fields will use an equality or “$in“ predicate in the query.
  • Next is the sort field, which can be very granular.
  • Last, any other field that you want “covered” by the index. Those fields will not be used efficiently but still will be readily available in the index instead of requiring an inspection of the document.

If sorting on “price“ for a customer id “cId“, the index may look like:

{ cId:1 , month: 1, price: 1, field1: 1, field2: 1, … }

To look at the 100 highest prices over 3 months for a few customers, the query would look like:

find({ cId: { $in: [1, 2, … N] }, month: { $in: [“2012-10”, "2012-11", "2012-12"] }).sort({ price: -1 }).limit(-100)

Note the negative limit is important. You would expect the server to just have to merge sort a few discrete branches of the btree, as illustrated below.

image

Issue with $in

Unfortunately there is one big hurdle to this working well, and it is described in great lengths here. The bottom line is that you should revert to one of the alternate solutions, but most of them are not applicable when dealing with faceted search:

  • it is difficult to cover every field with the index, unless the filtering is done client side. This prevents the v2.2 optimization.
  • don’t have an idea of ranges on the sort field
  • cannot use the fan out strategy

One Solution: Multiple Queries

One solution that works really well is the unwind the “$in“ predicates into multiple queries that cover all “required” fields combination (e.g. customer id and month). This may seem problematic but in reality it rarely leads to more than a dozen queries. For example:

find({ cId: { 1, month: “2012-10” }).sort({ price: -1 }).limit(-100)

find({ cId: { 1, month: “2012-11” }).sort({ price: -1 }).limit(-100)

find({ cId: { 2, month: “2012-10” }).sort({ price: -1 }).limit(-100)

Each query will be very fast, under a millisecond even if you use extra predicates (especially if covered by the index). Consequently even complex search should be able to fetch 100 items for each combination within a few milliseconds overall, and then the client code can merge sort those results very quickly. A little bit of pain, yes, but yielding great rewards.

Pagination

Pagination is slightly tricky… But there are very efficient and accurate ways to implement it. It will probably be part of another post.

Conclusion

Faceted search is a difficult problem to implement efficiently, and most databases are not good at it. With MongoDB you will encounter a few hurdles that should be removed during index API refactoring of v2.6. In the meantime you should use the above tips!

Strategies and techniques for building scalable and resilient microservices to refactor a monolithic application step-by-step, a free O'Reilly book. Brought to you in partnership with Lightbend.

Topics:

Published at DZone with permission of Antoine Girbal, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}