Complex Solr Faceting

The Java Zone is brought to you in partnership with JetBrains.  Learn about instant and clever code completion, on-the-fly code analysis and reliable refactoring tools with IntelliJ IDEA.

When you use Solr faceting, sooner or later someone will request a complex facet, one that at first sight seems impossible with standard Solr faceting. However, with some creative use of facets and a small amount of extra logic in your application, you can build a very advanced faceting interface. In this example I will explain:

  • How to create multiple facets on the same field, using a key to give each facet a unique name and excluding tagged filter queries to get different counts
  • How to use the multiple facet count results to render a single interface element

The best way to explain this is with a test setup and a good dataset. In the following examples I’ll use the same test setup as the (excellent!) book ‘Solr 1.4 Enterprise Search Server’, which is based on MusicBrainz data.
If you have read the book and have the test environment, you already know this dataset and can test the queries yourself. For others, I’ll explain the most important parts of the index. I do assume a solid understanding of the Solr (faceting) basics for the rest of this example, as I won’t be explaining basic Solr usage.

I’ll be using the ‘mbtracks’ index, which contains all kinds of (music) tracks. To keep the example simple I will only use a few fields for filter queries and faceting; the most important is the ‘t_r_attributes’ field. This is a multivalued integer field with the following (enumerated) content:

[Image: facet-example]

0 = Non-album track
1 = Album
2 = Single
3 = EP
4 = Compilation
5 = Soundtrack
6 = Spokenword
7 = Interview
8 = Audiobook
9 = Live
10 = Remix
11 = Other
100 = Official
101 = Promotion
102 = Bootleg
103 = Pseudo-Release

Note that 0-11 are ‘release types’ and 100-103 indicate the official state, so two separate lists are actually stored in one multivalued field. Most rows will have one of each, for instance ‘1’ and ‘100’ for a track released on an official album.
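Because both lists share one field, application code has to separate the values again. The following is a minimal sketch of that split, assuming the enumeration above; the names `RELEASE_TYPES`, `OFFICIAL_STATES` and `split_attributes` are hypothetical helpers and not part of Solr or the book's setup.

```python
# Labels copied from the enumeration above.
RELEASE_TYPES = {
    0: "Non-album track", 1: "Album", 2: "Single", 3: "EP",
    4: "Compilation", 5: "Soundtrack", 6: "Spokenword", 7: "Interview",
    8: "Audiobook", 9: "Live", 10: "Remix", 11: "Other",
}
OFFICIAL_STATES = {
    100: "Official", 101: "Promotion", 102: "Bootleg", 103: "Pseudo-Release",
}


def split_attributes(values):
    """Split raw t_r_attributes values into (release types, official states)."""
    release_types = [RELEASE_TYPES[v] for v in values if v in RELEASE_TYPES]
    states = [OFFICIAL_STATES[v] for v in values if v in OFFICIAL_STATES]
    return release_types, states


# A track released on an official album carries the values 1 and 100.
print(split_attributes([1, 100]))
```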

Now imagine the following scenario: a website is built on this Solr index. There is a special page on the website for ‘Pseudo-Releases’ that allows users to apply additional filters for ‘release type’ and ‘duration’. See my very simple sketch above.
The ‘release type’ facet has some special features; the sketch shows them all:

  • One selected facet ‘Album’ with a count.
  • One available facet ‘Single’ with a count.
  • One facet ‘EP’ that is greyed out. This indicates that there are EPs with the ‘pseudo release’ attribute, but they are ruled out by other selected facet options like ‘duration’. The count represents the number of ruled-out items.
  • One facet ‘Interview’ marked ‘N/A’, indicating that there are no interviews with the ‘pseudo release’ attribute at all, although ‘Interview’ does exist as a ‘release type’.

Of course you could question this design. Why show options that are not available anyway? How do I expect the user to understand the different types of options? Good questions for a real world design, but this was the best example I could make up for this dataset. There are many real world cases where this makes much more sense.

The special features of this facet make it impossible to generate from a normal (single) Solr facet. It’s important to notice that the requirements actually combine several different types of faceting into one user interface element. Once you determine the different types of facet data you need, you can get that data by using multiple facets with tagging and excluding. After executing the Solr query, you can render the element from the merged data.

Let’s start with the base query: first a readable, commented version, followed by a copy-pastable version for easy testing. In the commented version, each ‘>’ line describes the parameter line directly above it.

http://localhost:8983/solr/mbtracks/select/
> path to solr index...
?q=*%3A*&version=2.2&start=0&rows=10&indent=on
> query any documents and some standard Solr settings
&fq={!tag=releasefilter}t_r_attributes:103
> filterquery on 'pseudo releases', note the tag
&facet=on&facet.mincount=1
> enable faceting and set mincount to 1
&facet.field={!key=attributes}t_r_attributes
> facet on attributes, note the 'key' that allows for multiple facets on the same field
&facet.field={!key=na_attributes ex=releasefilter}t_r_attributes
> facet on the same field, but this time excluding the releasefilter!
&facet.query={!key=duration_short}t_duration:[* TO 120]
&facet.query={!key=duration_long}t_duration:[120 TO *]
> facet queries for duration, using keys for an easy-to-use result set

http://localhost:8983/solr/mbtracks/select/?q=*%3A*&fq={!tag=releasefilter}t_r_attributes:103&version=2.2&start=0&rows=10&indent=on&facet=on&facet.mincount=1&facet.field={!key=attributes}t_r_attributes&facet.field={!key=na_attributes ex=releasefilter}t_r_attributes&facet.query={!key=duration_short}t_duration:[* TO 120]&facet.query={!key=duration_long}t_duration:[120 TO *]
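If you build this URL in application code, repeated parameters such as `facet.field` are easiest to keep as a list of pairs. The sketch below uses only Python's standard library; the function name `build_base_query_url` is hypothetical, and the host and parameters are simply those of the copy-pastable query above.

```python
from urllib.parse import urlencode, parse_qsl

# Solr endpoint taken from the copy-pastable query above.
SOLR_SELECT_URL = "http://localhost:8983/solr/mbtracks/select/"


def build_base_query_url():
    """Build the base query; a list of pairs allows repeated parameter names."""
    params = [
        ("q", "*:*"),
        ("version", "2.2"),
        ("start", "0"),
        ("rows", "10"),
        ("indent", "on"),
        # Tagged filter query, so that facets can exclude it by tag.
        ("fq", "{!tag=releasefilter}t_r_attributes:103"),
        ("facet", "on"),
        ("facet.mincount", "1"),
        # Two facets on the same field, told apart by their 'key'.
        ("facet.field", "{!key=attributes}t_r_attributes"),
        ("facet.field", "{!key=na_attributes ex=releasefilter}t_r_attributes"),
        ("facet.query", "{!key=duration_short}t_duration:[* TO 120]"),
        ("facet.query", "{!key=duration_long}t_duration:[120 TO *]"),
    ]
    # urlencode percent-encodes the local params, brackets and spaces.
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_base_query_url()
# Round trip: decode the query string again and list both facets on the field.
facet_fields = [v for k, v in parse_qsl(url.split("?", 1)[1]) if k == "facet.field"]
print(facet_fields)
```

Letting `urlencode` do the escaping avoids hand-encoding the `{!...}` local params and the spaces inside the range queries.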

In the base query there are two facets for the release types: one for counting the available release types, and one for the N/A ones. In this case the second facet is not necessary; the same result could be achieved with a single facet with facet.mincount set to 0. But this base query is easier to build upon for the next example, a complex query where the user has selected the release types 1 and 2 (Album and Single) and the duration ‘short’:

&fq={!tag=durationfilter}t_duration:[* TO 120]
&fq={!tag=userchoice}t_r_attributes:1 OR 2
&facet.field={!key=attributes ex=userchoice}t_r_attributes
> the facet for 'active' attributes takes all filterqueries into account, except chosen categories (so the available ones also get a count)
&facet.field={!key=ruledout_attributes ex=durationfilter,userchoice}t_r_attributes
> this facet counts all attributes that would have been available, had the user not chosen a duration
&facet.field={!key=na_attributes ex=releasefilter,durationfilter,userchoice}t_r_attributes
> facet on all attributes (excludes all filters)
&facet.query={!key=duration_short ex=durationfilter}t_duration:[* TO 120]
&facet.query={!key=duration_long ex=durationfilter}t_duration:[120 TO *]

http://localhost:8983/solr/mbtracks/select/?q=*%3A*&fq={!tag=releasefilter}t_r_attributes:103&fq={!tag=durationfilter}t_duration:[* TO 120]&fq={!tag=userchoice}t_r_attributes:1 OR 2&version=2.2&start=0&rows=10&indent=on&facet=on&facet.mincount=1&facet.field={!key=attributes}t_r_attributes&facet.field={!key=ruledout_attributes ex=durationfilter,userchoice}t_r_attributes&facet.field={!key=na_attributes ex=releasefilter,durationfilter,userchoice}t_r_attributes&facet.query={!key=duration_short ex=durationfilter}t_duration:[* TO 120]&facet.query={!key=duration_long ex=durationfilter}t_duration:[120 TO *]
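In an application these parameters would be generated from the user's selections. The sketch below shows one possible way to do that; `build_filter_params` and its arguments are hypothetical names. It follows the commented version (the ‘attributes’ facet excludes `userchoice`). It also departs from the query above in one respect, as an assumption of this sketch: the selected values are grouped in parentheses, because the standard Lucene query parser applies a field name only to the term directly after it, so `t_r_attributes:1 OR 2` looks up `2` in the default field.

```python
def build_filter_params(release_types, duration=None):
    """Return (name, value) pairs for the tagged filters and the facets.

    release_types: iterable of selected release type ids, e.g. [1, 2]
    duration: "short", "long" or None (no duration filter chosen)
    """
    params = [("fq", "{!tag=releasefilter}t_r_attributes:103")]

    if duration == "short":
        params.append(("fq", "{!tag=durationfilter}t_duration:[* TO 120]"))
    elif duration == "long":
        params.append(("fq", "{!tag=durationfilter}t_duration:[120 TO *]"))

    ids = " OR ".join(str(t) for t in release_types)
    if ids:
        # Parentheses (an assumption of this sketch) keep every value on the field.
        params.append(("fq", "{!tag=userchoice}t_r_attributes:(%s)" % ids))

    params += [
        ("facet", "on"),
        ("facet.mincount", "1"),
        # Active attributes: all filters except the user's own release type choice.
        ("facet.field", "{!key=attributes ex=userchoice}t_r_attributes"),
        # Attributes that the duration filter rules out.
        ("facet.field",
         "{!key=ruledout_attributes ex=durationfilter,userchoice}t_r_attributes"),
        # All attributes in the index: every filter excluded.
        ("facet.field",
         "{!key=na_attributes ex=releasefilter,durationfilter,userchoice}t_r_attributes"),
        ("facet.query", "{!key=duration_short ex=durationfilter}t_duration:[* TO 120]"),
        ("facet.query", "{!key=duration_long ex=durationfilter}t_duration:[120 TO *]"),
    ]
    return params


# The scenario from the text: Album and Single selected, duration 'short'.
for name, value in build_filter_params([1, 2], "short"):
    print(name, "=", value)
```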

The (shortened) result XML of this query:

<result name="response" numFound="2186" start="0">
  <doc>
    <str name="id">Track:1820660</str>
    <str name="t_a_id">133473</str>
    <str name="t_a_name">Skoop On Somebody</str>
    <int name="t_duration">23</int>
    <str name="t_name">Introduction</str>
    <int name="t_num">1</int>
    <arr name="t_r_attributes">...</arr>
    <str name="t_r_id">178049</str>
    <str name="t_r_name">Key of Love</str>
    <int name="t_r_tracks">15</int>
    <int name="t_trm_lookups">0</int>
    <str name="type">Track</str>
  </doc>
</result>
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="duration_short">2186</int>
    <int name="duration_long">11519</int>
  </lst>
  <lst name="facet_fields">
    <lst name="attributes">
      <int name="1">2186</int>
      <int name="103">2186</int>
      <int name="0">2028</int>
      <int name="5">142</int>
      <int name="2">4</int>
    </lst>
    <lst name="ruledout_attributes">
      <int name="103">30810</int>
      <int name="0">30445</int>
      <int name="1">13690</int>
      <int name="5">7895</int>
      <int name="4">4353</int>
      <int name="2">2670</int>
      <int name="9">1007</int>
      <int name="3">701</int>
      <int name="10">399</int>
      <int name="6">113</int>
      <int name="11">107</int>
      <int name="8">12</int>
    </lst>
    <lst name="na_attributes">
      <int name="0">6976389</int>
      <int name="100">6009955</int>
      <int name="1">3277697</int>
      <int name="4">1987317</int>
      <int name="9">407872</int>
      <int name="102">263344</int>
      <int name="5">242589</int>
      <int name="2">211736</int>
      <int name="3">166152</int>
      <int name="101">143203</int>
      <int name="10">51606</int>
      <int name="11">48602</int>
      <int name="8">48167</int>
      <int name="6">39555</int>
      <int name="103">30810</int>
      <int name="7">2406</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
</lst>
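To work with these counts in code, the facet lists first have to be read from the response. The sketch below does this with Python's standard `xml.etree.ElementTree`; `parse_facet_fields` is a hypothetical helper, and `SAMPLE` is a trimmed, well-formed excerpt using a few of the values shown above. Converting the names to integers is an assumption that only holds for an integer field such as ‘t_r_attributes’.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of a Solr XML response, with values from the result above.
SAMPLE = """<response>
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="duration_short">2186</int>
    <int name="duration_long">11519</int>
  </lst>
  <lst name="facet_fields">
    <lst name="attributes">
      <int name="1">2186</int>
      <int name="5">142</int>
    </lst>
    <lst name="na_attributes">
      <int name="7">2406</int>
    </lst>
  </lst>
</lst>
</response>"""


def parse_facet_fields(xml_text):
    """Return {facet key: {attribute value: count}} for every field facet."""
    root = ET.fromstring(xml_text)
    fields = {}
    # Each child <lst> of facet_fields is one facet, named by its 'key'.
    for facet in root.findall(".//lst[@name='facet_fields']/lst"):
        fields[facet.get("name")] = {
            int(entry.get("name")): int(entry.text)
            for entry in facet.findall("int")
        }
    return fields


print(parse_facet_fields(SAMPLE))
```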

The result set contains everything you need to render the release type filter element, and it’s actually quite easy:

  1. Render the items from the ‘attributes’ facet as available items, with the selected ones checked and the rest unchecked. Display the Solr counts.
  2. Render the ‘ruledout_attributes’ items as greyed out. Ignore the Solr counts and display a zero, as the actual count is not relevant, only the fact that there would be a count if you excluded the filters.
  3. Render the ‘na_attributes’ items as N/A. Again the Solr count is not relevant; the only relevant thing is that there is a count.

For each step, exclude any rows that have already been rendered in a previous step (by name).

For this dataset all rows with a name of 100 or higher should be ignored, as they represent the ‘official state’ instead of a ‘release type’. The attribute with value 0 is also ignored, as every single document has this attribute value.
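The three steps and the ignore rules can be sketched as a small merge function. This is only an illustration: `merge_release_type_facets`, the row layout `(value, state, count)` and the state names are hypothetical, and listing the selected items first is a choice made for this sketch. The facet data is copied from the result XML above, with `selected={1, 2}` mirroring the user's choice of Album and Single.

```python
def merge_release_type_facets(facets, selected):
    """Merge the three facets into rows of (value, state, count)."""
    seen = set()

    def fresh(key):
        # Yield release types from one facet that were not rendered before.
        for value, count in facets.get(key, {}).items():
            # Skip 0 and the 'official state' values (100 and up).
            if value == 0 or value >= 100 or value in seen:
                continue
            seen.add(value)
            yield value, count

    # Step 1: available items with their Solr counts, selected ones first.
    step1 = [(v, "selected" if v in selected else "available", c)
             for v, c in fresh("attributes")]
    step1.sort(key=lambda row: row[1] != "selected")  # stable sort
    # Step 2: greyed-out items; the count is replaced by zero.
    step2 = [(v, "ruled_out", 0) for v, _ in fresh("ruledout_attributes")]
    # Step 3: N/A items; the count is not relevant.
    step3 = [(v, "n/a", None) for v, _ in fresh("na_attributes")]
    return step1 + step2 + step3


facets = {
    "attributes": {1: 2186, 103: 2186, 0: 2028, 5: 142, 2: 4},
    "ruledout_attributes": {103: 30810, 0: 30445, 1: 13690, 5: 7895,
                            4: 4353, 2: 2670, 9: 1007, 3: 701, 10: 399,
                            6: 113, 11: 107, 8: 12},
    "na_attributes": {0: 6976389, 100: 6009955, 1: 3277697, 4: 1987317,
                      9: 407872, 102: 263344, 5: 242589, 2: 211736,
                      3: 166152, 101: 143203, 10: 51606, 11: 48602,
                      8: 48167, 6: 39555, 103: 30810, 7: 2406},
}

rows = merge_release_type_facets(facets, selected={1, 2})
for row in rows:
    print(row)
```

With this data, EP (3) ends up greyed out and Interview (7) is the only N/A row, which matches the special features listed for the sketch.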

After rendering the above data you would get this interface element:

Published at DZone with permission of Bas De Nooijer, DZone MVB.

Opinions expressed by DZone contributors are their own.
