Over a million developers have joined DZone.

Fun with Lucene's Faceted Search Module

DZone's Guide to

Fun with Lucene's Faceted Search Module

· Big Data Zone ·
Free Resource

How to Simplify Apache Kafka. Get eBook.

These days faceted search and navigation is common and users have come to expect and rely upon it. 

Lucene's facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice "getting started" examples in his second post

The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementationBobo is yet another facet implementation! I'm sure there are more... 

The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search. 

Lucene's nightly performance benchmarks

I was curious about the performance of faceted search, so I added date facets, indexed asyear/month/day hierarchy, to the nightly Lucene benchmarks. Specifically I added faceting to allTermQuerys that were already tested, and now we can watch this graph to track our faceted search performance over time. The date field is the timestamp of the most recent revision of each Wikipedia page. 

Simple performance tests

I also ran some simple initial tests on a recent (5/2/2012) English Wikipedia export, which contains 30.2 GB of plain text across 33.3 million documents. By default, faceted search retrieves the counts of all facet values under the root node (years, in this case):

     Date (3994646)
       2012 (1990192)
       2011 (752327)
       2010 (380977)
       2009 (275152)
       2008 (271543)
       2007 (211688)
       2006 (98809)
       2005 (12846)
       2004 (1105)
       2003 (7)
It's interesting that 2012 has such a high count, even though this export only includes the first five months and two days of 2012. Wikipedia's pages are very actively edited! 

The search index with facets grew only slightly (~2.3%, from 12.5 GB to 12.8 GB) because of the additional indexed facet field. The taxonomy index, which is a separate index used to map facets to fixed integer codes, was tiny: only 120 KB. The more unique facet values you have, the larger this index will be. 

Next I compared search performance with and without faceting. A simple  TermQuery ( party), matching just over a million hits, was 51.2 queries per second (QPS) without facets and 3.4 QPS with facets. While this is a somewhat scary slowdown, it's the worst case scenario:  TermQuery is very cheap to execute, and can easily match a large number of hits. The cost of faceting is in proportion to the number of hits. It would be nice to speed this up (patches welcome!). 

I also tested a harder  PhraseQuery ( "the village"), matching 194 K hits: 3.8 QPS without facets and 2.8 QPS with facets, which is less of a hit because  PhraseQuery takes more work to match each hit and generally matches fewer hits. 

Loading facet data in RAM

For the above results I used the facet defaults, where the per-document facet values are left on disk during aggregation. If you have enough RAM you can also load all facet values into RAM using the  CategoryListCache class. I tested this, and it gave nice speedups: the  TermQuery was 73% faster (to 6.0 QPS) and the  PhraseQuery was 19% faster. 

However, there are downsides: it's time-consuming to initialize (4.6 seconds in my test), and not NRT-friendly, though this shouldn't be so hard to fix (patches welcome!). It also required a substantial 1.9 GB RAM, according to Lucene's  RamUsageEstimator. We should be able to reduce this RAM usage by switching to Lucene's  fast packed ints implementation from the current int[][] it uses today, or by using  DocValues to hold the per-document facet data. I just opened LUCENE-4602 to explore  DocValues and initial results look very promising. 


Next I tried sampling, where the facet module visits 1% of the hits (by default) and only aggregates counts for those. In the default mode, this sampling is used only to find the top N facet values, and then a second pass computes the correct count for each of those values. This is a good fit when the taxonomy is wide and flat, and counts are pretty evenly distributed. I tested that, but results were slower, because the date taxonomy is not wide and flat and has rather lopsided counts (2012 has the majority of hits). 

You can also skip the second pass and then present approximate counts or a percentage value to the user. I tested that and saw sizable gains: the  TermQuery was 248% (2.5X) faster (to 12.2 QPS) and the  PhraseQuery was 29% faster (to 3.6 QPS). The sampling is also quite configurable: you can set the min and max sample sizes, the sample ratio, the threshold under which no sampling should happen, etc. 

Lucene's facet module makes it trivial to add facets to your search application, and offers useful features like sampling, alternative aggregates, complements, RAM caching, and fully customizable interfaces for many aspects of faceting. I'm hopeful we can reduce the RAM consumption for caching, and speed up the overall performance, over time.

12 Best Practices for Modern Data Ingestion. Download White Paper.


Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}