Fun with Lucene's Faceted Search Module
These days, faceted search and navigation are common, and users have come to expect and rely on them.
Lucene's facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice "getting started" examples in his second post.
The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I'm sure there are more...
The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search.
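To make the complements idea concrete, here is a minimal sketch that assumes each document carries exactly one facet ordinal and that total per-ordinal counts for the whole index are already known. The class and method names are hypothetical and are not Lucene's API; the point is only that when most documents match, it is cheaper to subtract the non-matching documents from the totals than to visit every hit.

```java
import java.util.BitSet;

/** Hypothetical illustration of complement counting; not Lucene's facet API. */
final class ComplementCounts {

    /** Counts facet ordinals by visiting every matching document. */
    static int[] countDirect(int[] docOrdinal, BitSet hits, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            counts[docOrdinal[doc]]++;
        }
        return counts;
    }

    /** Produces the same counts, but visits only the documents that did NOT match. */
    static int[] countViaComplement(int[] docOrdinal, BitSet hits, int[] totalCounts) {
        int[] counts = totalCounts.clone();
        int numDocs = docOrdinal.length;
        for (int doc = hits.nextClearBit(0); doc < numDocs; doc = hits.nextClearBit(doc + 1)) {
            counts[docOrdinal[doc]]--;
        }
        return counts;
    }
}
```

With 6 documents of which 4 match, the direct path does 4 increments while the complement path does only 2 decrements, and both return the same array.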
Lucene's nightly performance benchmarks
I was curious about the performance of faceted search, so I added date facets, indexed as a year/month/day hierarchy, to the nightly Lucene benchmarks. Specifically, I added faceting to all TermQuerys that were already tested, and now we can watch this graph to track our faceted search performance over time. The date field is the timestamp of the most recent revision of each Wikipedia page.
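As a rough illustration of what a year/month/day hierarchy means for counting, the sketch below turns a date into a path and increments a counter for every prefix of that path, so a parent such as Date/2012 aggregates all of its children. The class, the "Date" root label and the "/" separator are assumptions made for the example; the facet module does this through its own category path and taxonomy classes.

```java
import java.time.LocalDate;
import java.util.Map;

/** Hypothetical illustration of hierarchical date facets; not Lucene's facet API. */
final class DateFacetPaths {

    /** Returns the path components for a date, for example Date, 2012, 5, 2. */
    static String[] pathFor(LocalDate date) {
        return new String[] {
            "Date",
            Integer.toString(date.getYear()),
            Integer.toString(date.getMonthValue()),
            Integer.toString(date.getDayOfMonth())
        };
    }

    /** Increments the count of every prefix of the path, so parents sum their children. */
    static void countAllLevels(String[] path, Map<String, Integer> counts) {
        StringBuilder prefix = new StringBuilder();
        for (String component : path) {
            if (prefix.length() > 0) {
                prefix.append('/');
            }
            prefix.append(component);
            counts.merge(prefix.toString(), 1, Integer::sum);
        }
    }
}
```

Drilling down is then just a matter of reading the counts for the children of whichever node the user selected.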
Simple performance tests
I also ran some simple initial tests on a recent (5/2/2012) English Wikipedia export, which contains 30.2 GB of plain text across 33.3 million documents. By default, faceted search retrieves the counts of all facet values under the root node (years, in this case):
Date (3994646)
  2012 (1990192)
  2011 (752327)
  2010 (380977)
  2009 (275152)
  2008 (271543)
  2007 (211688)
  2006 (98809)
  2005 (12846)
  2004 (1105)
  2003 (7)
It's interesting that 2012 has such a high count, even though this export only includes the first five months and two days of 2012. Wikipedia's pages are very actively edited!
The search index with facets grew only slightly (~2.3%, from 12.5 GB to 12.8 GB) because of the additional indexed facet field. The taxonomy index, which is a separate index used to map facets to fixed integer codes, was tiny: only 120 KB. The more unique facet values you have, the larger this index will be.
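The idea of mapping facets to fixed integer codes can be sketched as a small two-way dictionary. This is a hypothetical in-memory stand-in, not the on-disk taxonomy index itself; it only shows why the size depends on the number of unique facet values rather than on the number of documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for a taxonomy: facet path <-> integer ordinal. */
final class TinyTaxonomy {
    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<String> paths = new ArrayList<>();

    /** Returns the ordinal for a path, assigning the next free integer on first sight. */
    int addCategory(String path) {
        Integer existing = ordinals.get(path);
        if (existing != null) {
            return existing;
        }
        int ordinal = paths.size();
        ordinals.put(path, ordinal);
        paths.add(path);
        return ordinal;
    }

    /** Maps an ordinal back to its path, for labeling results. */
    String getPath(int ordinal) {
        return paths.get(ordinal);
    }

    /** Number of unique facet values seen so far. */
    int size() {
        return paths.size();
    }
}
```

Adding the same path for a million documents leaves the size at one entry, which is consistent with a date taxonomy staying tiny even on a large index.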
Next I compared search performance with and without faceting. A simple TermQuery (party), matching just over a million hits, was 51.2 queries per second (QPS) without facets and 3.4 QPS with facets. While this is a somewhat scary slowdown, it's the worst-case scenario: TermQuery is very cheap to execute, and can easily match a large number of hits. The cost of faceting is in proportion to the number of hits. It would be nice to speed this up (patches welcome!).
I also tested a harder PhraseQuery ("the village"), matching 194 K hits: 3.8 QPS without facets and 2.8 QPS with facets, which is less of a hit because PhraseQuery takes more work to match each hit and generally matches fewer hits.
Loading facet data in RAM
For the above results I used the facet defaults, where the per-document facet values are left on disk during aggregation. If you have enough RAM you can also load all facet values into RAM using the CategoryListCache class. I tested this, and it gave nice speedups: the TermQuery was 73% faster (to 6.0 QPS) and the PhraseQuery was 19% faster.
However, there are downsides: it's time-consuming to initialize (4.6 seconds in my test), and not NRT-friendly, though this shouldn't be so hard to fix (patches welcome!). It also required a substantial 1.9 GB RAM, according to Lucene's RamUsageEstimator. We should be able to reduce this RAM usage by switching to Lucene's fast packed ints implementation from the current int it uses today, or by using DocValues to hold the per-document facet data. I just opened LUCENE-4602 to explore DocValues and initial results look very promising.
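To show why packed ints can cut RAM, here is a minimal sketch that stores each value in a fixed number of bits inside a long[] instead of spending a full 32-bit int per value. It is a simplified, hypothetical class (values never span two longs) and not Lucene's packed ints implementation.

```java
/** Hypothetical fixed-width bit packing; not Lucene's packed ints implementation. */
final class PackedIntsSketch {
    private final long[] blocks;
    private final int bitsPerValue;
    private final int valuesPerBlock;
    private final long mask;

    PackedIntsSketch(int valueCount, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 32) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..32");
        }
        this.bitsPerValue = bitsPerValue;
        this.valuesPerBlock = 64 / bitsPerValue; // whole values per 64-bit block
        this.mask = (1L << bitsPerValue) - 1;
        this.blocks = new long[(valueCount + valuesPerBlock - 1) / valuesPerBlock];
    }

    /** Stores the low bitsPerValue bits of value at the given index. */
    void set(int index, long value) {
        int block = index / valuesPerBlock;
        int shift = (index % valuesPerBlock) * bitsPerValue;
        blocks[block] = (blocks[block] & ~(mask << shift)) | ((value & mask) << shift);
    }

    /** Reads back the value stored at the given index. */
    long get(int index) {
        int block = index / valuesPerBlock;
        int shift = (index % valuesPerBlock) * bitsPerValue;
        return (blocks[block] >>> shift) & mask;
    }

    /** Bytes used by the backing array, ignoring object headers. */
    long ramBytes() {
        return 8L * blocks.length;
    }
}
```

For example, 1000 ordinals that each fit in 12 bits need 200 longs (1600 bytes) here, versus 4000 bytes in a plain int array.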
Next I tried sampling, where the facet module visits 1% of the hits (by default) and only aggregates counts for those. In the default mode, this sampling is used only to find the top N facet values, and then a second pass computes the correct count for each of those values. This is a good fit when the taxonomy is wide and flat, and counts are pretty evenly distributed. I tested that, but results were slower, because the date taxonomy is not wide and flat and has rather lopsided counts (2012 has the majority of hits).
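The two-pass scheme can be sketched as follows. The sketch assumes the hits are given as an array of facet ordinals and uses every stride-th hit as a deterministic stand-in for random sampling; the class and method names are hypothetical and not the facet module's API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of sample-then-recount; not Lucene's facet API. */
final class SampledTopN {

    /** Counts ordinals over every stride-th hit only (stand-in for random sampling). */
    static int[] sampleCounts(int[] hitOrdinals, int numOrdinals, int stride) {
        int[] counts = new int[numOrdinals];
        for (int i = 0; i < hitOrdinals.length; i += stride) {
            counts[hitOrdinals[i]]++;
        }
        return counts;
    }

    /**
     * Pass 1 picks the top-N candidate ordinals from the sample.
     * Pass 2 visits all hits but counts only those candidates exactly.
     * Returns rows of {ordinal, exactCount}, ordered by sampled count.
     */
    static int[][] topNWithExactCounts(int[] hitOrdinals, int numOrdinals, int stride, int topN) {
        int[] sampled = sampleCounts(hitOrdinals, numOrdinals, stride);
        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < numOrdinals; ord++) {
            ords.add(ord);
        }
        ords.sort(Comparator.comparingInt((Integer ord) -> -sampled[ord]));

        int n = Math.min(topN, numOrdinals);
        int[][] result = new int[n][2];
        for (int i = 0; i < n; i++) {
            result[i][0] = ords.get(i);
        }
        for (int ordinal : hitOrdinals) {
            for (int[] entry : result) {
                if (entry[0] == ordinal) {
                    entry[1]++;
                }
            }
        }
        return result;
    }
}
```

The second pass still touches every hit, which is one plausible reason the default mode gains little on a narrow, lopsided taxonomy like the date facets here.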
You can also skip the second pass and then present approximate counts or a percentage value to the user. I tested that and saw sizable gains: the TermQuery was 248% (2.5X) faster (to 12.2 QPS) and the PhraseQuery was 29% faster (to 3.6 QPS). The sampling is also quite configurable: you can set the min and max sample sizes, the sample ratio, the threshold under which no sampling should happen, etc.
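When the second pass is skipped, what the user sees has to be derived from the sample alone. A minimal sketch of the two presentations mentioned above, using hypothetical helper names rather than anything from the facet module:

```java
/** Hypothetical helpers for presenting sampled facet counts; not Lucene's facet API. */
final class ApproximateFacets {

    /** Scales a sampled count up to an estimate over all hits. */
    static long estimateCount(int sampledCount, int sampleSize, long totalHits) {
        return Math.round((double) sampledCount * totalHits / sampleSize);
    }

    /** Share of the sample carrying this facet value, as a percentage. */
    static double percentage(int sampledCount, int sampleSize) {
        return 100.0 * sampledCount / sampleSize;
    }
}
```

Percentages are often the safer thing to display, since a scaled-up count can look more precise than it really is.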
Lucene's facet module makes it trivial to add facets to your search application, and offers useful features like sampling, alternative aggregates, complements, RAM caching, and fully customizable interfaces for many aspects of faceting. I'm hopeful we can reduce the RAM consumption for caching, and speed up the overall performance, over time.
Published at DZone with permission of Michael McCandless, DZone MVB. See the original article here.