Use Cases of Faceted Search for Apache Solr
In this post I describe some use cases of facets in Apache Solr. Please submit your own ideas in the comments.
This post is split into the following parts:
- What are facets?
- How do you enable and use simple facets?
- What are other use cases?
  - Category navigation
  - Autocompletion
  - Trending keywords or links
  - RSS feeds
- Conclusion
What are facets?
In Apache Solr, elements for navigational purposes are called facets. Keep in mind that Solr provides filter queries (specified via the HTTP parameter fq), which filter documents out of the search result. In contrast, facet queries only provide information (a count of documents) and do not change the result documents, i.e. they provide 'filter queries for future queries'. So you define a facet query and then see how many documents you can expect if you apply the related filter query.
But a picture, taken from this great facet introduction, is worth a thousand words:
What do you see?
- You see different facets like manufacturer, resolution, …
- Every facet has some constraints, with which users can easily filter their search results
- The breadcrumb shows all selected constraints and allows removing them
All these values can be extracted from Solr's search results and can be defined at query time, which may look surprising if you come from FAST ESP. Nevertheless, the fields on which you facet need to be indexed and untokenized, e.g. string or integer. In particular, the field type must not be the default 'text' type, which is tokenized; otherwise Solr counts the individual tokens rather than the whole field values.
In Solr you have
- normal facets, used via facet.field,
- facet queries, and
- date facets, which are similar to the newer and more general range facets
Normal facets are useful if your documents have, for example, a manufacturer string field, so that a document falls into the 'sony' or 'nikon' bucket. In contrast, you will need facet queries for numeric fields such as price. For example, if you specify a facet query from 0 to 10 EUR, Solr will calculate on the fly all documents which fall into that bucket. But facet queries become relatively unhandy if you have several equally sized ranges like 0-10, 10-20, 20-30, … EUR. Then you can use range facets.
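To make the difference concrete, here is a minimal sketch that builds both variants for the same buckets: one facet query per range, and a single range facet. It uses plain JDK classes only. The field name `price` and the class and method names are assumptions made for illustration; the `facet.range` parameter names are the ones documented for Solr's range faceting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Solr or SolrJ.
final class RangeFacetParams {

    /** One facet query per bucket: the "unhandy" variant. */
    static List<String> facetQueries(String field, int start, int end, int gap) {
        List<String> queries = new ArrayList<>();
        for (int from = start; from < end; from += gap) {
            queries.add(field + ":[" + from + " TO " + (from + gap) + "]");
        }
        return queries;
    }

    /** The same buckets expressed as a single range facet. */
    static Map<String, String> rangeFacet(String field, int start, int end, int gap) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("facet", "true");
        params.put("facet.range", field);
        params.put("f." + field + ".facet.range.start", String.valueOf(start));
        params.put("f." + field + ".facet.range.end", String.valueOf(end));
        params.put("f." + field + ".facet.range.gap", String.valueOf(gap));
        return params;
    }

    /** Joins the parameters with '&'; these values need no URL encoding. */
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(facetQueries("price", 0, 30, 10));
        System.out.println(toQueryString(rangeFacet("price", 0, 30, 10)));
    }
}
```

Note that the bracketed facet queries include both bounds, so a price of exactly 10 is counted in two buckets, whereas range facets include only the lower bound of each bucket by default.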
Date facets are special range facets. As an example, look at this screenshot from Jetwick:
Here the interval (which is called the gap) for every bucket is one day.
For a nice introduction to facets, have a look at this publication or use the Solr wiki here.
How do you enable and use simple facets?
As stated before, they can be enabled at query time. For the HTTP API you add "&facet=true&facet.field=manu" to your normal query "http://localhost:8983/solr/select?q=*:*". For SolrJ you do:
new SolrQuery("*:*").setFacet(true).addFacetField("manu");
In the XML returned from the Solr server you will get something like this, again from this post:
<lst name="facet_fields">
  <lst name="manu">
    <int name="canon usa">17</int>
    <int name="olympus">12</int>
    <int name="sony">12</int>
    <int name="panasonic">9</int>
    <int name="nikon">4</int>
  </lst>
</lst>
To retrieve this with SolrJ you don't need to touch any XML, of course. Just get the facet objects:
List<FacetField> facetFields = queryResponse.getFacetFields();
To add facet queries, specify them with addFacetQuery:
solrQuery.addFacetQuery("quality:[* TO 10]").addFacetQuery("quality:[11 TO 100]");
And how would you query for documents which do not have a value for that field? This is easy: q=-field_name:[* TO *]
Now I'll show you how I implemented date facets in Jetwick:
q.setFacet(true).set("facet.date", "{!ex=dt}dt").
    set("facet.date.start", "NOW/DAY-6DAYS").
    set("facet.date.end", "NOW/DAY+1DAY").
    set("facet.date.gap", "+1DAY");
With that query you get 7 day buckets, which are visualized via:
It is important to note that you will have to use local parameters like {!ex=dt} to make sure that if a user applies a facet (i.e. uses the facet query as a filter query), the other buckets of that facet won't get a count of 0. In the picture the filter query was fq={!tag=dt}dt:[2010-12-04T00:00:00.000Z+TO+2010-12-05T00:00:00.000Z]. Again: the filter query needs to start with {!tag=dt} to make this work. Take a look at the DateFilter source code or this for more information.
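As a rough sketch of how the two halves fit together, the helper below builds the tagged filter query for one selected day and the matching facet parameter that excludes it. The class and method names are made up for illustration; the tag and field name `dt` are taken from the example above. The timestamps omit the milliseconds, which Solr also accepts.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Solr, SolrJ or Jetwick.
final class TaggedDateFilter {

    // Solr date fields expect UTC timestamps such as 2010-12-04T00:00:00Z
    private static final DateTimeFormatter SOLR_DATE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'");

    /** Filter query for a single day, tagged so that facets can exclude it. */
    static String dayFilter(String tag, String field, LocalDate day) {
        String from = day.atStartOfDay().format(SOLR_DATE);
        String to = day.plusDays(1).atStartOfDay().format(SOLR_DATE);
        return "{!tag=" + tag + "}" + field + ":[" + from + " TO " + to + "]";
    }

    /** Value for facet.date that ignores every filter carrying the given tag. */
    static String excludingFacetField(String tag, String field) {
        return "{!ex=" + tag + "}" + field;
    }

    public static void main(String[] args) {
        // prints {!tag=dt}dt:[2010-12-04T00:00:00Z TO 2010-12-05T00:00:00Z]
        System.out.println(dayFilter("dt", "dt", LocalDate.of(2010, 12, 4)));
        // prints {!ex=dt}dt
        System.out.println(excludingFacetField("dt", "dt"));
    }
}
```

With SolrJ the first string would go into addFilterQuery and the second into the facet.date parameter; SolrJ takes care of the URL encoding, which is why the spaces are not written as '+' here.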
Be aware that you will have to tune the filterCache in order to keep performance in the green. It is also important to use warming queries to avoid timeouts and to pre-fill the caches with old, heavily used data.
What are other use cases?
1. Category navigation
The problem: you have a tree of categories and your products are assigned to several of those categories.
There are two relatively similar solutions to this problem. I will describe one of them:
- Create a multivalued string field called 'category'. Use the category ID (or the name if you want to avoid DB queries).
- You have a category tree. Make sure a document gets not only the leaf category, but all categories up to the root node.
- Now facet over the category field with '-1' as the limit.
- But what if you want to display only the categories of one level, e.g. if you don't want to show other levels at the same time or if there are too many of them? Then index the category field in the form <level>_category. For that you will need the complete category tree in RAM while indexing. Then use facet.prefix=<level>_ to filter the category list for that level.
- Clicking on a category entry should result in a filter query like fq=category:"<level>_categoryId"
- The slightly tricky part is that your UI or middle tier has to parse the level, e.g. 2, and then append 2+1=3 to the query: facet.prefix=3_
- If you filter by level, one question remains:
  Q: How can you display the path from the selected category up to the root category?
  A: Either get the category parents via the DB, which is easy if you store the category IDs in Solr rather than the category names. Or get the parents from the parameter list, which is a bit more complicated but doable. In this case you'll need to store the category names in Solr.
Please let me know if this explanation makes sense to you or if you want to see it in action. I don't want to advertise for our customers here.
BTW: the second approach I have in mind is to use dynamic fields like category_<level>_s instead of facet.prefix.
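The facet.prefix approach described above can be sketched in a few lines. This is only an illustration under two assumptions: the root category is level 1, and the category names used in main are invented. The class and method names are hypothetical as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for the <level>_category scheme.
final class CategoryLevels {

    /** Values to index into the 'category' field for one path from root to leaf. */
    static List<String> indexTokens(List<String> pathFromRoot) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < pathFromRoot.size(); i++) {
            tokens.add((i + 1) + "_" + pathFromRoot.get(i)); // root is level 1 (assumption)
        }
        return tokens;
    }

    /** Filter query to apply when the user clicks on a category entry. */
    static String filterQuery(String selectedToken) {
        return "category:\"" + selectedToken + "\"";
    }

    /** facet.prefix for the children of the selected category, e.g. "2_x" gives "3_". */
    static String nextLevelPrefix(String selectedToken) {
        int separator = selectedToken.indexOf('_');
        int level = Integer.parseInt(selectedToken.substring(0, separator));
        return (level + 1) + "_";
    }

    public static void main(String[] args) {
        // prints [1_electronics, 2_cameras, 3_dslr]
        System.out.println(indexTokens(Arrays.asList("electronics", "cameras", "dslr")));
        // prints category:"2_cameras"
        System.out.println(filterQuery("2_cameras"));
        // prints 3_
        System.out.println(nextLevelPrefix("2_cameras"));
    }
}
```

Because every document carries the tokens of all its ancestors, the filter on "2_cameras" combined with facet.prefix=3_ returns only level-3 categories that occur on the matching documents.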
2. Autocompletion
The problem: you want to show suggestions as the user types.
You'll need a multivalued 'tag' field. For Jetwick I'm using a heavy noise-word filter to get only terms 'with information' from the very noisy tweet text into the tag field. If you use a shingle filter you can even create phrase suggestions. But here I will describe the "one more word" suggestion, which only suggests the next word (not a completely different phrase).
To do this, create the following query when the user types in some characters (see the getQueryChoices method of SolrTweetSearch):
- Use the old query with all filter queries etc. to provide context-dependent autocompletion (i.e. only give suggestions which will lead to results).
- Split the query into "completed" terms and one "to do" term. E.g. if you enter "michael jack", then michael is complete (ends with a space) and jack should be completed.
- Set the query term of the old query to michael and add facet.prefix=jack
- Set the facet limit to 10.
- Read the 10 suggestions from the facet field but exclude already completed terms.
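The steps above can be sketched as follows. This is not the Jetwick implementation, just a minimal illustration with plain JDK classes. The class and method names are hypothetical, the fallback to *:* for a single word is an assumption, and the filter queries of the old query are left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for "one more word" suggestions via facet.prefix.
final class OneMoreWordSuggest {

    /** Request parameters for an input such as "michael jack". */
    static Map<String, String> params(String input) {
        int lastSpace = input.lastIndexOf(' ');
        // everything before the last space counts as completed
        String completed = lastSpace < 0 ? "" : input.substring(0, lastSpace).trim();
        // the rest is the term that still has to be completed
        String prefix = input.substring(lastSpace + 1);

        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", completed.isEmpty() ? "*:*" : completed); // assumption: match all if nothing is completed
        params.put("facet", "true");
        params.put("facet.field", "tag");
        params.put("facet.prefix", prefix);
        params.put("facet.limit", "10");
        return params;
    }

    /** Turns the returned facet values into suggestions, skipping completed terms. */
    static List<String> suggestions(String completed, List<String> facetValues) {
        Set<String> done = new HashSet<>(Arrays.asList(completed.split("\\s+")));
        List<String> result = new ArrayList<>();
        for (String value : facetValues) {
            if (!done.contains(value)) {
                result.add(completed.isEmpty() ? value : completed + " " + value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(params("michael jack"));
        // prints [michael jackson, michael jacket]
        System.out.println(suggestions("michael", Arrays.asList("jackson", "michael", "jacket")));
    }
}
```

Since facet.prefix is matched against the indexed terms, the typed prefix has to be normalized the same way as the tag field, for example lowercased if the tags are lowercased.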
The implementation for Jetwick, which uses Apache Wicket, is available in the SearchBox source file, which uses MyAutoCompleteTextField and the getQueryChoices method of SolrTweetSearch. But before you implement autocomplete with facets, take a look at this documentation. And if you don't want to use Wicket, there is a jQuery autocomplete library especially for Solr, with no UI layer required.
3. Trending keywords or links
Similar to autocompletion, you will need a tag or link field in your index. Then use the facet counts as an indicator of how important a term is. If you now run a query, e.g. solr, you will get the trending keywords and links depending on the filters. E.g. you can select different days to see the changes:
The keyword panel is implemented in the TagCloudPanel and the link list is available as UrlTrendPanel.
Of course it would be nice to get the accumulated score of every link instead of a simple 'count', to prevent spammers from reaching this list. For that, look at this JIRA issue and at the StatsComponent. As I explained in the JIRA issue, this nice feature could be simulated with the result grouping feature.
4. RSS feeds
If you log in at jetwick.com you'll see this idea implemented. Every user can have different saved searches. For example, I have one search for 'apache solr' and one for 'wikileaks'. Every search can contain additional filters, like only German language, or a sort by retweets. Now the task is to transform that query into a facet query:
- Insert ANDs between the query and all the filter queries
- Remove all date filters
- Add one date filter with the date of the last processed search ('last date')
Then you will see how many new tweets are available for every saved search:
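The three transformation steps could look roughly like the sketch below. It assumes that every filter query is a plain, positive clause, optionally prefixed with local parameters such as {!tag=dt}, and that date filters can be recognized by their field name. The class name, the method names and the lang field are invented for the example; dt is the date field used earlier in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not the Jetwick implementation.
final class SavedSearchFacet {

    /** Removes a leading local-parameter block such as {!tag=dt}. */
    static String stripLocalParams(String filterQuery) {
        if (filterQuery.startsWith("{!")) {
            int end = filterQuery.indexOf('}');
            if (end >= 0) {
                return filterQuery.substring(end + 1);
            }
        }
        return filterQuery;
    }

    /** Combines query and filters into one facet query counting documents newer than lastDate. */
    static String toFacetQuery(String query, List<String> filterQueries,
                               String dateField, String lastDate) {
        List<String> clauses = new ArrayList<>();
        clauses.add("(" + query + ")");
        for (String filterQuery : filterQueries) {
            String plain = stripLocalParams(filterQuery);
            if (plain.startsWith(dateField + ":")) {
                continue; // step 2: remove all date filters
            }
            clauses.add("(" + plain + ")");
        }
        // step 3: one date filter starting at the last processed search
        clauses.add(dateField + ":[" + lastDate + " TO *]");
        // step 1: ANDs between the query and all remaining filter queries
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        List<String> filters = Arrays.asList(
                "lang:de",
                "{!tag=dt}dt:[2010-12-04T00:00:00Z TO 2010-12-05T00:00:00Z]");
        // prints (apache solr) AND (lang:de) AND dt:[2010-12-07T12:00:00Z TO *]
        System.out.println(toFacetQuery("apache solr", filters, "dt", "2010-12-07T12:00:00Z"));
    }
}
```

Each resulting string would be added with addFacetQuery, one per saved search, so that a single request returns the counts for all of them. Purely negative filters such as -lang:de would need extra handling, because wrapping them in parentheses changes their meaning for the query parser.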
Update: no need to click refresh to see the counts. The count update is done in the background via JavaScript.
Conclusion
There are a lot of applications for faceted search, and facets are very convenient to use. Okay, the 'local parameter hack' is a bit daunting, but hey: it works.
It is nice that I can specify different facets for every query in Solr. With that feature you can generate personalized facets, as explained under "RSS feeds".
One improvement for the facets implemented in Solr could be a feature which does not calculate the count but instead sums up a fieldA for documents with the same value in fieldB, or even returns the score for a facet or a facet query. This would improve the use case "Trending keywords or links".
from http://karussell.wordpress.com/2010/12/08/use-cases-of-faceted-search-for-apache-solr/