Options to Tune Document’s Relevance in Solr
While I was working at Lucid Imagination, a customer once asked me how they could modify the score of documents in Solr so that the most relevant results appear higher in the results list. While trying to answer the question, I realized that there are many different options and that not all of them are easy to understand, so I wrote some notes summarizing the most commonly used ones. I have been asked the same question many times since, so I decided to turn those notes into a blog post.
There are two stages where documents can be boosted: at index time and at query time.
Originally Authored by Tomás Fernández Löbbe
At Index Time
This is probably the simplest way, because there are not too many options. It is also the most static way of adding boosts, as changing the boost for a document requires re-indexing it.
When updating documents using the XMLUpdateRequestHandler, the way to
boost a document is to add the optional attribute “boost” to the doc
element. When using SolrJ, the way to do it is by using the method
document.setDocumentBoost(x);
The default boost is 1, so setting a value between 0 and 1 lowers the document's score relative to other documents.
It is also possible to add different boosts to different fields of a
document. The only requirement here is that the boosted fields must
store the norms (“omitNorms” attribute in the schema must be set to
“false”). The way of applying the boosts when using the
XMLUpdateRequestHandler is similar to boosting the whole document, but
instead of adding the “boost” attribute to the doc element, add it to
the field element. When using SolrJ:
document.addField("title", "Foo Bar", x);
It’s important to know that the boost (either for a document or for a field) will be considered when calculating the final score for a document given a search. It is not the final score of the document. Boosting documents is not the same as sorting documents.
At Query Time
Boosting at query time is a little different from boosting at index time. It is much more dynamic, as it doesn’t require re-indexing and can be specified with every new request to Solr. Also, what gets boosted is not a document or a field, but a subquery of the search. The simplest way to achieve query time boosting is by using the ^ character plus the boost number in the query, for example (an illustrative query; the field names are assumed):
title:foo^5 OR content:foo
Much more complex expressions can also be used for query time boosting, like:
title:(foo bar)^5 OR content:(foo bar)^2 OR foo OR bar
title:(foo bar)^5 OR title:"foo bar"^20 OR …
The syntax can be very simple for simple cases, but it will get more and more complex with more complex use cases.
The above syntax is Lucene’s query syntax. It is supported by the
Lucene Query Parser and the Extended Dismax Query Parser, but not by the
Dismax Query Parser.
However, this syntax requires having an expert user who knows how to use it, or some application logic to inject it in the background after the user enters the query and before sending it to Solr. Dismax provides other alternatives for query time boosting, as dynamic as the previous one, but with a much easier syntax (all of them also supported by Extended Dismax).
Query Time Boosting with the Dismax Query Parser
The Dismax Query Parser (QP) will create a query that will be executed on many different fields, even if the user hasn’t specified any. This is one of the most important improvements of the Dismax QP over the Lucene QP. But sometimes, not all the fields have the same importance. Sometimes, a hit on the title field is more important than a hit on the content field, or a hit on the content can be more important than a hit on the comments field. The Dismax Query Parser provides the ability to consider some fields more important than others with the “qf” (named after “query fields”) parameter, the same that is used for specifying the different fields on which to execute the user query. A common value for this parameter could be:
qf=title^5 content^2 comments^0.5
This will translate a user query like “foo bar” into something similar to:
title:(foo bar)^5 OR content:(foo bar)^2 OR comments:(foo bar)^0.5
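The expansion above can be sketched in a few lines of plain Java. This is only an illustration of the simplified OR form shown in this post, not how Solr builds the query internally (the real parser builds per-term disjunction-max queries); the class and method names (QfExpansion, parseQf, expand) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: mimics the simplified expansion shown in the text.
final class QfExpansion {

    // Parses a qf value such as "title^5 content^2" into field -> boost.
    // The boost is kept as text so "0.5" is printed back unchanged;
    // a field without an explicit boost gets "1".
    static Map<String, String> parseQf(String qf) {
        Map<String, String> boosts = new LinkedHashMap<>();
        for (String token : qf.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int caret = token.indexOf('^');
            if (caret < 0) {
                boosts.put(token, "1");
            } else {
                boosts.put(token.substring(0, caret), token.substring(caret + 1));
            }
        }
        return boosts;
    }

    // Builds one boosted clause per field and joins the clauses with OR.
    static String expand(String userQuery, Map<String, String> boosts) {
        StringJoiner clauses = new StringJoiner(" OR ");
        for (Map.Entry<String, String> e : boosts.entrySet()) {
            clauses.add(e.getKey() + ":(" + userQuery + ")^" + e.getValue());
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("foo bar", parseQf("title^5 content^2 comments^0.5")));
    }
}
```

Running it with the qf value from the example prints the same expanded query shown above.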
In the same way as with the query fields, the Dismax Query Parser will also execute the user query as a phrase query on the fields listed in the “pf” (phrase fields) parameter. As with the qf parameter, a different boost can be specified for each of the phrase fields:
pf=title^20 content^10
This will translate a user query like foo bar into:
title:"foo bar"^20 OR content:"foo bar"^10
This phrase query is used only to boost documents that already match the original query; it does not add new documents to the result set.
Sometimes it is necessary to boost some documents regardless of the user query. A typical example is boosting sponsored documents. The user searches for “car rental”, but the application has some sponsored documents that should be boosted. A good way of doing this is by using boost queries. A boost query is a query that is executed in the background after a user query, and that boosts the documents that match it.
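For this example, the boost query (specified by the “bq” parameter) would be something like the following, where “sponsored” is a hypothetical boolean field used here only for illustration:
bq=sponsored:true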
The boost query won’t determine which documents are considered a hit and which are not; it will just influence the score of the results.
Boost functions are very similar to boost queries; in fact, they can achieve the same goals. The difference is that a boost function is an arbitrary function instead of a query (see http://lucidworks.lucidimagination.com/display/solr/Function+Queries). A typical example of boost functions is boosting documents that are more recent than others. Imagine a forum search application, where the user is searching for forum entries with the text “foo bar”. The application should display all the forum entries that talk about “foo bar”, but usually the most recent entries are more important (most users will want to see up-to-date entries, not historical ones). The boost function is executed in the background after each user query, and boosts the matching documents according to the value it computes for each of them.
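For this example, a boost function (specified by the “bf” parameter) could be something like the following, where “date” is a hypothetical date field used here only for illustration:
bf=recip(ms(NOW,date),3.16e-11,1,1)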
As with boost queries, this function will not determine which documents are a hit and which are not; it will just add to their score.
A note on boost functions: boost functions can also be used with the Lucene QP by using the “_val_” special key inside the query.
The “tie” (tie breaker) parameter is very important, but not easy to understand. First, it is important to understand what a dismax query is (http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/). With DisMax queries, the different terms of the user input are executed against different fields. If a term hits in several fields of the same document, the hit with the highest score is used. But what happens with the other subqueries that hit in that document for the term? That is what the “tie” parameter defines. DisMax calculates the score for a term query as:
score = [score of the top scoring subquery] + tie * (sum of the scores of the other hitting subqueries)
Consequently, the “tie” parameter is a value between 0 and 1 that defines whether DisMax considers only the highest-scoring hit for a term (tie=0), all the hits for a term (tie=1), or something between those two extremes.
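To make the formula concrete, here is a minimal Java sketch that applies it to a handful of subquery scores. The class name DisMaxTie and the sample scores are made up for illustration, and the sketch assumes non-negative scores; it is a worked example of the arithmetic, not Solr code.

```java
// Hypothetical worked example of the tie formula given in the text.
final class DisMaxTie {

    // score = max + tie * (sum of the other subquery scores).
    // Assumes all scores are non-negative.
    static double score(double tie, double... subqueryScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : subqueryScores) {
            max = Math.max(max, s);
            sum += s;
        }
        // "sum - max" is the total of every subquery except the top one.
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        // Made-up scores for the same term hitting title, content and comments.
        double[] hits = {2.0, 1.0, 0.5};
        System.out.println("tie=0.0 -> " + score(0.0, hits)); // only the best hit counts
        System.out.println("tie=0.1 -> " + score(0.1, hits)); // other hits count a little
        System.out.println("tie=1.0 -> " + score(1.0, hits)); // all hits are summed
    }
}
```

With these sample scores, tie=0 yields the top score alone (2.0), tie=1 yields the plain sum (3.5), and a small value such as 0.1 lands just above the top score.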
The boost Parameter
The “boost” parameter is very similar to the “bf” parameter, but instead of adding its result to the final score, it multiplies the score by it. It is only available in the “Extended Dismax Query Parser” or the “Lucid Query Parser”.
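The practical difference between adding and multiplying is easiest to see with numbers. The sketch below is plain arithmetic under simplifying assumptions (it ignores query normalization and any other scoring factors); the class name BoostCombination and the sample values are hypothetical.

```java
// Hypothetical arithmetic sketch: additive (bf-style) vs multiplicative (boost-style).
final class BoostCombination {

    // bf-style: the function value is added to the query score.
    static double additive(double queryScore, double functionValue) {
        return queryScore + functionValue;
    }

    // boost-style: the query score is multiplied by the function value.
    static double multiplicative(double queryScore, double functionValue) {
        return queryScore * functionValue;
    }

    public static void main(String[] args) {
        double strongMatch = 10.0; // made-up query scores
        double weakMatch = 1.0;
        double fn = 2.0;           // same function value for both documents

        System.out.println("additive:       " + additive(strongMatch, fn)
                + " vs " + additive(weakMatch, fn));
        System.out.println("multiplicative: " + multiplicative(strongMatch, fn)
                + " vs " + multiplicative(weakMatch, fn));
    }
}
```

With an additive function the same value matters far more to the weak match (1 becomes 3) than to the strong one (10 becomes 12), while a multiplicative boost keeps the ratio between the two documents unchanged (20 vs 2).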
A note on the parameters
All the above parameters can be specified when configuring Solr (in
the solrconfig.xml file) but they can also be changed on each request
just by sending the parameter on the request with the new value.
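As a rough illustration of sending these parameters on a single request, the following Java sketch URL-encodes a set of parameters into a query string. It uses only the JDK; the class name RequestParams and the host and path in the printed URL are assumptions for the example, so adjust them to your own installation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns request parameters into an encoded query string.
final class RequestParams {

    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    // Form-style encoding: spaces become '+', '^' becomes %5E.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax");
        params.put("q", "foo bar");
        params.put("qf", "title^5 content^2 comments^0.5");
        params.put("tie", "0.1");
        // Host and path are assumptions; use the URL of your own Solr instance.
        System.out.println("http://localhost:8983/solr/select?" + toQueryString(params));
    }
}
```

Any parameter sent this way overrides the default configured in solrconfig.xml for that request only.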