One of my clients has some pretty heavy-duty requirements for boosting functions. It’s actually right on the boundary of what I think is appropriate for Solr. BUT, while I choose to continue within the bounds of Solr, I might as well expect the boosting functions to be as readable and well-organized as possible. So let’s take a look at my strategy.
Let’s say that we’re Amazon, and we’re allowing our users to search over books. But rather than just return the books based upon straight TF-IDF search, we need to control the boosting behavior to guide users towards newer books and books with a higher margin. The text of the book is stored in the text field, and the margin and release date of the books are stored in the corresponding margin and release_date fields.
The problem is that the syntax that we must assemble to create such a query is utterly unwieldy. Allow me to demonstrate:
<requestHandler name="/booksearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">text</str>
    <str name="pf">text</str>
    <str name="boost">sum(product(margin,0.34),product(div(1,ms(NOW,release_date)),1100))</str>
  </lst>
</requestHandler>
Check out that boost parameter. Can you tell what it’s doing? Well, can you? (I’m pausing to let you try and figure it out.) Yeah… so the answer’s no. And as a matter of fact, I can’t tell what it does either – and I just wrote it. What’s more, if your eyeballs are a little better than mine at reading this stuff, you’ll notice that there are some hardwired constants in this equation: 0.34, and 1100. What do these do? Beats me! But they must be important, so let’s never ever touch them ever again.
I think I’ve made a good case for the problem. This type of function munging leads to brittle, inscrutable, and unchangeable configuration. Let’s take another swing at it!
Here’s my second attempt. Take a moment to read over it and see what you think.
<requestHandler name="/booksearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">text</str>
    <str name="pf">text</str>
    <str name="boost">$totalBoost</str>
    <str name="totalBoost">sum($marginBoost,$recencyBoost)</str>
    <str name="marginBoost">product(margin,$valMarginBoost)</str>
    <str name="recencyBoost">product($inverseRecency,$valRecencyBoost)</str>
    <str name="inverseRecency">div(1,ms(NOW,release_date))</str>
    <!-- Tunable constants -->
    <str name="valMarginBoost">0.34</str>
    <str name="valRecencyBoost">1100</str>
  </lst>
</requestHandler>
So the first thing that you might notice is that it’s a little more verbose than the previous request handler, but I maintain that this verbosity is actually incredibly helpful. Because now, you can almost read this configuration as if it’s explaining to you exactly what it’s doing.
YOU: How is the total boost formed?
MR. REQUEST HANDLER: Oh, well it’s the sum of the margin boost and the recency boost. Duh!
YOU: Yeah, well what’s the margin boost?
MR. REQUEST HANDLER: Simple! We just multiply the value stored in the margin field by the constant called valMarginBoost.
YOU: Oh… so I can just modify valMarginBoost and change how important the margin is in the results?
MR. REQUEST HANDLER: Bingo!
Personally, I don’t like Handler’s tone, but he’s right: this is a lot easier to read, and therefore to maintain and modify. The labeling of the functional pieces makes it easier to keep track of everything and understand how each piece builds up to the total boost. The ordering of the named pieces is also important. I made sure that the definition of each piece is located just below the place where it is first mentioned. The only exception is the section at the bottom, where I’ve placed the constants that the content curator or merchandising expert can fiddle with. As a result, there are no longer any magic constants buried inside the boost function.
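To get a feel for what the two constants actually do, it can help to replay the arithmetic outside of Solr. The following is only a rough sketch in Python, not anything Solr runs: the function name, the margin of 10, and the 30-day-old book are all made up for illustration, and only the structure of the formula comes from the request handler.

```python
# Hypothetical helper mirroring the named pieces of the boost.
# Only the formula structure comes from the request handler; the inputs are invented.
MS_PER_DAY = 24 * 60 * 60 * 1000


def total_boost(margin, age_ms, val_margin_boost=0.34, val_recency_boost=1100):
    """Recompute the boost by hand, one named piece at a time."""
    inverse_recency = 1 / age_ms                          # div(1,ms(NOW,release_date))
    margin_boost = margin * val_margin_boost              # product(margin,$valMarginBoost)
    recency_boost = inverse_recency * val_recency_boost   # product($inverseRecency,$valRecencyBoost)
    return margin_boost + recency_boost                   # sum($marginBoost,$recencyBoost)


# A made-up book: margin of 10, released 30 days ago.
age = 30 * MS_PER_DAY
print(total_boost(margin=10, age_ms=age))
print(total_boost(margin=10, age_ms=age, val_margin_boost=0.5))
```

With these invented inputs the recency piece comes out several orders of magnitude smaller than the margin piece, because the age is measured in milliseconds. That is exactly the kind of thing that is easy to spot, and to adjust, once the constants have names.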
Shameless Plug for Quepid
Content curators and merchandising experts: now that the search team has built up the Solr request handler and exposed the tunable parameters, it’s your job to find the perfect values for these parameters. This is hard! Why? Because you might find that the perfect parameter values for your top product are actually the worst possible configuration for all other products. And it’s hard to know this without looking at all those queries at once.
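Because the tunable values live in the handler’s defaults, a request can override them by passing a parameter of the same name. Here is a minimal sketch in Python of trying one candidate setting across several queries at once. The host, the collection name (books), the helper name, and the sample queries are all assumptions made for illustration; only the /booksearch handler and the two parameter names come from the configuration above.

```python
from urllib.parse import urlencode

# Hypothetical host and collection; only "/booksearch" comes from the config.
BASE_URL = "http://localhost:8983/solr/books/booksearch"


def trial_urls(queries, **overrides):
    """Build one request URL per query, overriding the tunable constants."""
    return [BASE_URL + "?" + urlencode({"q": q, **overrides}) for q in queries]


# Try a single candidate setting against a handful of made-up queries.
urls = trial_urls(
    ["solr in action", "lucene"],
    valMarginBoost=0.5,
    valRecencyBoost=1100,
)
for url in urls:
    print(url)
```

Fetching each URL and comparing the result lists side by side is then a matter of whatever HTTP client and eyeballs you have on hand.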