Solr Powered ISFDB – Part #10: Tweaking Relevancy

Chris Hostetter · Nov. 27, 2011

This is Part 10 in a series of 11 (so far) articles by Chris Hostetter in 2011 on Indexing and Searching the ISFDB.org data using Solr.

Circumstances have conspired to keep me away from this series longer than I had intended, so today I want to jump right in and talk about improving the user experience by improving relevancy.

(If you are interested in following along at home, you can check out the code from GitHub. I'm starting at the blog_9 tag, and as the article progresses I'll link to specific commits where I changed things, leading up to the blog_10 tag containing the end result of this article.)

Academic vs Practical

In academia, people who study IR have historically discussed "relevancy" in terms of "Precision vs. Recall" (if these terms aren't familiar to you, I highly suggest reading up on them), but in my experience those kinds of metrics are just the starting point. While users tend to care that your "Recall" is good (results shouldn't be missing), "Precision" is usually less important than "ordering": most users (understandably) want the "best" results to come first, and don't care about the total number of results.

Defining the "best" results is where things get tricky. Once again, there are lots of great algorithms out there whose pros and cons academics debate all the time, but frequently the best approach you can take to give your users the "best" results first isn't to get a PhD in IR; it's to "cheat": bias the algorithms and apply "Domain Specific Knowledge". But I'm getting ahead of myself, so let's start with a real example.

Poor Results in Our Domain

Every domain is different, and the key to providing a good search experience is making sure you really understand your domain, and how your users (and data) relate to it.

Let's look at a specific example with our ISFDB data. One of the most famous sci-fi short stories ever written is Nightfall by Isaac Asimov, who later collaborated with Robert Silverberg to expand it into a novel. If a user (who knows they are searching the ISFDB) searched for the word "Nightfall", they would understandably expect one of those two titles to appear fairly high up in the list of results, but that's not quite what they get on page #1 of our search as it's configured right now…


  1. Title: Nightfall INTERIORART – Author: Kolliker
  2. Title: Nightfall: Body Snatchers ANTHOLOGY – Author: uncredited
  3. Title: Cover: Nightfall One COVERART – Author: Ken Sequin
  4. Title: Cover: Nightfall One COVERART – Author: Ken Sequin
  5. Title: Nightfall SHORTFICTION – Author: Tom Chambers
  6. Title: Glossary (Nightfall at Algemron) ESSAY – Author: uncredited
  7. Title: Cover: Nightfall One COVERART – Author: Ken Sequin
  8. Title: Cover: Nightfall Two COVERART – Author: Ken Sequin
  9. Title: Nightfall POEM – Author: Susan A. Manchester
  10. Title: Cover: Nightfall One COVERART – Author: Ken Sequin

These results aren’t terrible surprising since so far in this series we’ve put no work into relevancy tuning, we’re just searching a simple “catchall” field. Before we can improve the situation, it’s important to make sure we understand why we’re getting what we’re getting and why we’re not getting what we want.

Score Explanations

One of the hardest-to-understand features of Solr is "Score Explanation": not because it's hard to use, but because the output really assumes you understand the core underpinnings of Lucene/Solr scoring. When we enable debugging on our query, we get a new "toggle explain" link for each result that lets us see the score and a breakdown of how that score was computed, but that doesn't let us compare with documents that aren't on page #1. To do that, we use the explainOther option and switch to the XML view, since the velocity templates don't currently display explainOther info. Now we can compare the explanations between the two docs we really hoped to find and the top-scoring result…
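For reference, a request along these lines produces the explanations shown below. This is just a sketch: the host, handler path, and unique-key field name ("id") are placeholders for whatever the earlier parts of this series set up, and curl's --data-urlencode handles the escaping:

    curl -G "http://localhost:8983/solr/select" \
         --data-urlencode "q=nightfall" \
         --data-urlencode "debugQuery=true" \
         --data-urlencode "explainOther=id:(TITLE_11852 TITLE_46434)" \
         --data-urlencode "wt=xml"

debugQuery=true turns on the per-result score explanations, explainOther asks Solr to also explain any documents matching the given query (here, the two titles we hoped to see), and wt=xml gives us the raw XML view.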

  • TITLE_847094 (Nightfall INTERIORART)
    2.442217 = (MATCH) fieldWeight(catchall:nightfall in 274241), product of:
      1.0 = tf(termFreq(catchall:nightfall)=1)
      9.768868 = idf(docFreq=98, maxDocs=636658)
      0.25 = fieldNorm(field=catchall, doc=274241)

  • TITLE_11852 (Nightfall NOVEL)
    1.7269082 = (MATCH) fieldWeight(catchall:nightfall in 11741), product of:
      1.4142135 = tf(termFreq(catchall:nightfall)=2)
      9.768868 = idf(docFreq=98, maxDocs=636658)
      0.125 = fieldNorm(field=catchall, doc=11741)

  • TITLE_46434 (Nightfall SHORTFICTION)
    1.7269082 = (MATCH) fieldWeight(catchall:nightfall in 41784), product of:
      1.4142135 = tf(termFreq(catchall:nightfall)=2)
      9.768868 = idf(docFreq=98, maxDocs=636658)
      0.125 = fieldNorm(field=catchall, doc=41784)



The devil is in the differences, specifically the tf and fieldNorm factors. Without going into a lot of complicated explanation, the crux of the issue is that even though the documents we're looking for match the word "nightfall" twice in the catchall field we're searching (and the top-scoring result only matches once), that is offset by the "fieldNorm", which reflects the fact that the catchall field is much longer for our "good" docs than for our "bad" docs.
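Plugging the numbers from those explanations into the products makes the trade-off concrete:

    Nightfall INTERIORART:  1.0       × 9.768868 × 0.25  = 2.442217
    Nightfall NOVEL:        1.4142135 × 9.768868 × 0.125 = 1.7269082

Matching "nightfall" twice only buys a tf factor of sqrt(2) ≈ 1.414, while the longer catchall field cuts the fieldNorm in half (0.125 vs. 0.25), so the one-match artwork record still wins.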

Tweaking Our Scoring

This is one of those cases where academic theory doesn't always match the reality of your domain. Typically, when using the TF/IDF scoring model used in Lucene/Solr, you need a "length normalization" factor to offset the common case where a really long document inherently contains more words, so there is a statistical likelihood that the search terms will appear more times. In a nutshell: all other things being equal, shorter is better. This reasoning is generally sound, but the default implementation in Lucene/Solr can be a hindrance in a few common cases:

  • A corpus full of really short documents – our ISFDB index isn’t full books, just a bunch of metadata fields
  • A corpus where longer really is better – in the ISFDB data, more popular titles/authors tend to have more data, which means the catchall field is naturally longer.

There are some cool things we could do by tweaking the Similarity class to try to improve this, but the simplest thing to start with is to set omitNorms on the catchall field to eliminate this factor from our scoring (a sketch of the schema change follows the results below). With our new schema, we re-index and see some noticeable changes…

  1. Title: Nightfall NOVEL – Authors: Robert Silverberg, Isaac Asimov
  2. Title: Nightfall and Other Stories COLLECTION – Author: Isaac Asimov
  3. Title: Nightfall SHORTFICTION – Author: Isaac Asimov
  4. Title: The Legend of Nightfall NOVEL – Author: Mickey Zucker Reichert
  5. Title: Nightfall NOVEL – Author: John Farris
  6. Title: The Road to Nightfall COLLECTION – Author: Robert Silverberg
  7. Title: Road to Nightfall SHORTFICTION – Author: Robert Silverberg
  8. Title: A Tiger at Nightfall SHORTFICTION – Author: Harlan Ellison
  9. Title: Nightfall SHORTFICTION – Author: David Weber
  10. Title: Nightfall SHORTFICTION – Author: Charles Stross
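For reference, the schema change behind these results is a single attribute on the field definition in schema.xml. A minimal sketch, assuming the catchall field is declared something like the following (the type and other attributes here are placeholders; keep whatever your existing declaration uses and just add omitNorms):

    <!-- schema.xml: drop length normalization for the catchall field (sketch) -->
    <field name="catchall" type="text" indexed="true" stored="false"
           multiValued="true" omitNorms="true"/>

Since norms are written at index time, the change only takes effect after re-indexing, which is why the re-index step above was necessary.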

Domain Specific Biases

Omitting length norms has helped "level the field" for our docs, and in this one example it looks like a huge improvement at first glance, but that's mainly a fluke. If you look at the score explanations now, we get a lot of identical scores, and the final ordering is determined primarily by the order in which the documents were indexed.

This is where adding some Domain Specific Bias can be handy. If we review our schema, we see the views and annualviews fields, which correspond to how many page views a given author/title has received (recently) on the ISFDB web site. By factoring these page view counts into our scoring, we provide some "Document Biasing" to ensure that documents which are more popular will "win" (i.e., score higher) in the event of a tie on the basic relevancy score.

The most straightforward way to bias scoring is with the BoostQParser, which will multiply the score of a query for each document by an arbitrary function (of that document). In its simplest form, we can use it directly in our q param to multiply the scores by the simple sum of the two "views" fields: q={!boost b=sum(views,annualviews)}nightfall (a sketch of the full request follows the results below), and now we get a much more interesting ordering…

  1. Title: Nightfall SHORTFICTION – Author: Isaac Asimov
  2. Title: Nightfall NOVEL – Authors: Robert Silverberg, Isaac Asimov
  3. Title: Nightfall and Other Stories COLLECTION – Author: Isaac Asimov
  4. Title: Nightfall SHORTFICTION – Author: Charles Stross
  5. Title: The Return: Nightfall NOVEL – Author: L. J. Smith
  6. Title: Nightfall SHORTFICTION – Author: Arthur C. Clarke
  7. Title: Nightfall SHORTFICTION – Author: David Weber
  8. Title: The Road to Nightfall COLLECTION – Author: Robert Silverberg
  9. Title: The Legend of Nightfall NOVEL – Author: Mickey Zucker Reichert
  10. Title: Nightfall Revisited ESSAY – Authors: Pat Murphy, Paul Doherty
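For completeness, the full request for that ordering looks roughly like this; the curly braces and function syntax need URL encoding, so letting curl handle it is convenient (the host and handler path are placeholders for your setup):

    curl -G "http://localhost:8983/solr/select" \
         --data-urlencode "q={!boost b=sum(views,annualviews)}nightfall" \
         --data-urlencode "wt=xml"

The {!boost ...} prefix is a local-params invocation of the BoostQParser: the text after the closing brace is parsed as the wrapped query, and each document's score is multiplied by sum(views,annualviews).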

This new ordering for the page #1 results is much more appropriate for the domain of the ISFDB, and it represents a general rule of relevancy biasing: users usually want to see the popular stuff. However, users don't usually want to have to type things like {!boost b=sum(views,annualviews)}... into the search box, so we need to encapsulate this in our config. It's very easy to do this using Local Params, but unfortunately it does mean changing our "main" query param from q to something else.

We start by changing the defaults and invariants of our request handler so that our boost function is always used as the q param, while a new qq param carries the main query (whose score will be multiplied by the function). This works fine for our default query, but to be useful, our UI also needs to be changed so it knows that the qq param now carries the user input.
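A minimal sketch of what that might look like in solrconfig.xml; the handler name is a placeholder, the *:* default for qq is an assumption, and the v=$qq local param is what pulls the user's query into the boost wrapper:

    <!-- solrconfig.xml: always wrap the user's query (qq) in the popularity boost -->
    <requestHandler name="/isfdb" class="solr.SearchHandler">
      <lst name="invariants">
        <!-- q is fixed: boost whatever qq contains by the page-view sum -->
        <str name="q">{!boost b=sum(views,annualviews) v=$qq}</str>
      </lst>
      <lst name="defaults">
        <!-- fallback query when the UI sends no qq (assumption) -->
        <str name="qq">*:*</str>
      </lst>
    </requestHandler>

With something like this in place, the search UI just sends the user's input as qq instead of q, and every request gets the document bias automatically.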

Conclusion (For Now)

And that wraps up this latest installment with the blog_10 tag. We've dramatically improved the user experience by tweaking how our relevancy scores are computed, based on some knowledge of our domain, particularly via Document Biases. In my next post, I hope to continue the topic of improving the user experience by using DisMax to add "Field Biases".



Source: http://www.lucidimagination.com/blog/2011/06/20/solr-powered-isfdb-part-10/